Topic Modeling: Comparison of LSA and LDA on Scientific Publications
In: 2021 4th International Conference on Data Storage and Data Engineering, 2021-02-18
Our so-called information society is characterized by an overabundance of information resulting from growing digitization. Most of this information takes the form of unlabeled text that cannot be strictly attributed to a thematic domain, which makes it difficult to establish a thematic view of a data collection. It is therefore useful to resort to unsupervised algorithms to tackle the topic modeling problem. Topic modeling analyzes text to capture the sense of terms with respect to their contexts in natural language. In this paper, we conduct an empirical comparative study of two important topic modeling approaches, latent semantic analysis (LSA) and latent Dirichlet allocation (LDA). On the one hand, the specificity of this work lies in the use of a corpus of scientific publications, given that scientific documents use very specialized vocabulary; we use the scientific papers of the NIPS (Neural Information Processing Systems) conference. On the other hand, we investigate the effect of bi-gram collocation and lemmatization on LSA and LDA. The obtained results, in terms of running time and topic coherence (Cv and UMass), are in favor of LDA. Lemmatization positively influences the Cv coherence, whereas the UMass coherence is insensitive to bi-grams and lemmatization.

As a programming language, we use Python. The NLTK (Natural Language Toolkit) library [1] is used for natural language processing tasks such as tokenization, stemming and stop-word removal. For topic modeling, we use Gensim, a robust and scalable open-source library. It performs various tasks such as building document representations and corpora and performing topic identification.
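The abstract reports UMass coherence without defining it. As a rough illustration only, the sketch below computes the UMass score in its commonly cited form: for the top words of one topic, sum log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, where D counts the documents containing the given words. The function name `umass_coherence`, the `eps` parameter and the toy corpus are assumptions made for this example and are not taken from the paper; library implementations such as Gensim's may smooth or aggregate the pair scores differently.

```python
import math


def umass_coherence(top_words, documents, eps=1.0):
    """Unnormalized UMass coherence of one topic (illustrative sketch).

    top_words: topic words, ordered from most to least probable.
    documents: list of tokenized documents (lists of strings).
    eps: smoothing constant added to the co-document count (assumed 1.0).
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            w_i, w_j = top_words[i], top_words[j]
            d_j = doc_freq(w_j)
            if d_j == 0:
                continue  # word absent from the corpus: pair is skipped
            score += math.log((doc_freq(w_i, w_j) + eps) / d_j)
    return score


# Hypothetical toy corpus, already tokenized and lowercased.
docs = [
    ["neural", "network", "training"],
    ["neural", "network", "gradient"],
    ["bayesian", "inference", "prior"],
    ["bayesian", "prior", "network"],
]

coherent = umass_coherence(["neural", "network"], docs)
mixed = umass_coherence(["neural", "prior"], docs)
print(coherent > mixed)
```

Scores closer to zero (or positive) indicate that a topic's top words tend to occur in the same documents. In this toy corpus, the pair that always co-occurs scores higher than the pair that never does.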
Title: | Topic Modeling: Comparison of LSA and LDA on Scientific Publications |
---|---|
Author / Contributor: | Bellaouar, Slimane ; Mohammed Mounsif Bellaouar ; Issam Eddine Ghada |
Published in: | 2021 4th International Conference on Data Storage and Data Engineering, 2021-02-18 |
Publication: | ACM, 2021 |
Media type: | unknown |
DOI: | 10.1145/3456146.3456156 |