Search | VHL Regional Portal

Pseudo-document simulation for comparing LDA, GSDMM and GPM topic models on short and sparse text using Twitter data.

Weisser, Christoph; Gerloff, Christoph; Thielmann, Anton; Python, Andre; Reuter, Arik; Kneib, Thomas; Säfken, Benjamin.

Comput Stat ; 38(2): 647-674, 2023.

Article in English | MEDLINE | ID: mdl-37223721

ABSTRACT

Topic models are a useful and popular method to find latent topics of documents. However, the short and sparse texts in social media micro-blogs such as Twitter are challenging for the most commonly used Latent Dirichlet Allocation (LDA) topic model. We compare the performance of the standard LDA topic model with the Gibbs Sampler Dirichlet Multinomial Model (GSDMM) and the Gamma Poisson Mixture Model (GPM), which are specifically designed for sparse data. To compare the performance of the three models, we propose the simulation of pseudo-documents as a novel evaluation method. In a case study with short and sparse text, the models are evaluated on tweets filtered by keywords relating to the Covid-19 pandemic. We find that standard coherence scores that are often used for the evaluation of topic models perform poorly as an evaluation metric. The results of our simulation-based approach suggest that the GSDMM and GPM topic models may generate better topics than the standard LDA model.
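The abstract does not say how the pseudo-documents are simulated. As a rough illustration only, the sketch below assumes one plausible reading: short documents are drawn from a mixture of multinomials with a single topic per document, so the true topic of every document is known and a fitted model's assignments could be compared against it. The function name, the toy topic-word matrix and all parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np


def simulate_pseudo_documents(topic_word, topic_weights, n_docs, doc_length, seed=0):
    """Draw short pseudo-documents from a mixture of multinomials.

    Assumption (not from the paper): each document has exactly one topic.
    Returns the true topic label of each document and a document-term
    count matrix of shape (n_docs, vocab_size).
    """
    rng = np.random.default_rng(seed)
    n_topics = topic_word.shape[0]
    # One latent topic per document, drawn from the mixture weights.
    labels = rng.choice(n_topics, size=n_docs, p=topic_weights)
    # Word counts for each document, drawn from its topic's word distribution.
    counts = np.vstack([rng.multinomial(doc_length, topic_word[k]) for k in labels])
    return labels, counts


# Hypothetical toy setup: 2 topics over a 6-word vocabulary.
topic_word = np.array([
    [0.4, 0.4, 0.1, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.1, 0.4, 0.4],
])
topic_weights = np.array([0.5, 0.5])

labels, counts = simulate_pseudo_documents(
    topic_word, topic_weights, n_docs=200, doc_length=8
)
```

Because `labels` holds the ground truth, any topic model fitted to `counts` could in principle be scored by how well its document-level assignments recover those labels, which avoids relying on coherence scores.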

Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling.

Thielmann, Anton; Weisser, Christoph; Krenz, Astrid; Säfken, Benjamin.

J Appl Stat ; 50(3): 574-591, 2023.

Article in English | MEDLINE | ID: mdl-36819086

ABSTRACT

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. Unsupervised one-class document classification with the integration of out-of-domain training data is achieved, and >80% of the target data is correctly classified. The proposed method thus even outperforms common machine learning classifiers and is validated on multiple data sets.
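The two modelling stages named in the abstract can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' pipeline: the scraping step is replaced by a hard-coded list standing in for out-of-domain reference text, the toy documents and all hyperparameters are hypothetical, and the abstract does not specify how the SVM and LDA outputs are combined, so the two stages are simply run side by side.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical stand-in for scraped, out-of-domain text about the target class.
reference_docs = [
    "robot arm automation factory assembly",
    "industrial robot welding automation line",
    "factory automation sensors and robot control",
    "assembly line robot programming",
]

# Hypothetical unlabelled documents to be classified.
unlabeled_docs = [
    "new robot automation for the assembly factory",
    "fresh bread and pastry from the bakery",
    "robot welding line installed in factory",
    "bakery offers pastry and coffee",
]

# Stage 1: one-class SVM trained only on the reference (target-class) text.
tfidf = TfidfVectorizer()
X_ref = tfidf.fit_transform(reference_docs)
svm = OneClassSVM(kernel="linear", nu=0.1).fit(X_ref)
# predict() returns +1 for documents judged similar to the reference class, -1 otherwise.
svm_pred = svm.predict(tfidf.transform(unlabeled_docs))

# Stage 2: LDA topic mixtures for the unlabelled documents.
bow = CountVectorizer()
X_bow = bow.fit_transform(unlabeled_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X_bow)  # one row per document, rows sum to 1

for doc, pred, mix in zip(unlabeled_docs, svm_pred, doc_topics):
    print(pred, mix.round(2), doc)
```

No manual labels are used at any point: the SVM learns only from the reference text, and LDA is fully unsupervised. A combined decision rule, for example one based on agreement between the two stages, would be a further design choice that the abstract leaves unspecified.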
