The article by S. Koltsov, A. Surkov, V. Filippov, and V. Ignatenko has been accepted for publication in the journal PeerJ Computer Science.
The work "Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics" was prepared within the framework of the QTM project "Improving the Methodology of Automatic Text Analysis".
Topic modeling is a widely used tool for analyzing large text collections. In recent years, neural topic models and models with word embeddings have been proposed to improve the quality of topic solutions. However, these models have not been extensively tested for stability and interpretability. Moreover, selecting the number of topics (a model parameter) remains a challenging task. We aim to partially fill this gap by testing four well-known topic models that are available to a wide range of users: the embedded topic model (ETM), the Gaussian Softmax distribution model (GSM), Wasserstein autoencoders with a Dirichlet prior (W-LDA), and Wasserstein autoencoders with a Gaussian Mixture prior (WTM-GMM). We demonstrate that W-LDA, WTM-GMM, and GSM possess poor stability, which complicates their application in practice. The ETM model with additionally trained embeddings demonstrates high coherence and rather good stability for large datasets, but the question of the number of topics remains unsolved for this model. We also propose a new topic model based on granulated sampling with word embeddings (GLDAW), demonstrating the highest stability and good coherence compared to other considered