
The article by S. Koltsov, A. Surkov, V. Filippov, and V. Ignatenko has been accepted for publication in the journal PeerJ Computer Science

The work "Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics" was prepared within the framework of the QTM project "Improving the Methodology of Automatic Text Analysis".

Topic modeling is a widely used tool for the analysis of large text collections. In recent years, neural topic models and models with word embeddings have been proposed to improve the quality of topic solutions. However, these models have not been extensively tested in terms of stability and interpretability, and selecting the number of topics (a model parameter) remains a challenging task. We aim to partially fill this gap by testing four well-known and widely available topic models: the embedded topic model (ETM), the Gaussian Softmax distribution model (GSM), Wasserstein autoencoders with a Dirichlet prior (W-LDA), and Wasserstein autoencoders with a Gaussian mixture prior (WTM-GMM). We demonstrate that W-LDA, WTM-GMM, and GSM exhibit poor stability, which complicates their application in practice. The ETM model with additionally trained embeddings demonstrates high coherence and rather good stability on large datasets, but the question of the number of topics remains open for this model. We also propose a new topic model based on granulated sampling with word embeddings (GLDAW), demonstrating the highest stability and good coherence compared to the other considered models.
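
The two evaluation axes named in the abstract, coherence and stability, can be illustrated with a minimal sketch. The snippet below is not the paper's pipeline: it uses a plain gensim LDA model as a stand-in (the neural models ETM, GSM, W-LDA, WTM-GMM and the proposed GLDAW are not reproduced here), an invented toy corpus, and the u_mass coherence measure, which may differ from the coherence variant used in the article.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Toy corpus standing in for a large text collection.
texts = [
    ["topic", "model", "text", "analysis", "collection"],
    ["neural", "network", "embedding", "topic", "model"],
    ["stability", "coherence", "topic", "number", "model"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

def top_words(model, topic_id, k=5):
    # Top-k word labels of a topic, as a set for overlap comparison.
    return {word for word, _ in model.show_topic(topic_id, topn=k)}

# Coherence: train one model and score its topics (u_mass is one common
# measure; the article may use a different coherence variant).
lda_a = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)
coherence = CoherenceModel(model=lda_a, corpus=corpus, dictionary=dictionary,
                           coherence="u_mass").get_coherence()
print("coherence (u_mass):", coherence)

# Stability: retrain with another seed and compare topics' top-word sets
# via Jaccard overlap. Matching topics by index is naive; proper stability
# measures first align topics between runs, e.g. by best pairwise overlap.
lda_b = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=1)
for t in range(2):
    a, b = top_words(lda_a, t), top_words(lda_b, t)
    print(f"topic {t} Jaccard overlap: {len(a & b) / len(a | b):.2f}")
```

Low Jaccard overlap across reruns signals the kind of instability the abstract reports for W-LDA, WTM-GMM, and GSM, while consistently high overlap is the behavior claimed for GLDAW.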