Improving the Methodology of Automatic Text Analysis
Project leader: Sergei Koltcov
Project participants (at different times): Sergey Nikolenko, Konstantin Vorontsov, Murat Apishev, Vladimir Filippov, Maxim Koltsov, Vera Ignatenko, Anton Surkov
Topic modeling is a promising instrument for computational social science and the digital humanities, as it can automatically reveal the topic structure of large text collections – an immensely important task in the era of big Internet data. However, topic modeling has a number of problems that prevent its efficient use by social scientists, including social media analysts. First, it does not yield reproducible results: solutions fluctuate greatly from one algorithm run to another. Second, it gives no clues about how to optimize its parameters, such as the number of topics and other parameters of the model, and, third, there are no reliable quality metrics that could be used for such optimization or for assessing algorithm performance. A possible solution can be sought in the application of concepts from statistical physics.
This project represents LINIS's ongoing effort to solve these problems.
First, the project tests existing topic modeling quality metrics and seeks to develop new ones. It also develops approaches to metric testing and theoretical concepts of topic modeling quality and ground truth. One of the project's publications proposes a tf-idf coherence metric that performs better than ordinary coherence and generalizes easily from evaluating a single topic to evaluating an entire solution.
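The idea behind tf-idf coherence can be illustrated with a short sketch. Ordinary coherence scores a topic's top words by their raw document co-occurrence counts; the tf-idf variant replaces counts with sums of tf-idf weights, so frequent but uninformative words contribute less. This is a hedged illustration of the general idea, not the project's exact implementation; function names and the smoothing constant are assumptions.

```python
import math
from collections import defaultdict

def tfidf_weights(docs):
    """Per-document tf-idf weight of each word (docs: lists of tokens)."""
    df = defaultdict(int)                       # document frequency
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    n_docs = len(docs)
    weights = []
    for doc in docs:
        tf = defaultdict(int)
        for w in doc:
            tf[w] += 1
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

def tfidf_coherence(top_words, docs, eps=1e-12):
    """Coherence of one topic's top words, with document counts
    replaced by sums of tf-idf weights (simplified sketch)."""
    w = tfidf_weights(docs)
    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            wi, wj = top_words[i], top_words[j]
            # co-occurrence mass: tf-idf of wi over docs containing both words
            co = sum(d[wi] for d in w if wi in d and wj in d)
            single = sum(d[wj] for d in w if wj in d)
            score += math.log((co + eps) / (single + eps))
    return score
```

As with ordinary coherence, higher values indicate that a topic's top words genuinely co-occur; averaging over topics scores the whole solution.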
Second, the project aims to regularize topic modeling algorithms so as to improve their stability. The team has offered various approaches, such as sampling neighbor words from texts (gLDA: granulated LDA), seed-word-based semi-supervised solutions (ISLDA: interval semi-supervised LDA), and experiments with additive regularization of pLSA (in collaboration with Konstantin Vorontsov's team at HSE Moscow). Current work proposes a new topic model with word embeddings (GLDAW), an extension of the gLDA model that demonstrates high stability and good coherence.
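The granulation idea – that nearby words tend to share a topic – can be sketched as follows. Instead of sampling a topic independently for every word, a topic is drawn for an anchor position and copied to its neighbors within a window. This is a simplified illustration of the neighbor-word sampling step only, not the published gLDA algorithm; the `sample_topic` callback (a draw from the current model state) and the window size are assumptions.

```python
import random

def granulated_assignment_pass(doc_tokens, sample_topic, window=2):
    """One granule-based topic assignment pass over a document.

    doc_tokens: list of tokens; sample_topic: assumed callback drawing
    a topic for a word from the current model state.
    """
    n = len(doc_tokens)
    assignments = [None] * n
    positions = list(range(n))
    random.shuffle(positions)              # visit anchors in random order
    for anchor in positions:
        topic = sample_topic(doc_tokens[anchor])
        lo, hi = max(0, anchor - window), min(n, anchor + window + 1)
        for i in range(lo, hi):
            assignments[i] = topic         # neighbors inherit the anchor's topic
    return assignments
```

Tying neighboring words together reduces the number of effectively independent assignments, which is what makes the resulting solutions more stable across runs.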
Third, the project develops methods to efficiently detect the optimal number of topics, given that parameter optimization is a computationally intensive task. The project lays theoretical foundations for greedy algorithms based on concepts from thermodynamics, such as non-extensive entropy and free energy. This approach makes it possible to look at the problem of the ambiguity of stochastic decomposition from a new angle and to formulate the task of topic number optimization as finding the entropy minimum and the information maximum. One of the project's main publications proposes a new metric based on Renyi entropy for determining the optimal number of topics (Koltcov, 2018). This metric was further applied in Koltsov, Ignatenko, Boukhers, and Staab (Entropy, 2020), Koltcov and Ignatenko (Entropy, 2020), and Koltsov, Ignatenko, Terpilowski, and Rosso (PeerJ Computer Science, 2021); full references are given in the publications list below.
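A rough sketch of the entropy-based criterion, following the general formulation in Koltcov (2018): words whose probability in a topic exceeds the uniform level 1/W are treated as informative, and their share and total probability mass define an energy and an entropy, combined into a free energy with deformation parameter q = 1/T. The exact combination and normalization of terms below is a simplifying assumption and may differ from the published metric.

```python
import numpy as np

def renyi_entropy(phi):
    """Approximate Renyi entropy of a fitted topic model (sketch).

    phi: (W, T) matrix whose columns are topic-word distributions.
    """
    W, T = phi.shape
    mask = phi > 1.0 / W                     # "informative" words only
    p_tilde = phi[mask].sum() / T            # their normalized probability mass
    rho = mask.sum() / (W * T)               # density-of-states
    q = 1.0 / T                              # deformation parameter
    energy = -np.log(p_tilde)                # internal energy
    free_energy = energy - np.log(rho) / q   # free energy (simplified)
    return free_energy / (q - 1.0)

# Model selection sketch: fit models for a range of topic numbers and
# keep the T that minimizes the Renyi entropy.
```

The attraction of this criterion is that it needs only the fitted topic-word matrix, so scanning candidate topic numbers reduces to one cheap computation per fitted model.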
Fourth, the project team invests a lot of effort in developing and maintaining TopicMiner, GUI-based research software for topic modeling. By freeing researchers from coding and scripting, the software allows them to concentrate on substantive topic modeling tasks: it lets computer and NLP scientists quickly apply and evaluate various models, and it lets social scientists and humanities scholars efficiently examine and interpret topic modeling results. In its current version, TopicMiner implements basic pLSA, LDA (EM algorithm and Gibbs sampling), BigARTM-based models, a number of quality metrics, and visualization of modeling progress. It also contains a user-friendly preprocessing module and a module for working with output (visualization, scrolling through and sorting millions of documents, and output export).
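For readers unfamiliar with Gibbs-sampled LDA, the core of the standard collapsed Gibbs sampler can be sketched in a few lines. This is a generic textbook illustration, not TopicMiner's actual code; the count-array names and default hyperparameters are assumptions.

```python
import numpy as np

def gibbs_pass(docs, z, n_wt, n_dt, n_t, alpha=0.1, beta=0.01):
    """One sweep of collapsed Gibbs sampling for LDA (minimal sketch).

    docs: lists of word ids; z: current topic of each token;
    n_wt: (W, T) word-topic counts; n_dt: (D, T) doc-topic counts;
    n_t: (T,) topic totals.
    """
    W, T = n_wt.shape
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            # remove the current token from all counts
            n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
            # full conditional p(t | everything else)
            p = (n_wt[w] + beta) / (n_t + W * beta) * (n_dt[d] + alpha)
            t = np.random.choice(T, p=p / p.sum())
            # add it back under the newly sampled topic
            n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
            z[d][i] = t
    return z
```

Repeating such sweeps until the assignments stabilize yields the topic-word and document-topic distributions that the rest of the pipeline (quality metrics, visualization) consumes.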
Download TopicMiner software
Download TopicMiner manual (Russian)
Publications:
- Ignatenko V., Surkov A., Koltcov S. Random forests with parametric entropy-based information gains for classification and regression problems // PeerJ Computer Science. 2024. Vol. 10. Article e1775. https://doi.org/10.7717/peerj-cs.1775
- Koltcov S., Surkov A., Filippov V., Ignatenko V. Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics // PeerJ Computer Science. 2024. Vol. 10. Article e1758. https://doi.org/10.7717/peerj-cs.1758
- Koltsov S., Ignatenko V., Terpilowski M., Rosso P. Analysis and tuning of hierarchical topic models based on Renyi entropy approach // PeerJ Computer Science. 2021. Vol. 7. Article e608.
- Koltcov S., Ignatenko V. Renormalization Analysis of Topic Models // Entropy. 2020. Vol. 22. No. 5. P. 1-23.
- Koltsov S., Ignatenko V., Boukhers Z., Staab S. Analyzing the Influence of Hyper-parameters and Regularizers of Topic Modeling in Terms of Renyi entropy // Entropy. 2020. Vol. 22. No. 4. P. 1-13.
- Koltsov S., Ignatenko V., Koltsova O. Estimating Topic Modeling Performance with Sharma–Mittal Entropy // Entropy. 2019. Vol. 21. No. 7. Article 660. https://doi.org/10.3390/e21070660
- Koltcov S. Application of Rényi and Tsallis entropies to topic modeling optimization // Physica A: Statistical Mechanics and Its Applications. 2018. Vol. 512. P. 1192-1204. https://doi.org/10.1016/j.physa.2018.08.050
- Ignatenko V., Koltcov S., Staab S., Boukhers Z. Fractal approach for determining the optimal number of topics in the field of topic modeling // Journal of Physics: Conference Series. 2019. Vol. 1163. No. 1. P. 1-6. https://doi.org/10.1088/1742-6596/1163/1/012025 Download preprint version
- Koltcov S. N. A thermodynamic approach to selecting a number of clusters based on topic modeling // Technical Physics Letters. 2017. Vol. 43. No. 6. P. 584-586.
- Koltsov S., Nikolenko S. I., Koltsova O. Gibbs Sampler Optimization for Analysis of a Granulated Medium // Technical Physics Letters. 2016. Vol. 42. No. 8. P. 837-839.
- Apishev M., Koltsov S., Koltcova E. Y. Mining ethnic content online with additively regularized topic models // Computacion y Sistemas. 2016. Vol. 20. No. 3. P. 387-403.
- Koltcov S., Nikolenko S. I., Koltsova O., Filippov V., Bodrunova S. Stable Topic Modeling with Local Density Regularization // Internet Science: Proceedings of the 3rd International Conference INSCI 2016. Lecture Notes in Computer Science. Vol. 9934. Switzerland: Springer, 2016.
- Koltsov S., Nikolenko S. I., Koltsova O., Bodrunova S. Stable topic modeling for web science: Granulated LDA // WebSci 2016: Proceedings of the 2016 ACM Web Science Conference. 2016. P. 342-343.
- Nikolenko S. I., Koltcov S., Koltsova O. Topic modelling for qualitative studies // Journal of Information Science. 2015.
- Koltsov S., Koltsova O., Nikolenko S. I. Latent Dirichlet Allocation: Stability and Applications to Studies of User-Generated Content // Proceedings of the 2014 ACM Web Science Conference (WebSci '14), Bloomington, IN, USA, June 23-26, 2014. New York: ACM, 2014. P. 161-165.
- Nikolenko S. I., Koltsov S., Koltsova O. Measuring Topic Quality in Latent Dirichlet Allocation // Proceedings of the Philosophy, Mathematics, Linguistics: Aspects of Interaction 2014 Conference. St. Petersburg: The Euler International Mathematical Institute, 2014. P. 149-157.
- Bodrunova S., Nikolenko S. I., Koltcova E. Y., Koltsov S., Shimorina A. Interval Semi-Supervised LDA: Classifying Needles in a Haystack // Proceedings of the 12th Mexican International Conference on Artificial Intelligence (MICAI 2013), Part I: Advances in Artificial Intelligence and Its Applications. Berlin: Springer, 2013. P. 265-274.