Topic modeling and text analysis are central tasks in any methodology for the hierarchical analysis of document collections. Text categorization, text compression, and text summarization are well-known applications, but structures predicted through machine learning are also used in computational biology, image understanding, and language modeling for speech recognition, to name just a few examples.
During the last decade, researchers in machine learning have proposed both parametric and nonparametric models. In this paper, the authors describe a new algorithm for building a topic hierarchy over a document collection. They propose an unsupervised learning method based on the nested Chinese restaurant process (nCRP) and Bayesian nonparametric (BNP) inference, and evaluate it on three collections of documents from different domains.
The paper consists of seven sections, with more than 70 references. Section 1 surveys the state of the art and situates the problem in its scientific context, and Section 2 reviews stochastic process theory and BNP statistics. Readers should be familiar with the Dirichlet and beta distributions, the CRP, the Pólya urn scheme, the stick-breaking process, the Griffiths-Engen-McCloskey (GEM) distribution, and their connections to random partitions of the integers and flexible clustering strategies.
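To make the CRP concrete for readers unfamiliar with it, the following is a minimal sketch of its sequential seating rule (the function name and parameters are illustrative, not from the paper): each new customer joins an existing table with probability proportional to its occupancy, or opens a new table with probability proportional to a concentration parameter alpha.

```python
import random

def crp_partition(n_customers, alpha, seed=0):
    """Sketch of the Chinese restaurant process: returns table
    occupancy counts after seating n_customers sequentially."""
    rng = random.Random(seed)
    tables = []  # occupancy count per table
    for i in range(n_customers):
        # existing tables weighted by size; a new table weighted by alpha
        total = i + alpha  # counts seated so far sum to i
        r = rng.uniform(0.0, total)
        cum, choice = 0.0, len(tables)
        for t, count in enumerate(tables):
            cum += count
            if r < cum:
                choice = t
                break
        if choice == len(tables):
            tables.append(1)  # open a new table
        else:
            tables[choice] += 1
    return tables
```

Running `crp_partition(100, alpha=1.0)` yields a random partition of 100 customers whose number of tables grows roughly logarithmically in the sample size, the "rich get richer" behavior that underlies CRP-based clustering.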
Section 3 introduces the nCRP and discusses “how similar ideas can be used to define a probability distribution on infinitely deep, infinitely branching trees.” In Section 4, based on the nCRP, the authors extend the latent Dirichlet allocation (LDA) topic model to a hierarchical version (hLDA). This generative process defines a probability distribution over possible corpora of documents. Related work is discussed at the end of the section.
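The nesting idea can be sketched as follows (a simplified illustration under the assumption of a truncated, fixed depth; names are hypothetical, not the authors' code): each document descends the tree one level at a time, choosing a child at every node via an independent CRP, so popular branches attract more documents while new branches can always open.

```python
import random

def ncrp_paths(n_docs, depth, gamma, seed=0):
    """Sketch of the nested CRP: each document draws a root-to-leaf
    path of the given depth; at each node a CRP with concentration
    gamma picks an existing child or creates a new one."""
    rng = random.Random(seed)
    children = {}  # node (path prefix) -> child visit counts
    paths = []
    for _ in range(n_docs):
        path = ()
        for _level in range(depth):
            counts = children.setdefault(path, [])
            total = sum(counts) + gamma
            r = rng.uniform(0.0, total)
            cum, choice = 0.0, len(counts)
            for c, count in enumerate(counts):
                cum += count
                if r < cum:
                    choice = c
                    break
            if choice == len(counts):
                counts.append(1)  # open a new branch
            else:
                counts[choice] += 1
            path = path + (choice,)
        paths.append(path)
    return paths
```

In hLDA, each node on such a path carries a topic, and a document's words are drawn from the topics along its path, with the GEM distribution governing the allocation of words to levels.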
Next, the authors develop an algorithm based on Markov chain Monte Carlo (MCMC) and Gibbs sampling. Section 5 analyzes the algorithm in terms of level allocations, path sampling, hyperparameters, convergence, and the posterior mode. Section 6 presents examples and empirical results for both simulated and real text data. Section 7 is dedicated to discussion and conclusions on defining prior distributions over trees, based on the nCRP and hLDA, and on developing a BNP methodology for analyzing document collections.
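For readers new to MCMC, the general Gibbs-sampling pattern the authors rely on, which is alternately resampling each variable from its conditional given the others, can be illustrated on a toy target (a standard bivariate normal with correlation rho; this is not the authors' sampler, only the pattern):

```python
import math
import random

def gibbs_bivariate_normal(rho, n_iter, seed=0):
    """Toy Gibbs sampler for a standard bivariate normal with
    correlation rho: each coordinate is drawn from its conditional,
    x | y ~ N(rho * y, 1 - rho**2), and symmetrically for y."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    samples = []
    for _ in range(n_iter):
        x = rng.gauss(rho * y, sd)
        y = rng.gauss(rho * x, sd)
        samples.append((x, y))
    return samples
```

In the paper, the same alternation happens over discrete structures: the per-word level allocations and the per-document paths through the tree are resampled in turn from their conditionals.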
While the content of the collections varies significantly, the statistical principles behind the proposed model allow for the recovery of meaningful sets of topics at multiple levels of abstraction, using trees.
Although this paper is a valuable contribution to the field, further work is needed on hyperparameter selection and convergence speed. The paper is long, with some misprints and omissions, and the list of references does not meet the journal’s typical standards.