Model-based data analysis is a powerful, increasingly popular tool for inferring knowledge and abstract information from data sets. To date, the model-based learning literature has been dominated largely by Gaussian mixtures. However, the conventional use of Gaussian mixtures is not always appropriate, especially when the data partitions are not Gaussian. In such cases, as the authors reported in their previous work, “the inverted Dirichlet mixture model and generalized inverted Dirichlet mixture model [may] outperform [Gaussian mixtures] in terms of clustering accuracy.” In this paper, the authors propose a variational Bayesian framework for the infinite generalized inverted Dirichlet mixture with feature selection. They further investigate the capabilities of the framework through computational experiments on both synthetic and real data sets.
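For readers unfamiliar with the distribution family at the heart of the paper, the following is a minimal sketch of evaluating the inverted Dirichlet log-density on positive data; the parameter values here are arbitrary illustrations, not taken from the paper.

```python
import math

def inverted_dirichlet_logpdf(x, alpha):
    """Log-density of the inverted Dirichlet distribution over a positive
    vector x = (x_1, ..., x_D), with parameters alpha = (a_1, ..., a_{D+1}):
    p(x) = Gamma(sum a) / prod_j Gamma(a_j)
           * prod_{j<=D} x_j^(a_j - 1) / (1 + sum_j x_j)^(sum a)."""
    s = sum(alpha)
    log_norm = math.lgamma(s) - sum(math.lgamma(a) for a in alpha)
    # zip() pairs each x_j with a_j for j = 1..D; alpha's last entry
    # enters only through the normalizing terms above and below.
    log_kernel = sum((a - 1.0) * math.log(xj) for xj, a in zip(x, alpha))
    return log_norm + log_kernel - s * math.log1p(sum(x))

# Illustrative evaluation on a positive 2-D point (arbitrary parameters).
print(inverted_dirichlet_logpdf([0.5, 1.2], [2.0, 3.0, 4.0]))
```

In one dimension this reduces to the beta prime density, which is one way to sanity-check the formula.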
The aim of the reported variational inference framework “is to determine a distribution Q ... that approximates the true posterior distribution” of the data. Q is selected from a restricted family of distributions that can be factorized into disjoint, tractable distributions. Although the model is a full Dirichlet process, the variational distribution used in the calculation is truncated, with two variational parameters controlling the levels of truncation. Through the learning process, reported as an algorithm in ten major steps, the two parameters are optimized iteratively. To examine the performance of the proposed approach, computational experiments are carried out on both synthetic and real-world data. For the synthetic data sets, the algorithm optimally selects the parameters that recover the actual clusters. Two real-world experiments are performed: visual scene classification and digit categorization. In both, the algorithm is capable of assigning different weights to features, reflecting their significance in the clustering process. The feature selection process brings noticeable improvements in clustering accuracy in both experiments.
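The truncation step described above can be illustrated with the stick-breaking construction commonly used in variational treatments of Dirichlet process mixtures: the expected mixture weights are computed from Beta-distributed stick fractions cut off at a finite truncation level. The truncation level and the Beta parameters below are my own illustrative choices, not the paper's.

```python
import numpy as np

def stick_breaking_weights(a, b):
    """Expected mixture weights under independent Beta(a_t, b_t) sticks,
    truncated at T = len(a): pi_t = E[v_t] * prod_{s<t} (1 - E[v_s])."""
    v = a / (a + b)                        # posterior mean of each stick fraction
    # Mass remaining before stick t is the product of (1 - v_s) for s < t.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    pi = v * remaining
    pi[-1] = 1.0 - pi[:-1].sum()           # fold leftover mass into the last stick
    return pi

# Illustrative variational parameters for a truncation level of T = 4.
a = np.array([5.0, 3.0, 1.0, 1.0])
b = np.array([1.0, 2.0, 4.0, 4.0])
pi = stick_breaking_weights(a, b)
print(pi, pi.sum())                        # weights sum to 1 by construction
```

A larger truncation level costs more computation but, as in the paper's framework, lets the effective number of clusters be inferred rather than fixed in advance: components whose expected weight shrinks toward zero are effectively pruned.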
The authors claim that the proposed algorithm “can be used for any positive data and has promising applications in different areas that have [a] huge amount of data to be clustered and analyzed.” Researchers and practitioners in data science, especially those interested in model-based learning and clustering, should benefit from reading this paper. Feature selection strategies and accuracy improvements are critical in analyzing high-dimensional big data. In addition, the proposed approach provides an alternative to the popular Gaussian mixtures in model-based data analysis; however, direct performance comparisons against Gaussian mixtures are not reported in the paper.