Zhang et al. propose a semi-supervised clustering algorithm, called SSCGD, that addresses a specific class of such techniques: probabilistic clustering. The algorithm optimizes a given Gaussian mixture model (GMM) by adding, on the one hand, more probabilistic information and, on the other, knowledge derived from the geometric organization of the labeled and unlabeled elements in the training set. To this end, the authors adapt the original objective function to include Kullback-Leibler (KL) divergences between model components and weighted distance measurements between training elements. The expectation-maximization (EM) algorithm is then applied to estimate the new model parameters. The work relates to earlier attempts to refine the GMM structure by altering its objective function, such as the Laplacian regularized GMM (LapGMM) and the locally consistent GMM (LCGMM), or approaches based on the Jensen-Shannon divergence. Most of these are discussed in the introduction, which surveys the state of the art in the design of semi-supervised and hybrid clustering techniques.
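For readers unfamiliar with the baseline the paper modifies, the following is a minimal sketch of the standard EM updates for a spherical GMM. It does not implement SSCGD itself; the KL terms between components and the weighted distance terms over labeled pairs that the authors add to the objective are omitted, and the optional `mu0` initialization parameter is an assumption of this sketch, not part of the paper.

```python
import numpy as np

def em_gmm(X, k, n_iter=50, seed=0, mu0=None):
    """Minimal EM for a spherical Gaussian mixture (the unmodified
    baseline; SSCGD adds KL and weighted-distance terms on top)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize means (from mu0 if given, else random data points),
    # spherical variances, and uniform mixing weights.
    mu = np.array(mu0, dtype=float) if mu0 is not None \
        else X[rng.choice(n, k, replace=False)]
    var = np.full(k, X.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = p(component j | x_i),
        # computed in log space for numerical stability.
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)          # (n, k)
        logp = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var)
        logp += np.log(pi)
        logp -= logp.max(1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(1, keepdims=True)
        # M-step: re-estimate means, variances, and weights.
        nk = r.sum(0)
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None]) ** 2).sum(-1)
        var = (r * sq).sum(0) / (nk * d) + 1e-8
        pi = nk / n
    return mu, pi, r
```

In SSCGD, as the review describes it, these M-step updates would no longer have closed forms of this simple kind, since the objective also penalizes divergence among components and weighted distances among training elements.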
The experimental section evaluates the SSCGD algorithm against plain GMM and k-means, as well as against semi-supervised algorithms such as PCK-Means and the transductive support vector machine (T-SVM), at varying proportions of labeled data. The evaluation is based on an adapted F1 measure. The authors use real-world datasets; one of them, the Chinese Word Sense Induction dataset, is fully labeled. It would be interesting to know how labeled data can be obtained in the case of unlabeled datasets.
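The paper's adapted F1 formula is not reproduced in this review; as a hypothetical stand-in, a common pairwise F1 for comparing a clustering against ground truth, which is invariant to cluster relabeling, can be sketched as:

```python
from itertools import combinations

def pairwise_f1(labels_true, labels_pred):
    """Pairwise F1 for clusterings: a pair of items is a true positive
    when both partitions place the two items in the same cluster.
    (A standard stand-in; the paper adapts F1 in its own way.)"""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Because it is defined over pairs rather than cluster labels, such a measure needs no matching between predicted and true cluster identifiers, which makes it convenient when the number of clusters differs between the two partitions.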
Aside from some careless formulations (for instance, the experimental results “indicate that the SSCGD algorithm to integrated distance metric and Gaussian mixture model in clustering can lead to improvements in cluster quality”), the work demonstrates solid grounding and a keen investigation of new facets of clustering structure, and represents a worthy attempt to enhance classification techniques. These are good reasons for pattern recognition researchers to try it.