In clustering problems, there are many unknowns and many potentially desirable sets of clusters. Any complex data set contains patterns relevant to many potential questions, and for any one question there are likely many ways to cluster the data. A fundamental problem, then, is how to define a set of clusters, called here a clustering pattern, that addresses the question of interest and recovers the best set of clusters for that question.
The approach taken by Kontonasios and De Bie, following previous work by De Bie in 2010 and 2011 [1,2], is to define a measure of the goodness of a cluster based on the joint likelihood of its observations under a normal probability model parameterized by the cluster's mean and covariance. The goodness-of-fit of the clustering pattern is then the joint likelihood across all clusters, which provides a means of comparing different clustering patterns.
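As a minimal sketch of this kind of measure (pure NumPy; the function names and the small covariance regularizer are illustrative choices, not the authors' implementation), the goodness of a clustering pattern could be scored as:

```python
import numpy as np

def cluster_log_likelihood(X):
    """Joint log-likelihood of the rows of X under a normal model
    fit to the cluster's own mean and covariance."""
    mu = X.mean(axis=0)
    # A small ridge keeps the covariance non-singular for tiny clusters.
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    # Per-observation quadratic form diff_i^T cov^{-1} diff_i.
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return float(-0.5 * np.sum(quad + logdet + X.shape[1] * np.log(2 * np.pi)))

def pattern_log_likelihood(X, labels):
    """Goodness-of-fit of a clustering pattern: the sum of the
    per-cluster joint log-likelihoods."""
    return sum(cluster_log_likelihood(X[labels == k]) for k in np.unique(labels))
```

A higher total indicates a pattern whose clusters fit their observations more tightly, so two candidate patterns over the same data can be compared directly.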
The difficulties begin with the crucial question of which clustering pattern to prefer. As outlined here, the likelihood measure (the log likelihood is termed the self information of a clustering pattern) is used to evaluate the addition of another cluster to the clustering pattern. Such additions will always improve the likelihood, because the fit to the data improves as parameters are added. Kontonasios and De Bie therefore require both a way to choose the best new clustering pattern and a way to stop the clustering process, since the goodness-of-fit would otherwise improve indefinitely.
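This monotone improvement is easy to demonstrate. In the one-dimensional sketch below (an illustration, not the authors' model), nested refinements of the same data are each scored by a Gaussian likelihood fit separately per group; because every refinement adds free parameters, the total fitted log-likelihood can only rise:

```python
import numpy as np

def fitted_loglik(x):
    """Gaussian log-likelihood of 1-D data evaluated at its own
    fitted mean and variance (variance floored for stability)."""
    var = max(float(x.var()), 1e-12)
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=320)
# Score ever-finer nested partitions of the same data (1, 2, 4, 8
# equal groups): the total fitted log-likelihood never decreases,
# so some external stopping rule is needed.
scores = [sum(fitted_loglik(part) for part in np.array_split(x, k))
          for k in (1, 2, 4, 8)]
```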
For each new clustering pattern, they use spectral relaxation to choose the clustering pattern that maximizes the improvement in self information relative to the projection onto the previous clusterings. For stopping, they suggest that the data miner decide when the information captured by the present clustering pattern is adequate to the needs of the project.
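The spectral step can be illustrated generically (this sketch shows only the relaxation idea on a symmetric similarity matrix, not the authors' exact objective or the projection onto previous clusterings):

```python
import numpy as np

def spectral_relaxation(A):
    """Relax the combinatorial search for a cluster indicator u in
    {0,1}^n maximizing u^T A u / u^T u to real unit vectors, which
    is solved exactly by the leading eigenvector of symmetric A;
    then round back to an indicator by thresholding."""
    _, V = np.linalg.eigh(A)        # eigenvalues in ascending order
    u = V[:, -1]                    # leading eigenvector
    u = u if u.sum() >= 0 else -u   # resolve the sign ambiguity
    return u > u.mean()             # rounded cluster indicator
```

On a block-structured similarity matrix, the rounded eigenvector picks out the dominant block, which is the appeal of the relaxation: an NP-hard indicator search becomes one eigendecomposition.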
The fundamental question is when this process is useful and when it is dangerous. For many data mining applications, the miner can safely judge when adequate information has been captured, generally through the use of a gold standard set and the generation of precision-recall curves. In many data mining projects, the application of user-chosen stopping criteria together with a good method for generating added clusterings, as done here, is valuable, as it provides a way to recover information rapidly. Where it is dangerous is in the realm of fundamental research, where the user has a tendency to mine high-throughput data until “discovering” what is already known, and then stop: a classic case of confirmation bias. In these cases, including molecular biology and automated astronomy, we still lack a fundamental measure to provide a satisfactory level of confidence that we have indeed mined the data adequately.
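Such a gold-standard comparison is often phrased as pair-counting precision and recall, a standard construction (the function below is a sketch, not tied to any particular benchmark): a pair of observations counts as retrieved when co-clustered in the mined pattern, and as relevant when co-clustered in the gold standard.

```python
from itertools import combinations

def pair_precision_recall(pred, gold):
    """Pair-counting precision and recall of a clustering (pred)
    against a gold-standard labeling (gold) of the same items."""
    idx = range(len(pred))
    retrieved = {p for p in combinations(idx, 2) if pred[p[0]] == pred[p[1]]}
    relevant = {p for p in combinations(idx, 2) if gold[p[0]] == gold[p[1]]}
    tp = len(retrieved & relevant)  # pairs co-clustered in both
    return tp / max(len(retrieved), 1), tp / max(len(relevant), 1)
```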
Given a willingness to allow subjective criteria, one potential point of interest would be the imposition of constraints a priori, rather than “on the fly” as clusters are added. The review by Nelson and Cohen of probabilistic clustering methods to which pairwise constraints can be added would be of interest here [3]. The miner could then require that certain observations cluster together, potentially guiding the method more quickly toward interesting clusters.
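One simple a priori mechanism (sketched here with union-find; an illustration rather than anything drawn from [3]) is to collapse must-link observations into shared groups before any clustering begins:

```python
def merge_must_links(n, must_link):
    """Union-find over must-link pairs: observations constrained to
    cluster together are collapsed into one group label up front,
    so any downstream method treats them as inseparable."""
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)  # union the two groups
    roots = sorted({find(i) for i in range(n)})
    label = {r: k for k, r in enumerate(roots)}
    return [label[find(i)] for i in range(n)]
```

A downstream clusterer can then operate on the group labels (or on group centroids) instead of raw observations, guaranteeing the constraints hold.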
As noted above, the approach of Kontonasios and De Bie uses Euclidean distance with a normal probability model as the metric. While this is a reasonable default, the assumption of the normal model, with its independent, identically distributed observations, is violated in many domains. The use of more flexible metrics, such as mutual information [4], would be an interesting addition to the method proposed here, if the mathematics allows the replacement of the normal likelihood with an information-based one.
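As a sketch of such a metric (the histogram binning scheme is an illustrative choice), mutual information between two observation vectors can be estimated with no normality assumption at all:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram (plug-in) estimate of the mutual information, in
    nats, between two observation vectors: the KL divergence of the
    empirical joint from the product of the empirical marginals."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                       # joint distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y
    nz = pxy > 0                           # skip empty cells (0 log 0 = 0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

Unlike Euclidean distance, this estimate responds to any statistical dependence between the vectors, not only linear correlation, which is part of its appeal in domains where the normal model fails.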