In clustering problems, there are many unknowns and many potentially desirable sets of clusters. Any complex data set contains patterns relevant to many potential questions, and for any one question there are likely many ways to cluster the data. A fundamental problem, then, is how to define a set of clusters, called here a clustering pattern, that addresses the question of interest and recovers the best set of clusters for that question.
The approach taken by Kontonasios and De Bie, following previous work by De Bie in 2010 and 2011 [1,2], is to define a measure of the goodness of a cluster based on the joint likelihood of its observations under a normal probability model parameterized by the cluster's mean and covariance. The goodness-of-fit of the clustering pattern is then the joint likelihood across all clusters, which provides a means of comparing different clustering patterns.
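As a minimal sketch of this kind of measure (pure NumPy; the function names and the small covariance regularizer are illustrative choices, not the authors' implementation), the goodness of a clustering pattern could be scored as:

```python
import numpy as np

def cluster_log_likelihood(X):
    """Joint log-likelihood of the rows of X under a normal model
    fit to the cluster's own mean and covariance."""
    mu = X.mean(axis=0)
    # A small ridge keeps the covariance non-singular for tiny clusters.
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    diff = X - mu
    _, logdet = np.linalg.slogdet(cov)
    # Per-observation quadratic form diff_i^T cov^{-1} diff_i.
    quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return float(-0.5 * np.sum(quad + logdet + X.shape[1] * np.log(2 * np.pi)))

def pattern_log_likelihood(X, labels):
    """Goodness-of-fit of a clustering pattern: the sum of the
    per-cluster joint log-likelihoods."""
    return sum(cluster_log_likelihood(X[labels == k]) for k in np.unique(labels))
```

A higher total indicates a pattern whose clusters fit their observations more tightly, so two candidate patterns over the same data can be compared directly.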
The difficulties begin with the crucial question of which clustering pattern to prefer. As outlined here, the likelihood measure (the log likelihood is termed the self information of a clustering pattern) is used to evaluate the addition of another cluster to the clustering pattern. Such additions will always improve the likelihood, because the fit to the data improves as parameters are added. Kontonasios and De Bie therefore require both a way to choose the best new clustering pattern and a way to stop the clustering process, since the goodness-of-fit would otherwise improve indefinitely.
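This monotone improvement is easy to demonstrate. In the one-dimensional sketch below (an illustration, not the authors' model), nested refinements of the same data are each scored by a Gaussian likelihood fit separately per group; because every refinement adds free parameters, the total fitted log-likelihood can only rise:

```python
import numpy as np

def fitted_loglik(x):
    """Gaussian log-likelihood of 1-D data evaluated at its own
    fitted mean and variance (variance floored for stability)."""
    var = max(float(x.var()), 1e-12)
    return -0.5 * len(x) * (np.log(2 * np.pi * var) + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=320)
# Score ever-finer nested partitions of the same data (1, 2, 4, 8
# equal groups): the total fitted log-likelihood never decreases,
# so some external stopping rule is needed.
scores = [sum(fitted_loglik(part) for part in np.array_split(x, k))
          for k in (1, 2, 4, 8)]
```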
For each new clustering pattern, they use spectral relaxation to choose the clustering pattern that maximizes the improvement in self information relative to the projection onto the previous clusterings. For stopping, they suggest that the data miner decide when the information captured by the present clustering pattern is adequate to the needs of the project.
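The spectral step can be illustrated generically (this sketch shows only the relaxation idea on a symmetric similarity matrix, not the authors' exact objective or the projection onto previous clusterings):

```python
import numpy as np

def spectral_relaxation(A):
    """Relax the combinatorial search for a cluster indicator u in
    {0,1}^n maximizing u^T A u / u^T u to real unit vectors, which
    is solved exactly by the leading eigenvector of symmetric A;
    then round back to an indicator by thresholding."""
    _, V = np.linalg.eigh(A)        # eigenvalues in ascending order
    u = V[:, -1]                    # leading eigenvector
    u = u if u.sum() >= 0 else -u   # resolve the sign ambiguity
    return u > u.mean()             # rounded cluster indicator
```

On a block-structured similarity matrix, the rounded eigenvector picks out the dominant block, which is the appeal of the relaxation: an NP-hard indicator search becomes one eigendecomposition.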
The fundamental question is when this process is useful and when it is dangerous. For many data mining applications, the miner can safely judge when adequate information has been captured, generally through the use of a gold standard set and the generation of precision-recall curves. In many data mining projects, the application of user-chosen stopping criteria together with a good method for generating added clusterings, as done here, is valuable, as it provides a way to recover information rapidly. Where it is dangerous is in the realm of fundamental research, where the user has a tendency to mine high-throughput data until “discovering” what is already known, and then stop: a classic case of confirmation bias. In these cases, including molecular biology and automated astronomy, we still lack a fundamental measure to provide a satisfactory level of confidence that we have indeed mined the data adequately.
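Such a gold-standard comparison is often phrased as pair-counting precision and recall, a standard construction (the function below is a sketch, not tied to any particular benchmark): a pair of observations counts as retrieved when co-clustered in the mined pattern, and as relevant when co-clustered in the gold standard.

```python
from itertools import combinations

def pair_precision_recall(pred, gold):
    """Pair-counting precision and recall of a clustering (pred)
    against a gold-standard labeling (gold) of the same items."""
    idx = range(len(pred))
    retrieved = {p for p in combinations(idx, 2) if pred[p[0]] == pred[p[1]]}
    relevant = {p for p in combinations(idx, 2) if gold[p[0]] == gold[p[1]]}
    tp = len(retrieved & relevant)  # pairs co-clustered in both
    return tp / max(len(retrieved), 1), tp / max(len(relevant), 1)
```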
Given a willingness to allow subjective criteria, one potential point of interest would be the imposition of constraints a priori, rather than “on the fly” as clusters are added. The review by Nelson and Cohen of probabilistic clustering methods to which pairwise constraints can be added would be of interest here [3]. The miner could then require that certain observations cluster together, potentially guiding the method more quickly toward interesting clusters.
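One simple a priori mechanism (sketched here with union-find; an illustration rather than anything drawn from [3]) is to collapse must-link observations into shared groups before any clustering begins:

```python
def merge_must_links(n, must_link):
    """Union-find over must-link pairs: observations constrained to
    cluster together are collapsed into one group label up front,
    so any downstream method treats them as inseparable."""
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)  # union the two groups
    roots = sorted({find(i) for i in range(n)})
    label = {r: k for k, r in enumerate(roots)}
    return [label[find(i)] for i in range(n)]
```

A downstream clusterer can then operate on the group labels (or on group centroids) instead of raw observations, guaranteeing the constraints hold.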
As noted above, the approach of Kontonasios and De Bie uses Euclidean distance with a normal probability model as the metric. While this is a reasonable default, the assumption of the normal model, with its independent, identically distributed observations, is violated in many domains. The use of more flexible metrics, such as mutual information [4], would be an interesting addition to the method proposed here, if the mathematics allows the replacement of the normal likelihood with an information-based one.
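As a sketch of such a metric (the histogram binning scheme is an illustrative choice), mutual information between two observation vectors can be estimated with no normality assumption at all:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram (plug-in) estimate of the mutual information, in
    nats, between two observation vectors: the KL divergence of the
    empirical joint from the product of the empirical marginals."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                       # joint distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y
    nz = pxy > 0                           # skip empty cells (0 log 0 = 0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

Unlike Euclidean distance, this estimate responds to any statistical dependence between the vectors, not only linear correlation, which is part of its appeal in domains where the normal model fails.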