This paper presents WETU, a method for improving clustering ensembles. Current clustering ensemble methods use measures such as the weighted connection-triple (WCT), the weighted triple-quality (WTQ), and the combined similarity measure (CSM), which combines WCT and WTQ, to quantify the relations among data points within a cluster. The proposed method additionally considers the relations among the clusters themselves, so that the resultant clusters are more accurate, stable, and meaningful.
The general method for clustering ensembles works as follows. A base algorithm, such as k-means, is first used to cluster the raw data under different sets of initialization conditions, each of which results in a different collection of clusters. Then, a link analysis is performed on the resulting collections of clusters. Here, “links” refers to the relations among clusters produced under the same initial conditions (one run) and under different conditions (different runs). Within a run, because of hard clustering (that is, each data point belongs to exactly one cluster), there is no explicit link among different clusters. WETU measures the similarity between two clusters based on their common neighboring clusters in different runs.
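The link structure described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the Jaccard overlap used as the link weight is an assumption (a weighting common in link-based ensemble work), and the names `clusters_from_labels`, `link_weights`, and the toy `runs` data are hypothetical.

```python
from itertools import combinations


def clusters_from_labels(labels, run_id):
    """Turn one run's hard label vector into {(run_id, label): set of point indices}."""
    clusters = {}
    for point, label in enumerate(labels):
        clusters.setdefault((run_id, label), set()).add(point)
    return clusters


def link_weights(runs):
    """Weight every pair of clusters by the Jaccard overlap of their members.

    Assumption: the link weight is |A & B| / |A | B|. Pairs with no shared
    points get no link at all.
    """
    clusters = {}
    for run_id, labels in enumerate(runs):
        clusters.update(clusters_from_labels(labels, run_id))
    weights = {}
    for a, b in combinations(sorted(clusters), 2):
        shared = len(clusters[a] & clusters[b])
        if shared:
            weights[(a, b)] = shared / len(clusters[a] | clusters[b])
    return clusters, weights


# Hypothetical toy ensemble: two runs of a base algorithm over six data points.
runs = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
]
clusters, weights = link_weights(runs)
for pair, weight in weights.items():
    print(pair, round(weight, 3))
```

Because each run is a hard partition, two clusters from the same run never share a point, so every link in `weights` joins clusters from different runs. Any relation between same-run clusters therefore has to be inferred through their neighbors.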
Specifically, WETU measures the relation between any two clusters X and Y as a fraction f(X,Y,Z)/g(X,Y,Z), assuming cluster Z has links to both X and Y. The main factor in the numerator is the number of weighted links between (X,Z) and (Y,Z). The denominator measures the weighted links between Z and the rest of the collection. The larger the numerator, the more “common” elements X and Y share; the larger the denominator, the smaller the contribution of cluster Z to the commonality between X and Y. The novelty of WETU is its ability to measure commonality based on the neighbors of clusters that do not have direct common points.
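To make the shape of this fraction concrete, here is a minimal Python sketch. It is not the paper's formula: it assumes the numerator takes the weaker of the two links joining Z to X and to Y, that the denominator sums all link weights incident to Z, and that the contributions of all common neighbors Z are added up. The function names and the toy weights are hypothetical.

```python
def neighbors(cluster, weights):
    """Return {other_cluster: weight} for every link touching `cluster`."""
    linked = {}
    for (a, b), w in weights.items():
        if a == cluster:
            linked[b] = w
        elif b == cluster:
            linked[a] = w
    return linked


def common_neighbor_similarity(x, y, weights):
    """Score x and y through the clusters z that are linked to both of them."""
    nx, ny = neighbors(x, weights), neighbors(y, weights)
    score = 0.0
    for z in nx.keys() & ny.keys():
        # Assumed numerator: the weaker of the two links (x, z) and (y, z).
        numerator = min(nx[z], ny[z])
        # Assumed denominator: total weight of all links incident to z.
        denominator = sum(neighbors(z, weights).values())
        score += numerator / denominator
    return score


# Hypothetical link weights: A* clusters come from one run, B* from another.
weights = {
    ("A1", "B1"): 0.5,
    ("A2", "B1"): 0.25,
    ("A2", "B2"): 0.75,
    ("A3", "B2"): 0.5,
}
sim_a1_a2 = common_neighbor_similarity("A1", "A2", weights)
sim_a1_a3 = common_neighbor_similarity("A1", "A3", weights)
print(sim_a1_a2, sim_a1_a3)
```

In this toy example A1 and A2 share no data points, yet they receive a nonzero score through their common neighbor B1, while A1 and A3 have no common neighbor and score zero. A neighbor that is linked to many clusters has a large denominator and so contributes less, matching the behavior described above.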
The authors used six datasets, two synthetic and four real, to compare the different methods. The datasets range in size from 150 to 2,500 data points, with 10 to 60 features. The methods compared include k-means clustering (KMC), base clustering I, CSM + global k-means clustering (CSM+GKMC), WTU+GKMC, and WETU+GKMC. The comparison metrics are clustering accuracy (CA) and normalized mutual information (NMI). All results indicate that WETU outperforms the other methods.
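For readers unfamiliar with the two metrics, the following Python sketch shows one common way to compute them. The exact definitions used in the paper are not given in this review, so two assumptions are made: CA assigns each cluster its majority class, and NMI normalizes the mutual information by the geometric mean of the two entropies. The function names and toy labels are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def clustering_accuracy(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, c in zip(true_labels, cluster_labels):
        by_cluster.setdefault(c, Counter())[t] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return correct / len(true_labels)


def entropy(labels):
    """Shannon entropy (natural log) of a label vector."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(true_labels, cluster_labels):
    """Mutual information divided by sqrt(H(true) * H(cluster)) (assumed normalization)."""
    n = len(true_labels)
    count_t = Counter(true_labels)
    count_c = Counter(cluster_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    mutual = sum(
        (n_tc / n) * log(n * n_tc / (count_t[t] * count_c[c]))
        for (t, c), n_tc in joint.items()
    )
    norm = sqrt(entropy(true_labels) * entropy(cluster_labels))
    return mutual / norm if norm > 0 else 0.0


# Hypothetical ground truth and two candidate clusterings of six points.
truth = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]   # same partition, cluster ids swapped
noisy = [0, 0, 1, 1, 1, 1]     # one point placed in the wrong cluster
print(clustering_accuracy(truth, perfect), nmi(truth, perfect))
print(clustering_accuracy(truth, noisy), nmi(truth, noisy))
```

Both metrics ignore how cluster ids are named, which is why the relabeled `perfect` clustering still scores as a full match.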
In summary, this work introduces a new and effective method to manage clustering ensembles. WETU offers a different perspective for researchers in the clustering ensemble area. The contribution is significant, but the paper could have been written more clearly to convey its ideas effectively.