Under hierarchical clustering, a dataset of n points is recursively partitioned into 2, 3, ..., k clusters. Typically, at each recursion step, a cluster is divided into two or more clusters according to a cost function (such as the popular k-means objective). In this context, two important questions arise: Which cost function is most appropriate? How can the overall cost of producing k clusters be estimated or approximated?
This paper proposes the maximum cluster radius as the cost function and introduces an approximation algorithm for hierarchical clustering. The algorithm repeatedly selects the point farthest from the existing cluster centers and promotes it to a new center. It then assigns points to clusters via a numbering scheme and an associated parent function, which together preserve the different levels of granularity. The proposed idea is justified by theorems and lemmas proven in the paper. Moreover, the paper highlights interesting open questions, and the references are adequate and closely related to the paper's content.
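The center-selection step described above can be sketched as a farthest-first traversal. The following is a minimal illustrative sketch, assuming Euclidean points; the paper's numbering scheme and parent function, which organize the centers into a hierarchy, are omitted here.

```python
import math

def farthest_first(points, k):
    """Farthest-first traversal: start from an arbitrary center, then
    repeatedly promote the point farthest from all chosen centers.
    Returns the k centers in the order they were chosen."""
    centers = [points[0]]
    # dist[i] = distance from points[i] to its nearest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    for _ in range(1, k):
        # the point farthest from every existing center becomes the next center
        i = max(range(len(points)), key=lambda j: dist[j])
        centers.append(points[i])
        # update each point's distance to its nearest center
        for j, p in enumerate(points):
            dist[j] = min(dist[j], math.dist(p, points[i]))
    return centers

centers = farthest_first([(0, 0), (10, 0), (0, 10), (1, 1)], 3)
print(centers)  # the outlying points are chosen as centers
```

Choosing centers this way greedily bounds the maximum cluster radius, which is why it pairs naturally with the maximum-radius cost function the paper proposes.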
This paper is of high interest to researchers working in data mining, since the proposed approximation algorithm applies to a generic dataset of points and provides performance guarantees for every k-clustering. Scientists with a strong mathematical background will be able to follow the paper. It could also serve as reading material for graduate courses on data mining.