Computing Reviews
Data clustering: theory, algorithms, and applications
Gan G., Ma C., Wu J., SIAM, Philadelphia, PA, 2020. 406 pp. Type: Book (978-1-611976-32-8)
Date Reviewed: Mar 31 2022

Data clustering is an unsupervised method of grouping data such that objects in the same cluster are similar and objects in different clusters are distinct. Such techniques are applied in a wide range of areas, including artificial intelligence, image processing, and biology.

This second edition introduces concepts, algorithms, software packages, and applications related to data clustering. Though the authors do not claim comprehensive coverage of the field, the breadth and depth of the material presented make it a very valuable textbook as well as reference.

The organization of the book is very good. There are 22 chapters spread over four parts. The first part (six chapters) covers the basics, the second part (12 chapters) describes the various algorithms, the third part (two chapters) reviews the available open-source software, and the fourth part considers two example applications with results. Since the material is self-contained, readers familiar with the basics of linear algebra and statistics will be able to easily follow the content.

Data variables may not be measured on the same scale, and in some cases numerical data may need to be converted to nominal scales. In addition, when a dataset involves different types of variables, they need to be transformed into a common type, with appropriate weight given to each variable. Such considerations, along with methods of data visualization, are covered in the first part. Every clustering algorithm is based on some measure of similarity or dissimilarity, so several such measures are discussed in detail here.
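The scaling issue can be made concrete with a small sketch. It is an illustration only, not taken from the book: it assumes z-score standardization as the rescaling step and Euclidean distance as the dissimilarity measure, and the data values and the function names `standardize` and `euclidean` are hypothetical.

```python
import math

def standardize(rows):
    """Rescale each column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c))
            for c, m in zip(cols, means)]
    # A constant column has zero spread; map it to 0.0 to avoid dividing by zero.
    return [[(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
            for row in rows]

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical objects: (height in metres, income in dollars).
raw = [[1.60, 30000.0], [1.95, 30500.0], [1.61, 33000.0]]
scaled = standardize(raw)

# On the raw data the income column dominates the distance almost entirely;
# after standardization both variables contribute on a comparable scale.
raw_distance = euclidean(raw[0], raw[1])
scaled_distance = euclidean(scaled[0], scaled[1])
```

On the raw values, the distance between the first two objects is essentially the income difference, even though their heights differ markedly; standardizing first lets both variables influence the result.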

Clustering problems can generally be divided into hard clustering and fuzzy clustering. In hard clustering, each data point belongs to one and only one cluster; in fuzzy clustering, a data point may belong to two or more clusters with varying degrees of membership. Hard clustering can be hierarchical or partitional: “unlike hierarchical algorithms, partitional algorithms create a one-level non-overlapping partitioning of the data points” [1].
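The difference between hard and fuzzy assignment can be sketched for a single data point and a fixed set of cluster centers. This is a minimal illustration, not the book's presentation: it assumes Euclidean distance and the standard fuzzy c-means membership formula with fuzzifier `m`, and the function names and sample values are hypothetical.

```python
import math

def hard_assignment(point, centers):
    """Return the index of the single nearest center (hard clustering)."""
    dists = [math.dist(point, c) for c in centers]
    return dists.index(min(dists))

def fuzzy_memberships(point, centers, m=2.0):
    """Return one membership degree per center; the degrees sum to 1.

    Uses the fuzzy c-means membership formula
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with fuzzifier m > 1.
    """
    dists = [math.dist(point, c) for c in centers]
    # If the point coincides with one or more centers, share full membership among them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists) for d_i in dists]

# Hypothetical example: two centers on a line and a point closer to the first.
centers = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)
label = hard_assignment(point, centers)          # index of the nearest center
memberships = fuzzy_memberships(point, centers)  # graded membership in both clusters
```

The hard assignment commits the point to one cluster, while the fuzzy memberships record that it is mostly, but not exclusively, associated with the nearer center.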

Special tree structures called dendrograms are often used in hierarchical clustering and are described very well with several examples. Hierarchical algorithms are subdivided into agglomerative and divisive algorithms. Agglomerative methods start with each object in a different cluster and then repeatedly merge the closest pair of clusters according to some similarity criterion. Divisive methods start with all objects in one cluster and repeatedly split large clusters into smaller pieces. With both methods, the quality of the results varies with the similarity measure adopted. Also, any incorrect grouping decision at an earlier step carries through to the final clustering. The book covers the relative merits of such methods and provides pointers to earlier publications that suggest improvements.
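The agglomerative procedure described above can be sketched in a few lines. This is a naive illustration under stated assumptions, not the book's implementation: it uses single linkage (the distance between two clusters is the distance between their closest members) with Euclidean distance, and the function name `agglomerative` and the sample points are hypothetical.

```python
import math

def agglomerative(points, num_clusters):
    """Merge the closest pair of clusters until num_clusters remain.

    Returns the final clusters (lists of point indices) and the merge
    history, which is the information a dendrogram would display.
    """
    if num_clusters < 1:
        raise ValueError("num_clusters must be at least 1")

    # Start with every object in its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: smallest distance between any two members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters, merges

# Hypothetical data: two tight pairs and one distant outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, num_clusters=2)
```

Because a merge is never revisited, a poor early merge stays in every later level of the hierarchy. Swapping the `linkage` function for complete or average linkage changes which pairs count as closest, and with it the resulting clusters.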

The second part of the book describes various hierarchical as well as fuzzy algorithms in depth. The algorithms covered span a wide range of approaches: center-based, search-based, graph-based, grid-based, density-based, and model-based. High-dimensional datasets are inherently sparse, and conventional clustering methods do not scale well on them. In addition, clusters are embedded in subspaces of the high-dimensional data space. Dimension reduction, the handling of big data, and the evaluation of clustering algorithms are covered in the later chapters.

The third part of the book reviews available open-source packages in R and Python. It also presents a Java-based clustering framework. The fourth part of the book explores applications like gene expression data and the valuation of large variable annuity portfolios with metamodeling.

Overall, the presentation style is very pleasant. A broad spectrum of data clustering models, methods, and metrics is elaborated, along with ample references, and thus the book meets the needs of students and practitioners, as well as researchers.

Reviewer: Paparao Kavalipati
Review #: CR147425
1) Dinh, D.-T.; Huynh, V.-N. k-CCM: a center-based algorithm for clustering categorical data with missing values. In: Modeling decisions for artificial intelligence (LNAI 11144). 267-279, Springer, 2018.
Clustering (H.3.3 ... )
 