The first edition of this book, published in 2006, was probably the best introductory textbook on data mining available. A dozen years later, the field has evolved to become mainstream under the commercial denomination of “big data,” or its more academic appellative “data science.” Due to the maturity of the field, this new edition can be considered only a minor update to the already outstanding original book.
The book starts with a chapter on the scope, motivation, origin, and key tasks addressed by data mining techniques. A second chapter discusses data as the necessary raw material for data mining: the different kinds of attributes, quality issues, preprocessing techniques, and the measures of similarity (or dissimilarity) often used within a number of data mining algorithms. The introductory chapters cover roughly the first 100 pages of the book, and their sections are structured in exactly the same way they were 12 years ago.
The core of the book also mirrors the organization of the first edition. It is devoted to the four key data mining areas: classification, association, clustering, and anomaly detection. Given its introductory nature, the book describes the main techniques used for each task, provides the necessary details for understanding the basic algorithms for solving each problem, and includes pointers to further reading in the bibliographic notes at the end of each chapter, as well as some exercises (mostly unchanged from the first edition).
The two chapters on classification represent the more substantial changes in this edition. In particular, the section on artificial neural networks has been extended with a new section on deep learning, where readers will learn the key characteristics behind the most fashionable trend in machine learning and therefore data mining. In this section, readers will become acquainted with terms such as dropout, pretraining, and autoencoder. Other supervised learning techniques are also discussed, such as Bayesian networks and support vector machines (SVMs). Two final sections, one on ensemble methods (that is, bagging, boosting, and random forests) and one on dealing with unbalanced classification problems, round out a nice 240-page overview of data mining techniques for supervised learning.
The two chapters on association analysis remain mostly unchanged from the 2006 edition. The introductory chapter includes an interesting section on the evaluation of association patterns, as well as insightful comments on indirect associations, a technique previously proposed by the authors. The chapter on “advanced concepts” covers how to handle continuous attributes in association mining and how to deal with different kinds of patterns. Namely, the authors address how to mine sequential, graph, and infrequent patterns (even though the latter are illustrated by the joint sale of DVDs and VCRs, products from a distant past for many current students). In fact, the topics are not really that advanced, at least for current graduate students. No bibliographic reference since 2015 is included, and only half a dozen since 2010, including two chapters from a 2014 monograph on frequent pattern mining.
The two chapters on clustering techniques describe the fundamental algorithms (that is, k-means, hierarchical clustering, and DBSCAN) and include a thorough discussion on how to evaluate clustering results using both internal and external measures. The second chapter, on additional issues and algorithms, explores prototype-based, density-based, graph-based, and scalable clustering techniques. It is basically a reprint of the first edition, except for a new subsection on spectral clustering within the section on graph-based clustering techniques.
The final key data mining area is anomaly detection. In the chapter devoted to anomalies, changes with respect to the first edition are more significant. Apart from the statistical, proximity-based, density-based, and clustering-based techniques discussed in the first edition, the current edition adds three new families of anomaly detection techniques: reconstruction-based approaches, one-class classification (basically, one-class SVMs), and information-theoretic approaches.
Whereas the first edition ends with the chapter on anomaly detection, this second edition finishes with a new chapter on avoiding false discoveries. How? Through statistical testing. After a few very readable pages describing significance and hypothesis testing, separate sections are devoted to guidelines and suggestions for incorporating statistical testing into the four key data mining areas (classification, association, clustering, and anomaly detection). I would rather read such comments interspersed within the chapters dealing with each particular area and its specific algorithms, although that would require including the statistical preliminaries in the book’s introductory chapters. Although this might initially deter some potential readers, especially those with a strong preference for algorithmic details, it could sidestep the problem that many computer science (CS) students may easily overlook this final yet important chapter.
As mentioned earlier, the authors have updated an excellent textbook on data mining. If I had to recommend a single data mining book for CS students, this one is still my preferred choice.