Computing Reviews

Data mining and knowledge discovery handbook (2nd ed.)
Maimon O., Rokach L., Springer Publishing Company, Incorporated, New York, NY, 2010. 1285 pp. Type: Book
Date Reviewed: 11/11/11

Data mining is the process of extracting hidden patterns and developing models from large datasets. Its main goal is to construct, in a computer-aided way, human-understandable descriptions of voluminous datasets, or at least to construct a model (for example, a decision tree or neural network) that extrapolates its behavior on training data to the whole dataset. Data mining is a complex inductive, interactive, and iterative process that applies methods from different branches of science--mainly statistics, artificial intelligence, and database management.

If the data comes from databases, data mining is the core phase of the knowledge discovery in databases (KDD) process. The scope of data mining has recently been extended well beyond structured numeric data: it can now handle large volumes of unstructured and textual data (for example, data coming from the Web). Special kinds of data can be treated, too, including multimedia data, streams, time series, temporal data, and spatial data. Over the last decade, its application areas have also grown to include finance, marketing and commerce, health, industry, security, and science.

This edition treats new aspects (for instance, privacy) and new methods, like those based on swarm intelligence and multi-label classification. Existing methods, algorithms, and implementations face two main challenges: the volume of data that must be handled efficiently is growing enormously, and in some important fields that data is distributed.

The book is a comprehensive and detailed reference. Therefore, it can’t be concise--it’s almost 1,300 pages. Following the preface and an interesting introductory chapter, there are eight parts. Each part contains many chapters--altogether, there are 66 chapters, written by about 110 authors.

Chapters are sometimes loosely coupled, much like the proceedings of a research and development (R&D) conference. There are no connecting sections between chapters, nor introductions to and overviews of the different parts. Since the chapters are more or less self-contained, they very rarely refer to each other. A consequence of this is that different chapters repeat explanations for the same basic notions. However, since the explanations use different words and examples, the notions are often introduced in rich and colorful ways. Each chapter contains a long list of references for further investigation.

Part 1, “Preprocessing Methods,” covers data cleansing, handling missing attributes, reducing dimension, and detecting outliers. Part 2, “Supervised Methods,” is on classification (neural and Bayesian networks, decision trees, support vector machines, and instance-based classification) and regression. Part 3, “Unsupervised Methods,” discusses clustering, association rules, constraint-based mining, and link analysis. Part 4, “Soft Computing Methods,” covers evolutionary algorithms, reinforcement learning, neural networks, granular computing, swarm intelligence, and fuzzy logic. Part 5, “Supporting Methods,” presents chapters on statistics, logic, wavelet methods, fractal mining, visual analysis, interestingness measures, quality assessment, model comparison, and query languages. Part 6, “Advanced Methods,” includes chapters on multi-label data mining, privacy, meta-learning, mining different data types, ensemble and decomposition methods, parallel and grid-based data mining, and collaborative and organizational data mining. Part 7, “Applications,” discusses multimedia data mining, medical data, biological databases, financial data, intrusion detection, customer relationship management (CRM), and target marketing.

Finally, Part 8 introduces some interesting software tools in two chapters. The first chapter broadly introduces a few commercial products for classical mining tasks, supercomputing, and text and Web mining, including SAS Enterprise Miner, IBM Enterprise Miner, and PASW (formerly SPSS Clementine). It also introduces the following newly developed tools: Megaputer PolyAnalyst, BioDiscovery GeneSight, Avizo by Visualization Sciences Group, JMP Genomics, and SAS Text Miner. The second chapter describes Weka, a sophisticated, free, open-source machine learning workbench implemented in Java.

The book’s themes are treated mostly on a theoretical level. For relatively mature themes, the authors generally include methods and algorithms. Because of this treatment, readers who want to develop successful applications using this book need a lot of design experience--and, for many of the themes, additional research. The first three to four parts are more detailed than the rest of the book. In addition to the table of contents, a medium-sized index helps readers navigate the material.

This revised and extended second edition differs from the first in several ways. The existing chapters and references have been improved, and there are new chapters on new data types, new treatments, new aspects of data mining (for example, privacy), and new application domains.

I recommend this comprehensive book to advanced readers--including designers and architects at software companies--interested in the R&D of data mining.

Reviewer: K. Balogh
Review #: CR139584 (1204-0348)
