The rapidly growing field of data mining has developed a number of distinct techniques. The earliest was cluster analysis, finding items in the database that are similar to one another, soon followed by classification, which developed methods for putting labels on data items. A third area, outlier analysis or anomaly detection, is widely used for detecting spurious items. The most recently developed methods are for frequent pattern mining. These include two broad sets of techniques that correspond to reasoning across the columns of a database and reasoning across the rows. The older of the two, association rule mining, reasons across columns and seeks to identify sets of features that are often shared by multiple entries in the database. More recently, attention has turned to sequence mining, which assumes that rows have been classified into types and reasons across rows to find repeated sequences of those types.
This multiauthor volume offers a thorough review of methods in frequent pattern mining. The first chapter introduces the field, and the last surveys example applications. The other 12 chapters focus on specific algorithms, constraints, and data types, providing the reader with a definitive snapshot of the state of the field and a rich source of pointers to the literature.
Chapter 2 offers a survey of frequent pattern mining algorithms. It helps the reader understand the combinatorial challenges of finding frequent patterns and shows the richness of work in the field. It would be even more useful if it gave the reader concise guidance on when each variant it discusses is appropriate.
The remaining chapters deal with three different challenges and extensions to data mining: computational complexity, defining interesting patterns, and extending the paradigm from within-record patterns to those that span records and other data.
The original techniques for association rules generated and then pruned candidates systematically. Chapter 3 discusses ways to grow patterns from existing examples, reducing the number of candidates that need to be explored. The combinatorial explosion is particularly problematic for long patterns, discussed in chapter 4. The explosion becomes even worse if one wants to find patterns not just of which features occur, but of which features do not occur, the subject of chapter 6. The earlier pattern mining algorithms require repeated passes through the database, and chapter 9 reviews newer methods that are applicable to streaming data. Chapter 10 discusses problems particular to extremely large datasets.
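For readers unfamiliar with the generate-and-prune approach mentioned above, the following is a minimal sketch in the style of the classic Apriori algorithm. It is not taken from the book; the function name frequent_itemsets, its parameters, and the sample baskets in the usage note are assumptions made for illustration only.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions.

    Illustrative Apriori-style sketch: generate candidates level by level,
    prune those with an infrequent subset, then count with one data pass.
    """
    rows = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for row in rows:
        for item in row:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Generate: join frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a frequent itemset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count: one pass over the data for this level.
        counts = {c: sum(1 for row in rows if c <= row) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result
```

Called on four hypothetical baskets, {bread, milk}, {bread, butter}, {bread, milk, butter}, and {milk}, with min_support=2, the sketch never counts the three-item candidate, because its subset {milk, butter} is already known to be infrequent. The repeated pass over the data at each level is the cost that the streaming methods reviewed in chapter 9 aim to avoid.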
Chapters 5, 7, and 8 deal with the question of what makes one pattern more interesting than another. The approach in chapter 5 focuses on individual patterns. Chapter 7 discusses how user-specified constraints can be used to focus pattern generation, while chapter 8 argues that the overall set of patterns selected should be chosen based on their ability to summarize the data most economically.
Chapter 11 moves from reasoning across the columns in the database to reasoning across rows in order to detect frequent sequences. Chapter 12 expands the scope yet further to include spatiotemporal data, and chapter 13 reviews work on finding frequent patterns in data that is structured as a graph rather than as a collection of records.
The editors have done a good job of focusing the individual authors’ chapters to provide reviews of their topics rather than detailed discussion of their own research. However, the volume does suffer from some common weaknesses of edited volumes. These include separate bibliographies for each chapter rather than a unified bibliography, unfortunate differences in mathematical notation, and a tendency for each author to reintroduce the field of frequent pattern mining in spite of the thorough introduction in the first chapter. But these shortcomings are minor compared to the wealth of information that the book brings together. This volume will be an essential reference for both researchers and practitioners in data mining.