This is a textbook about data mining and its application to the Web. The first part of the book covers core data mining and machine learning concepts. These include association rules, which capture correlations among data items, and sequential patterns, where the order of the items also matters. Along with the models, Liu describes key algorithms such as Apriori, generalized sequential pattern (GSP), and their variations.
The book then discusses supervised learning, where the algorithms use a set of training data to derive a classification function that is then applied to new data. This is followed by unsupervised learning algorithms that attempt to cluster data into similar subsets without prior training, and partially supervised learning algorithms, where a small labeled training sample is combined with a large set of unlabeled data.
The second part of the book relates these data mining concepts to Web mining. Beginning with search, the author covers concepts such as relevance ranking, relevance feedback, preprocessing of Web pages, inverted indexes, compression, and meta-search engines. The subject of search is followed by a discussion of link analysis, which uses hyperlinks for page evaluation and ranking. Some of the concepts covered include prestige, citation analysis, Google’s PageRank algorithm, and the hypertext-induced topic search (HITS) algorithm.
The book then describes issues around Web crawlers. Topics covered include parsing, link extraction, coverage, freshness, and different types of crawlers. The book concludes with chapters on extracting structured information, information integration, and opinion and usage mining.
Liu succeeds in helping readers appreciate the key role that data mining and machine learning play in Web applications. Most readers are familiar with search, but this book really highlights the broad role that machine learning plays when applied to such fields as data extraction and opinion mining. This is important, as it gives people a better idea of what is possible, and points to related areas where these concepts can be further applied. It also motivates the student by adding immediacy and relevance to the concepts and algorithms described.
I like the way the concepts are introduced in a stepwise manner. For example, the author starts with the Apriori algorithm, and then describes issues that motivate various refinements. The chapters also include many examples that give an intuitive sense of what the algorithms are modeling. I also appreciated the bibliographical notes at the end of each chapter. They give more context on how and when these algorithms were developed, which helps one appreciate the dynamism of the field.