Computing Reviews
Mining of massive datasets
Rajaraman A., Ullman J., Cambridge University Press, New York, NY, 2011. 326 pp. Type: Book (978-1-107015-35-7)
Date Reviewed: Jul 30 2012

It has become commonplace to assert the growing importance of large datasets in modern information systems. Consequently, the demand for algorithms and methods that can deal with such data efficiently is increasing. However, there are still relatively few academic textbooks that address these issues in a cohesive manner. This book provides a timely and well-structured guide to some of the most important techniques developed in this area.

The focus of the book is on data mining (on large datasets) as opposed to machine learning. The distinction may strike the reader as somewhat arbitrary, given the degree of interaction between these two fields, but the authors justify it in terms of a focus on algorithms that can be applied directly to data. Although these include what is known in machine learning circles as “unsupervised learning,” the book draws most heavily on databases and information retrieval sources. The first two chapters cover the relevant concepts and tools from these main sources, along with preliminaries on statistical modeling and hash functions, the latter being pervasive throughout the book. The MapReduce programming model is naturally given a prominent place and is explained in great detail.

This introduction is followed by the book’s main topics, starting with a chapter on techniques for assessing the similarity of data items in large datasets. This covers the similarity and distance measures used in conventional applications, but with special emphasis on the techniques needed to render these measures applicable to large-scale data processing. This approach is nicely illustrated by the use of min-hash functions to approximate Jaccard similarity. The next chapter focuses on mining data streams, including sampling, Bloom filters, counting, and moment estimation.
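For readers unfamiliar with the idea, the following is a minimal sketch of min-hashing and is not taken from the book. It assumes that items are strings and that salted BLAKE2b digests are an acceptable stand-in for a family of random hash functions. The function names and the default of 256 hash functions are illustrative choices. The fraction of positions at which two signatures agree serves as an estimate of the Jaccard similarity of the underlying sets.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(item, seed):
    """64-bit hash of a string; each seed selects a different hash function."""
    salt = seed.to_bytes(8, "little")  # BLAKE2b accepts a salt of up to 16 bytes
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=256):
    """For each hash function, keep the minimum hash value over the set."""
    items = list(items)
    if not items:
        raise ValueError("min-hash is undefined for an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    # Two hypothetical documents sharing 50 of 150 distinct shingles.
    doc_a = {f"shingle{i}" for i in range(0, 100)}
    doc_b = {f"shingle{i}" for i in range(50, 150)}
    print("exact:   ", jaccard(doc_a, doc_b))
    print("estimate:", estimate_jaccard(minhash_signature(doc_a),
                                        minhash_signature(doc_b)))
```

The point of the technique is that each signature has a fixed length regardless of the size of the set it summarizes, so pairs of large sets can be compared without intersecting them directly. More hash functions give a tighter estimate at a proportionally higher cost.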

The text then changes direction somewhat, with a chapter on the PageRank and HITS algorithms and their applications. Chapters 6 and 7 provide impressive coverage of related topics, such as finding frequent item sets and association rules, and clustering. The final chapters address targeted advertising and recommendation systems.

Overall, the book is clearly written, remarkably free of buzzwords, and replete with useful examples. This abundance of examples and details makes for enjoyable reading while only rarely distracting from the flow of the presentation. The more advanced reader will quickly get a feel for what can be skipped without harm to comprehension. Prerequisites and background material (for example, the basics of relational algebra) are briefly reviewed in the introductory chapters, which gives the reader a convenient reference for the main topics. The material on MapReduce, however, is not used as widely in the rest of the book as one would expect. An exception is the chapter on link analysis, where parallelism plays a major practical role. One wonders why this chapter does not follow immediately after the MapReduce chapter. The other chapters are largely self-contained and may be read in any order.

Although theoretical issues are discussed where relevant, the focus of the text is clearly on practical issues. Readers interested in a more rigorous treatment of the theoretical foundations for these techniques should look elsewhere. Fortunately, each chapter contains key references to guide the more formally minded reader.

Reviewer: Saturnino Luz
Review #: CR140477 (1212-1202)
Data Mining (H.2.8 ... )
World Wide Web (WWW) (H.3.4 ... )
