Computing Reviews
Big data analytics with R and Hadoop
Prajapati V., Packt Publishing, Birmingham, UK, 2013. 238 pp. Type: Book (978-1-782163-28-2)
Date Reviewed: Sep 2 2014

This is a practical introductory book for analytics and software support teams eager to use simple, scalable, yet powerful ways to manipulate and analyze large, distributed sets of data.

From the first pages we learn that Hadoop is “an open-source Java framework for processing ... vast amounts of data on large clusters of commodity hardware,” and that R is a powerful, actively developed “open-source software package to perform statistical analysis on data” and “a programming language used by data scientist statisticians.” Thus, it is not surprising to learn that software developer groups and companies have invested in combining R and Hadoop in manageable and efficient data processing/analysis environments.

The content is well organized in a preface, seven chapters, and a reference section, which can be read more or less independently depending on the level of knowledge of the reader. Of course, this independence brings a bit of unavoidable redundancy in terms of references, examples, and notes.

There is electronic access to all the exemplified software code, which can be obtained easily not only for the electronic copy but also for the printed version of the book. This, together with a couple of downloadable virtual Hadoop and R environments, pleasantly enhances the reading experience by allowing a simple try-it-yourself approach.

The style is simple and clear, with abundant uniform resource locator (URL) resources, notes, code examples, and screenshots. To follow the examples, the reader must have some basic knowledge of R and understand the meaning of its abstract syntax, such as unlist(lapply(result, '[[', 1)).
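For readers unfamiliar with that idiom, here is a minimal sketch in base R. The `result` list is invented for illustration (the book's actual object is not shown in this review). `lapply` applies the extraction operator `[[` with index 1 to every element, and `unlist` flattens the resulting list into a vector.

```r
# Hypothetical data: a list of key/value pairs, invented for illustration.
result <- list(
  list("alpha", 10),
  list("beta", 20),
  list("gamma", 30)
)

# lapply() calls `[[`(element, 1) on each element, i.e. element[[1]],
# and returns a list holding the first component of each pair.
first_items <- lapply(result, "[[", 1)

# unlist() flattens that list into a plain character vector.
keys <- unlist(first_items)
print(keys)
```

Changing the final argument from 1 to 2 would extract the values instead of the keys.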

Chapter 1, “Getting Ready to Use R and Hadoop,” describes installing R, the RStudio programming environment, and R packages. It summarizes common R data mining techniques that are exemplified in the next chapters: regression, classification, clustering, and recommendation. It finishes by installing Hadoop and HDFS (Hadoop’s “rack-aware file-system”) and introducing MapReduce, “a programming model for processing large datasets distributed on a large cluster.”

MapReduce concepts are introduced in chapter 2, “Writing Hadoop MapReduce Programs.” Maps process partitioned input data in parallel based on keys, and are followed by shuffling/sorting and reducing (that is, aggregating) operations that summarize the results in the output. Here we also learn about some limitations of MapReduce, and find simple Java code examples together with ways to monitor and debug MapReduce jobs.
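The three phases can be imitated in a few lines of base R. This is only an in-memory sketch of the programming model, not Hadoop code and not an example from the book; the input strings and the names `splits` and `map_fn` are invented for illustration.

```r
# Toy input, invented for illustration: each element plays the role
# of one input split.
splits <- list(
  "big data with r",
  "r and hadoop",
  "hadoop and big data"
)

# Map: emit one (word, 1) pair per word in a split.
map_fn <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1L, stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(splits, map_fn))

# Shuffle/sort: group the emitted values by key
# (split() returns the groups ordered by key).
grouped <- split(mapped$value, mapped$key)

# Reduce: aggregate each key's values into a single count.
counts <- vapply(grouped, sum, integer(1))
print(counts)
```

On a real cluster the map calls would run on different nodes and the grouping would involve moving data across the network; the sketch only shows the data flow.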

In chapter 3, “Integrating R and Hadoop,” the author focuses on examples of R and Hadoop integration using packages from different vendors or public sources, such as RHIPE, RHadoop, and HadoopStreaming. The MapReduce approach is easiest to apply when the desired operation on big data is associative and commutative (for example, counting by summation). In this case, computing partial results in parallel on distributed chunks and then combining them yields the same result as a single sequential pass.
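A small base R sketch may make the point concrete. The data and the chunking are invented for illustration; the contrast with the mean is an added example, not something taken from the book.

```r
# Toy data, invented for illustration.
values <- 1:100

# Partition the data into four chunks, as if stored on four nodes.
chunks <- split(values, rep(1:4, each = 25))

# Each "node" computes its partial sum independently.
partial_sums <- vapply(chunks, sum, integer(1))

# Combining the partial sums in any order gives the same total,
# because addition is associative and commutative.
total_distributed <- sum(rev(partial_sums))
total_sequential <- sum(values)
print(c(total_distributed, total_sequential))

# Contrast: averaging per-chunk means is NOT safe when chunk sizes differ.
uneven <- split(values, rep(1:2, times = c(10, 90)))
naive_mean <- mean(vapply(uneven, mean, numeric(1)))
print(c(naive_mean, mean(values)))
```

An operation like the mean can still be computed in this style by carrying sums and counts separately and dividing at the end.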

In chapter 4, “Using Hadoop Streaming with R,” the example is related to the segmentation of web page visits by geolocation and stresses the advantages of data streaming. We find here detailed descriptions of the functions provided by the R package HadoopStreaming (hsTableReader, hsKeyValReader, hsLineReader), together with examples of how to run Hadoop and how to prepare commands and read the results.

The author describes the data analytics project life cycle using a diverse set of examples, such as categorizing web page popularity, “computing the frequency of stock market change,” and a case study about “predicting the sale price of blue book for bulldozers,” in chapter 5, “Learning Data Analytics with R and Hadoop.” The main R functions used here are glm (generalized linear model) with the Poisson regression family and randomForest.

Elements of supervised and unsupervised machine learning are introduced in chapter 6, “Understanding Big Data Analysis with Machine Learning.” Linear regression is exemplified on a matrix of random numbers, and logistic regression on R’s built-in iris flower dataset. For unsupervised machine learning, k-means clustering is used. The user-based and item-based recommendation algorithms presented are based on another publication [1].
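As a rough idea of what logistic regression on the iris data looks like in plain R (without Hadoop), consider the sketch below. The review does not state which species or predictors the book uses, so the binary target `is_virginica` and the single predictor `Sepal.Length` are assumptions made for illustration.

```r
# Built-in dataset; the binary target below is an assumed choice.
iris_df <- iris
iris_df$is_virginica <- as.integer(iris_df$Species == "virginica")

# Logistic regression is glm() with the binomial family (logit link by default).
fit <- glm(is_virginica ~ Sepal.Length, data = iris_df, family = binomial)

# Predicted probabilities, then a simple 0.5 threshold for class labels.
prob <- predict(fit, type = "response")
pred <- as.integer(prob > 0.5)
accuracy <- mean(pred == iris_df$is_virginica)

print(coef(fit))
print(accuracy)
```

The accuracy here is measured on the training data itself, so it only indicates that the model fits, not how well it would generalize.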

Chapter 7, “Importing and Exporting Data from Various DBs,” briefly shows how to use different types of external data in R. Examples include manipulating different file formats (.csv, .txt, .RDATA, .rda, and .xlsx) and accessing SQL-based databases, such as MySQL, SQLite, and PostgreSQL, as well as NoSQL or Hadoop-based data stores, such as MongoDB, Hive (“a Hadoop-based data warehousing-like framework”), and HBase (“a distributed big data store for Hadoop”).

The book ends with a rich list of resources and an index.

As always when the preponderant focus is the application as opposed to theory, one peril is unavoidable: imperfect reproducibility of examples in practice (because of obsolete packages and versions or differences in syntax and parameters), and this book is no exception.

Overall, the well-written content suggests that Hadoop and MapReduce, this simple and structured type of divide-and-conquer approach in big data analysis, might continue to have a place in the toolbox of any data scientist, especially in combination with R.


Reviewer: Adrian Pasculescu
Review #: CR142682 (1412-1021)
[1] Owen, S.; Anil, R.; Dunning, T.; Friedman, E. Mahout in action. Manning, Shelter Island, NY, 2012.
Categories: Content Analysis And Indexing (H.3.1); Distributed Systems (C.2.4); Mathematical Software (G.4)
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®