Computing Reviews
Hadoop application architectures
Grover M., Malaska T., Seidman J., Shapira G., O’Reilly Media, Inc., Sebastopol, CA, 2015. 400 pp. Type: Book (978-1-491900-08-6)
Date Reviewed: Apr 22 2016

This is a wonderful handbook for Hadoop data engineers. Hadoop has become a platform for data science, and it has proven capable of scaling to massive volumes of data. While more and more enterprises adopt Hadoop as their data processing platform, understanding Hadoop remains a challenge for many data engineers. This book provides the best guidelines for Hadoop engineers to sharpen their data processing skills. Note that it assumes readers have already been exposed to traditional databases (for example, SQL) and to Hadoop, and have fundamental knowledge of at least Flume, HBase, Pig, and Hive, which are Hadoop components. The book is organized into ten chapters. Chapters 1 to 7 are dedicated to the underlying principles of Hadoop architectures, and chapters 8 to 10 present case studies.

Chapter 1 introduces Hadoop to readers. It briefly mentions the background (for example, Java, SQL, Flume, HBase, Hive) that the book requires. The chapter then covers data storage and data modeling in Hadoop, including file formats, data organization, and metadata management.

Chapter 2 begins the core material on data processing in Hadoop. When data arrives, determining where to place it is an important step. Further, a data file that is retrieved or processed may need to be moved to another location. Hence, data movement is a frequent task in data engineering. This chapter presents the data movement tools in Hadoop and explains the principles of managing data movement.

Chapter 3 focuses on data processing in Hadoop. Although the details of data processing depend on the application, there are common underlying operations such as splitting, joining, filtering, cascading, and abstracting data. Hadoop offers MapReduce and Spark to let software engineers handle data. This chapter not only introduces the underlying data processing tools, but also presents various coding examples so that readers understand the tools inside and out.
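For readers unfamiliar with the MapReduce model mentioned above, the following is a minimal sketch in plain Python, not taken from the book and not using Hadoop itself. It imitates the map, shuffle, and reduce phases on an in-memory word count; the function names and sample input are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    # Run the three phases in sequence on an in-memory list of lines.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical sample input.
counts = word_count(["Hadoop stores data", "Spark processes data"])
```

In a real cluster the map and reduce calls would run in parallel on different machines; here they run sequentially only to show the data flow.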

Chapter 4 introduces data processing patterns. There are many data analysis tasks that a data scientist performs from time to time. Examples include duplicate removal, windowing analysis, and time series analysis. The authors explain the concepts behind these analyses and present code along with the examples.
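As a rough illustration of two of the patterns named above, here is a minimal sketch in plain Python. It is not the book's code; the record layout (`id`, `ts`, `value`) and the function names are assumptions made for this example. The first function removes duplicates by keeping the most recent record per key, and the second computes a trailing moving average, one simple form of windowing analysis.

```python
def deduplicate(records, key):
    """Keep only the latest record for each key, judged by its 'ts' field."""
    latest = {}
    for rec in records:
        k = rec[key]
        if k not in latest or rec["ts"] > latest[k]["ts"]:
            latest[k] = rec
    # Return survivors in timestamp order for a stable result.
    return sorted(latest.values(), key=lambda r: r["ts"])

def moving_average(values, window):
    """Trailing moving average; early positions use the values seen so far."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= values[i - window]
        out.append(running / min(i + 1, window))
    return out

# Hypothetical sample data.
events = [
    {"id": "a", "ts": 1, "value": 10},
    {"id": "b", "ts": 2, "value": 20},
    {"id": "a", "ts": 3, "value": 30},
]
unique = deduplicate(events, "id")
averages = moving_average([10, 20, 30, 40], 2)
```

On Hadoop, the same logic would typically be expressed as a group-by-key step followed by a per-group reduction, so that each key can be handled independently.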

Chapter 5 presents graph processing in Hadoop. Graphs are important in computer science because they can represent both data storage and data processing. An important merit of graph-based processing is that a graph can be visualized. Hadoop also comes with a graph tool. This chapter shows various examples of how to use Hadoop's built-in tools to perform graph-based data analysis.

Chapter 6 presents orchestration. A data processing task consists of many steps, which together form a workflow. Applications range from business intelligence to scientific studies. There are multiple orchestration frameworks, such as Oozie, Azkaban, Luigi, and Chronos. For simplicity, the authors chose Oozie to illustrate how to implement workflows on Hadoop.
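The core idea of a workflow, steps that must run in dependency order, can be sketched in plain Python. This is not Oozie, which defines workflows in XML, and it is not the book's example. The step names and actions are hypothetical, and the sketch assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "aggregate": {"clean"},
    "export": {"aggregate"},
}

def run_workflow(steps, actions):
    """Run every step after all of its dependencies have completed."""
    order = list(TopologicalSorter(steps).static_order())
    results = []
    for name in order:
        results.append(actions[name]())
    return order, results

# Placeholder actions that only report completion.
actions = {name: (lambda n=name: f"{n} done") for name in workflow}
order, results = run_workflow(workflow, actions)
```

An orchestration framework adds what this sketch leaves out, such as scheduling, retries, and monitoring of each step.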

Chapter 7 presents real-time data processing. Nowadays, more and more applications require real-time or near real-time data processing, such as email exchanges, data streaming, and stock trading. This chapter extends the concepts taught in previous chapters and shows how data processing and workflow are adapted to handle low-latency or real-time analysis.

Chapters 8 to 10 present three case studies built on the Hadoop platform. Chapter 8 illustrates a clickstream analysis. Chapter 9 shows how to construct a Hadoop platform for fraud detection. Finally, chapter 10 gives an example of a data warehouse based on Hadoop.

In general, the book is well written. The authors provide good explanations. Various examples, diagrams, and sample code walk readers through the concepts. Each chapter ends with useful links to more advanced material. This book will help elevate a Hadoop engineer to a more sophisticated level.


Reviewer: Hsun-Hsien Chang
Review #: CR144348 (1607-0457)
Distributed Programming (D.1.3 ... )
Data Models (H.2.1 ... )
Data Storage Representations (E.2 )
Other reviews under "Distributed Programming":
Topics in distributed algorithms. Tel G., Cambridge University Press, New York, NY, 1991. Type: Book (9780521403764). Date reviewed: Sep 1 1992
Interacting processes. Francez N., Forman I., ACM Press/Addison-Wesley Publ. Co., New York, NY, 1996. Type: Book (9780201565287). Date reviewed: Jan 1 1997
Verification of sequential and concurrent programs (2nd ed.). Apt K. (ed), Olderog E., Springer-Verlag New York, Inc., Secaucus, NJ, 1997. Type: Book (9780387948966). Date reviewed: Feb 1 1998
Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®