Computing Reviews
Hadoop 2 quick-start guide : learn the essentials of big data computing in the Apache Hadoop 2 ecosystem
Eadline D., Addison-Wesley Professional, Old Tappan, NJ, 2016. 304 pp. Type: Book (978-0-134049-94-6)
Date Reviewed: Mar 24 2016

There is community-based clamor for adding an “easy” button to Hadoop. Hadoop 2 quick-start guide (from now on H2QSG) tries to meet that need, making the installation, care, and feeding of Hadoop as simple as possible--but not simpler. H2QSG is a well-structured, appropriately brief guide that succeeds in delivering essential information in an inspiringly concise, encompassing, well-balanced package. It skirts with poise the danger of becoming simplistic, at the cost of missing a trivialized “easy” mark; unfortunately, the popularization of a distributed operating system (OS)--which is the ultimate nature of Hadoop 2--is far from that aspirational goal, and will likely remain so for a while. Kudos to H2QSG for accepting the current state of affairs and explaining it instead of turning into a premature collection of oversimplifications.

The content of H2QSG is addressed to a technically mature audience, without any prior intimate understanding of Hadoop, that needs to deploy a working cluster and accomplish something useful with it at the proof-of-concept or prototype level. The exposition is geared toward jump-starting installation and startup while paying tribute to many important concepts, mechanisms, and tools that are key to turning a running cluster into a platform usable toward a goal. H2QSG dutifully reminds the reader regularly that this is not a complete reference; while so doing, it also provides curated lists with sources of additional information for those who, after having gained enough conceptual understanding and elementary practice, now recognize a need for better understanding of some particular area.

H2QSG can be faulted for being too partial to the Hortonworks Data Platform (HDP) Hadoop distribution. Like other complex open-source projects, Hadoop is a collection of numerous subcomponents; several commercial concerns compete in the marketplace as distributors of prepackaged and tested assemblies, each with specific marketing differentiators. H2QSG is largely focused on one distribution and fails to provide a competitive analysis of alternative offerings. For an unprejudiced audience, this is perhaps a minor flaw, but caveat emptor: homework is needed before buying into H2QSG. In any case, much (but not all) of what it preaches is of general applicability.

H2QSG is arranged into ten chapters and five appendices. Chapter 1 introduces Hadoop in the context of big data, presenting the data lake concept, comparing it and contrasting it with traditional data representations. This chapter includes historical notes and observations on the Hadoop ecosystem.

Chapter 2 provides a first explanation of Hadoop services and configuration files, with notes on preliminary resource planning. It proceeds with four distinct, independent installation recipes: a single-machine installation, a pseudo-distributed installation, a true cluster installation, and a cloud-based, tool-assisted installation. At approximately 40 pages, this is the beefiest chapter in the book.

Chapter 3 is dedicated to the Hadoop distributed file system (HDFS). It comprises introductory coverage of its design and operation, including block replication, rack awareness, high availability, federation, backup, snapshots, and network file system (NFS) compatibility. It covers essential graphical user interface (GUI)-based, command line, and programmatic access.

Chapter 4 addresses how to smoke test an installation. It shows how to run the MapReduce examples provided with the distribution and a few other meaningful benchmarks; reviews the GUI presentation of YARN--the resource manager at the heart of Hadoop 2; and provides initial steps easing the reader into the administration of Hadoop.

Chapter 5 is dedicated to the MapReduce framework. If Hadoop is a distributed OS, MapReduce is its native programming model. For Hadoop 1, it was the only available programming model; this is not true with Hadoop 2, but MapReduce remains a mainstay and this chapter covers in some detail its base concepts and mechanisms.
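For readers unfamiliar with the programming model that chapter 5 covers, the following is a minimal sketch of the map, shuffle, and reduce steps as a plain in-memory word count. It is not taken from H2QSG and does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    In a real cluster the framework does this across machines;
    here it is a single in-memory dictionary (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce step: collapse the values collected for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

The point of the sketch is only the division of labor: the map and reduce functions are supplied by the programmer, while grouping by key is the part a framework such as Hadoop takes over and distributes.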

Chapter 6 builds on chapter 5, introducing the multi-language programming paradigms of MapReduce with examples in Java, Python, C++, and more, including fundamental strategies for debugging.

Chapter 7, the second largest chapter in H2QSG, is a parade of the tools available with the Hadoop distribution, comprising Pig (a scripting tool), Hive (another scripting tool), Sqoop (a data transfer tool), Flume (a streaming tool), Oozie (a workflow manager), and HBase (a NoSQL database). By necessity, the content is deliberately brief for each tool.

Chapter 8 builds on chapter 7 and explains how Hadoop 2 and YARN afford the creation of arbitrary applications that take advantage of the resources they make available. This chapter delves into some detail on the example application (distributed shell), and for further reference provides an annotated list of other projects built on the new Hadoop 2 infrastructure.

Chapter 9 is a tour of Ambari, the Hortonworks GUI-based cluster management tool. Chapter 10 is a compendium of administration procedures; it includes descriptive coverage of the Capacity scheduler, another component of Hadoop.

In closing, appendix A covers the book’s web page and code download; B, troubleshooting flowcharts; C, a curated summary of Hadoop resources; D, installing the Hue GUI; and E, installing Spark, an heir apparent to the MapReduce programming model.

H2QSG is a competent introduction that gives an inspired overview of a complex project of growing popularity. It strikes a meaningful balance among comprehensiveness, conciseness, and detail, never losing sight of the audience’s ability to recognize the need for greater detail when appropriate, and provides external sources that can meet that need.


Reviewer:  A. Squassabia Review #: CR144259 (1606-0369)
Distributed Programming (D.1.3 ... )
 
 
Distributed Systems (C.2.4 )
 
 
Reference (A.2 )

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®