About ten years ago, there was a movement away from the expensive, dedicated storage area network (SAN) architectures available at the time and toward less expensive storage systems built on transmission control protocol/Internet protocol (TCP/IP). This led to the use of Internet small computer system interface (iSCSI) as the transport, with serial advanced technology attachment (SATA) disks as the storage medium. Although this greatly reduced the cost of SANs, migrating storage to commodity systems came with the recognition that servers and disks would periodically fail.
The Hadoop file system and architecture were being developed around this time. Today, Hadoop is widely used because it is inexpensive and built to withstand node and disk failures. However, planning and installing Hadoop can be a daunting task. Although the software is free, the available documentation on setting up the file system is poorly written and often skips important details. This new book is a welcome resource for planning and installing Hadoop.
Early in the book, the author endorses the philosophy behind Hadoop and its implementation:
Hadoop was specifically designed to run on a large number of completely standalone commodity systems. Attempting to shoehorn it back into traditional enterprise storage and virtualization systems only results in significantly higher cost for reduced performance. Some percentage of readers will build clusters out of these components and they will work, but they will not be optimal. Exotic deployments of Hadoop usually end in exotic results, and not in a good way. You have been sufficiently warned.
This book focuses on two Hadoop distributions: Apache and Cloudera.
Installing Apache Hadoop from tarballs is a difficult task. The author does not recommend it, and instead encourages readers to install Apache Hadoop from packages, such as RPM (Red Hat's packaging format) or deb (Debian's packaging format), whenever possible. This approach greatly expedites installation.
Cloudera’s Distribution Including Apache Hadoop (CDH) is the second distribution. This version of Hadoop is easier to install than Apache.
The book does not cover other Hadoop distributions. For example, the Open Science Grid (OSG) uses a heavily modified version of Hadoop that relies on Filesystem in Userspace (FUSE), which degrades Hadoop's performance. This probably explains why distributions like it are not discussed in the book.
Once Hadoop is installed on a cluster, the next major undertaking is configuration, and readers will find this book a useful reference for it. More than 300 parameters control the software's behavior, and the author does an excellent job of describing them. The remainder of the book addresses resource management, cluster maintenance, and troubleshooting.
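To give a flavor of the kind of parameters involved, here is a minimal sketch of a core-site.xml file; the hostname, port, and values are illustrative assumptions, not taken from the book:

```xml
<!-- Hypothetical minimal core-site.xml; the hostname and values
     are illustrative placeholders. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
    <description>URI of the default file system (the HDFS NameNode).</description>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value>
    <description>Read/write buffer size, in bytes.</description>
  </property>
</configuration>
```

Each of the several hundred parameters follows this same name/value property structure, which is why a reference that explains what each one does is so valuable.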
Overall, this is an excellent book to have on hand when planning, installing, and configuring a Hadoop file system and cluster. This book is useful even if you are not building from one of the two major distributions of Hadoop (Apache and Cloudera). Note, however, that this book does not cover using the Hadoop file system and cluster.