Traditionally, parallel computation has focused on high-performance computing (HPC), and many specialized platforms and systems have been built for that purpose. In recent years, a new parallel computing model for handling big data problems has emerged and is becoming increasingly important: MapReduce and its Hadoop implementation.
To address the rapidly growing demand for running Hadoop, it is desirable to reuse the specialized HPC platforms and systems that have already been built. This paper develops a design for a Hadoop-over-MPI approach that runs Hadoop workloads on traditional HPC platforms, and implements a MapReduce benchmark application, WordCount, based on that design. The paper then compares the performance of this Hadoop-over-MPI implementation with that of the standard Hadoop implementation of WordCount.
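To make the comparison concrete, here is a minimal sketch (not the paper's code) of how a WordCount-style computation can be expressed directly with MPI: each rank performs the "map" step by counting words in a local input shard, and a gather to rank 0 stands in for the shuffle and reduce steps. The shard file names and the gather-to-root reduction are illustrative assumptions, not the design described in the paper.

    // Hypothetical MPI WordCount sketch; each rank reads "shard_<rank>.txt".
    #include <mpi.h>
    #include <fstream>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // "Map" phase: count words in this rank's local shard.
        std::map<std::string, long> local;
        std::ifstream in("shard_" + std::to_string(rank) + ".txt");
        std::string word;
        while (in >> word) ++local[word];

        // "Shuffle/reduce" phase (simplified): serialize local counts
        // and gather them at rank 0, which merges the global result.
        std::ostringstream out;
        for (const auto& kv : local) out << kv.first << ' ' << kv.second << '\n';
        std::string payload = out.str();
        int len = static_cast<int>(payload.size());

        std::vector<int> lens(size), displs(size);
        MPI_Gather(&len, 1, MPI_INT, lens.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

        std::string all;
        if (rank == 0) {
            int total = 0;
            for (int i = 0; i < size; ++i) { displs[i] = total; total += lens[i]; }
            all.resize(total);
        }
        MPI_Gatherv(&payload[0], len, MPI_CHAR,
                    rank == 0 ? &all[0] : nullptr, lens.data(), displs.data(),
                    MPI_CHAR, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            std::map<std::string, long> global;
            std::istringstream merged(all);
            std::string w;
            long c;
            while (merged >> w >> c) global[w] += c;
            for (const auto& kv : global)
                std::cout << kv.first << '\t' << kv.second << '\n';
        }
        MPI_Finalize();
        return 0;
    }

Compiled with mpicxx and launched with mpirun, each rank would read its own shard file. A realistic Hadoop-over-MPI design would distribute the reduce step across ranks (for example, with MPI_Alltoallv) rather than funneling everything to rank 0, but even this sketch shows how little machinery the word-count pattern needs once MPI handles the communication.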
The paper demonstrates that the Hadoop-over-MPI implementation performs much better than the standard Hadoop implementation. According to Cheptsov, “the nominal performance of MPI is indeed higher than the [performance] of Hadoop, [and ...] the poor performance of Hadoop could be caused by [the] small size of [the] particular experiment setup.”
Beyond the evaluation, the paper also gives a good introduction to the MapReduce and MPI technologies. It is a good read for those who are interested in how MapReduce and MPI work, how they differ, and how their performance compares. Any engineer, researcher, or scientist in information technology will find this paper interesting.