Computing Reviews
Performance prediction for Apache Spark platform
Wang K., Khan M. In Proceedings of the 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems (HPCC, CSS & ICESS 2015), Aug. 24-26, 2015, 166-173. Type: Proceedings
Date Reviewed: Dec 5 2016

With the increased usage of the in-memory distributed computation framework Apache Spark, tools are needed to study, predict, and better understand the performance of a given algorithm on a specific cluster of computers. Execution time, required memory, and input/output (I/O) cost are the most important measurements. A Spark cluster consists of a master node and worker nodes, and a job executes as multiple sequential stages in which the data is split up for distributed, parallel processing. The nodes may vary in their computing and memory capacities, and their availability is subject to random uncertainty. The amount of data processed is also a critical factor for performance tuning.
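To make the stage-wise layout concrete, here is a minimal plain-Python sketch (not Spark itself; all names are illustrative) of a word-count job whose map stage runs in parallel over data partitions, with threads standing in for worker nodes, followed by a merging reduce stage:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(partition):
    """Map stage: each 'worker' counts words in its own data block."""
    return Counter(partition)

def word_count(partitions, workers=4):
    """Run the map stage in parallel across workers (threads here
    stand in for Spark worker nodes), then merge in a reduce stage."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(count_words, partitions)
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total

partitions = [["spark", "memory"], ["spark", "cluster"], ["memory"]]
print(word_count(partitions))
# Counter({'spark': 2, 'memory': 2, 'cluster': 1})
```

In real Spark the partitions, scheduling, and shuffles between stages are managed by the framework; the sketch only shows the split-process-merge shape the review describes.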

The authors explore a simulation-based performance prediction tool. The two options attempted were either to reduce the amount of data processed so that each stage in the cluster has at least one block of data to process, or to experiment with a smaller computing cluster. The results of these experiments are then extrapolated to determine how the tasks will perform in the real configuration. To reduce bias in selecting the data sample for the experiments, the available data is divided into multiple subsets, and each subset has equal probability of inclusion in the simulation run. This approach helps characterize the contribution of each stage of the computing layout, as well as abnormalities and interdependencies among connected stages.
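The sample-then-extrapolate idea can be sketched as follows. This is a plain-Python illustration under simplifying assumptions (equal-probability block selection, linear scaling of runtime with data size); the function names are hypothetical, not the paper's tool:

```python
import random
import time

def sample_blocks(blocks, fraction, seed=None):
    """Pick a random subset of data blocks; every block has equal
    probability of inclusion, which reduces selection bias."""
    rng = random.Random(seed)
    k = max(1, int(len(blocks) * fraction))
    return rng.sample(blocks, k)

def predict_runtime(blocks, fraction, run_stage):
    """Time a stage on the sampled subset, then extrapolate linearly
    to the full input size (a simplifying assumption)."""
    subset = sample_blocks(blocks, fraction)
    start = time.perf_counter()
    for block in subset:
        run_stage(block)
    elapsed = time.perf_counter() - start
    return elapsed * (len(blocks) / len(subset))

# Toy usage: estimate full-run time of a word-count stage
# from a 10% sample of 100 synthetic blocks.
blocks = [["spark", "cluster"] * 500 for _ in range(100)]
stage = lambda block: {w: block.count(w) for w in set(block)}
estimate = predict_runtime(blocks, fraction=0.1, run_stage=stage)
print(f"estimated full-run time: {estimate:.4f}s")
```

Linear extrapolation is only the simplest model; as the review notes below, some costs (notably network I/O) do not scale this way from a reduced setup.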

The applications used to evaluate the effectiveness of the suggested simulation approach were WordCount, logistic regression, k-means clustering, and PageRank. Execution time and memory are predicted better by simulation than by the other candidate methods. The I/O cost is not effectively predicted for certain applications because network I/O is reduced when working with a smaller setup than the original. Hadoop-based performance predictions cannot be reused for Spark because of the technology difference, primarily the in-memory capabilities unique to Spark. Machine learning-based approaches could not be tried because of the scarcity of training data.

Reviewer: Pragyansmita Nayak   Review #: CR144957 (1702-0130)
Design (B.5.1)
Design Styles (B.6.1)
Multiple Data Stream Architectures (Multiprocessors) (C.1.2)
Other Architecture Styles (C.1.3)
Other reviews under "Design":

Design principles for achieving high-performance submicron digital technologies
Fredkin E., Toffoli T. In Collision-based computing. London, UK: Springer-Verlag, 2002. Type: Book Chapter (Oct 15 2003)

Linear Models for Keystream Generators
Golic J. IEEE Transactions on Computers 45(1): 41-49, 1996. Type: Article (Jul 1 1997)
