Computing Reviews, the leading online review service for computing literature.

Performance analysis and optimization of MPI collective operations on multi-core clusters
Tu B., Fan J., Zhan J., Zhao X. The Journal of Supercomputing 60(1): 141-162, 2012. Type: Article

Date Reviewed: Oct 31 2012

Ten years ago, the computing clusters I worked with had four-way symmetric multiprocessor (SMP) nodes. We knew then that the number of processing elements per node would have to increase, and we wondered how this could be achieved in a scalable yet economical way. The future has arrived, and it is somewhat different from what we expected. Most current SMPs are only two-way, but each central processing unit (CPU) comes with multiple cores (typically six to eight).
Over the years, a lot of effort has been invested in optimizing Message Passing Interface (MPI) collective operations for the hierarchical topology of SMP clusters, to work around the much higher cost of sending data between nodes compared with exchanging it within a node. The results are good, but modern systems with multicore CPUs introduce another level into the memory hierarchy, which makes each node behave not unlike a cache-coherent nonuniform memory access (ccNUMA) system. On the plus side, the cores in each CPU share a large and fast level 2 (L2) cache. On the minus side, many more cores compete for the limited bandwidth available between the different CPUs.
This is where the ideas in this paper come into the picture. The authors recognize that earlier parallel computation models do not adequately account for the horizontal memory hierarchy characteristic of multicore multiprocessor computers, and that failing to maximize reuse of the shared caches may lead to new communication bottlenecks. To address this problem, they propose two new models for parallel computation: mlog_nP and 2log_{2,3}P. The paper supports these with a case study of a broadcast algorithm, which shows that the cost prediction of the latter model is more accurate than that of an earlier log₃P model.
Besides the theoretical models, the paper also presents a high-level, portable implementation of a multicore-aware broadcast operation, built on the standard MPI_Bcast. It uses a data tiling approach to minimize data exchange between CPUs and to maximize reuse of the L2 cache shared between cores in the same CPU. Experimental benchmark results show that the performance advantage of the new broadcast implementation grows as the message size increases. Although it focuses on MPI broadcast as a test case, this paper is of interest not just to the relatively narrow group of developers implementing MPI libraries. By demonstrating the performance benefits of data tiling and of explicitly maximizing cache reuse between cores of the same CPU, it is a source of good, practical optimization ideas for any parallel application developer using MPI, OpenMP, or hybrid OpenMP/MPI programming paradigms.
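To make the tiling idea concrete, here is a minimal sketch in Python. It is a toy simulation and not the authors' implementation: it makes no MPI calls, the function name tiled_hierarchical_bcast and all of its parameters are hypothetical, and it assumes block placement of ranks (consecutive ranks fill one CPU, then the next CPU of the same node, then the next node). A real version would presumably issue MPI_Bcast on per-node and per-CPU communicators. The sketch moves the message one tile at a time through three levels and counts the transfers at each level, so that each tile crosses every slow link only once before the cores of a CPU copy it from their leader.

```python
from collections import Counter


def tiled_hierarchical_bcast(message, nodes, sockets_per_node,
                             cores_per_socket, tile_size):
    """Simulate a tiled broadcast from rank 0; count tile transfers per level.

    Assumption: ranks are placed in blocks, so consecutive ranks fill a
    socket (CPU), then the next socket of the same node, then the next node.
    """
    cores_per_node = sockets_per_node * cores_per_socket
    total_ranks = nodes * cores_per_node
    buffers = {rank: bytearray(len(message)) for rank in range(total_ranks)}
    buffers[0][:] = message  # rank 0 is the root and already holds the data
    transfers = Counter()

    def copy_tile(src, dst, start, stop, level):
        # Stand-in for one point-to-point transfer of a single tile.
        buffers[dst][start:stop] = buffers[src][start:stop]
        transfers[level] += 1

    # Process the message tile by tile, so that a tile received by a socket
    # leader is handed to its sibling cores before the next tile arrives.
    for start in range(0, len(message), tile_size):
        stop = min(start + tile_size, len(message))
        for node in range(nodes):
            node_leader = node * cores_per_node
            if node_leader != 0:
                # Level 1: the root sends the tile once to each other node.
                copy_tile(0, node_leader, start, stop, "inter-node")
            for socket in range(sockets_per_node):
                socket_leader = node_leader + socket * cores_per_socket
                if socket_leader != node_leader:
                    # Level 2: one transfer per additional socket in the node.
                    copy_tile(node_leader, socket_leader, start, stop,
                              "inter-socket")
                for core in range(1, cores_per_socket):
                    # Level 3: sibling cores copy the tile from their leader.
                    copy_tile(socket_leader, socket_leader + core, start,
                              stop, "intra-socket")
    return buffers, dict(transfers)


if __name__ == "__main__":
    data = bytes(range(10))
    buffers, transfers = tiled_hierarchical_bcast(
        data, nodes=2, sockets_per_node=2, cores_per_socket=4, tile_size=4)
    print(all(bytes(buf) == data for buf in buffers.values()), transfers)
```

In this toy model the tile size is the tuning knob: choosing it small enough to fit in the shared cache is what would let the intra-CPU copies hit in cache on real hardware, although the simulation itself only counts transfers and does not model cache behavior or timing.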

Reviewer: Maciej Golebiewski	Review #: CR140638 (1302-0101)

Multiple Data Stream Architectures (Multiprocessors) (C.1.2)

Multiprocessing/Multiprogramming/Multitasking (D.4.1 ...)

Reproduction in whole or in part without permission is prohibited. Copyright 1999-2024 ThinkLoud®