Ten years ago, the computing clusters I worked with had four-way symmetric multiprocessor (SMP) nodes. We knew then that, in the future, the number of processing elements per node would have to increase, and we wondered how this would be achieved in a scalable yet economical way. The future has arrived, and it is somewhat different from what we were expecting. Most current SMPs are only two-way, but each central processing unit (CPU) comes with multiple cores (typically six to eight).
Over the years, a lot of effort has been invested in optimizing message passing interface (MPI) collective operations for the hierarchical topology of SMP clusters, to work around the much higher cost of sending data between nodes compared with exchanging it within a node. The results are good, but modern systems with multicore CPUs introduce another level into the memory hierarchy, which makes each node behave not unlike a cache-coherent nonuniform memory access (ccNUMA) system. On the plus side, the cores in each CPU share a large and fast level 2 (L2) cache. On the minus side, many more cores compete for the limited bandwidth available between the different CPUs.
This is where the ideas in this paper come into the picture. The authors recognize that earlier parallel computation models do not adequately take into account the horizontal memory hierarchy characteristic of multicore multiprocessor computers, and that a failure to maximize the reuse of the shared caches may lead to new communication bottlenecks. To address this problem, two new models for parallel computation are proposed: mlognP and 2log{2,3}P. The paper supports these with a case study of a broadcast algorithm, which shows that the cost prediction of the latter model is more accurate than that of an earlier log3P model.
Besides the theoretical models, the paper also presents a high-level, portable implementation (built on the standard MPI_Bcast) of a multicore-aware broadcast operation that uses a data tiling approach to minimize data exchange between CPUs and to maximize reuse of the L2 cache shared by cores in the same CPU. Experimental benchmark results show that as the message size increases, so does the performance advantage of the new broadcast implementation.
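To make the approach concrete, here is a minimal sketch of a segmented broadcast built on top of the standard MPI_Bcast, in the spirit of the technique described above; the tile size and the per-tile processing hook are illustrative assumptions, not the authors' actual implementation.

#include <mpi.h>
#include <stdlib.h>

#define TILE_BYTES (256 * 1024)  /* assumed to fit in a shared L2 cache */

/* Broadcast count bytes from root in L2-sized tiles; each tile can then be
   consumed while it is still cache resident. */
static void tiled_bcast(char *buf, size_t count, int root, MPI_Comm comm)
{
    for (size_t off = 0; off < count; off += TILE_BYTES) {
        size_t len = (count - off < TILE_BYTES) ? (count - off) : TILE_BYTES;
        MPI_Bcast(buf + off, (int)len, MPI_BYTE, root, comm);
        /* process_tile(buf + off, len);  -- hypothetical hook: reuse the
           tile while it is hot in the shared L2 */
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t n = 8 * 1024 * 1024;            /* an 8 MB message */
    char *buf = malloc(n);
    if (rank == 0)
        for (size_t i = 0; i < n; i++)
            buf[i] = (char)i;              /* root fills the payload */

    tiled_bcast(buf, n, 0, MPI_COMM_WORLD);

    free(buf);
    MPI_Finalize();
    return 0;
}

Splitting the message into segments keeps each piece small enough for the cores of one CPU to consume it from the shared cache before moving on, rather than streaming the whole message through main memory.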
Despite focusing on MPI broadcast as a test case, this paper is of interest not just to the relatively narrow group of developers implementing MPI libraries. In fact, by demonstrating the performance benefits of data tiling and explicitly maximizing cache reuse between cores of the same CPU, this paper is a source of good, practical optimization ideas for any parallel application developer using MPI, OpenMP, or hybrid OpenMP/MPI programming paradigms.
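The same tiling idea carries over to shared-memory code. The short OpenMP sketch below is a hypothetical example, not taken from the paper: it processes a large array one cache-sized tile at a time, so that the threads sharing an L2 cache sweep the same tile together before advancing; the tile size is an assumed value.

#include <stddef.h>

#define TILE_ELEMS (32 * 1024)  /* assumed: a tile of doubles that fits in a shared L2 */

void scale_tiled(double *a, size_t n, double s)
{
    #pragma omp parallel
    {
        for (size_t t = 0; t < n; t += TILE_ELEMS) {
            size_t end = (t + TILE_ELEMS < n) ? t + TILE_ELEMS : n;
            /* All threads divide the current tile among themselves; the
               implicit barrier at the end of the worksharing loop keeps them
               on the same cache-resident tile before moving to the next. */
            #pragma omp for schedule(static)
            for (size_t i = t; i < end; i++)
                a[i] *= s;
        }
    }
}

Whether in MPI or OpenMP, the pattern is the same: size the working set to the cache shared by the cores of one CPU, and let those cores finish with it before touching the next chunk.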