In high-performance computing (HPC) environments, heterogeneous systems composed of distributed clusters of multicore nodes with accelerators such as graphics processing units (GPUs) are becoming the norm. There are robust and mature programming models for distributed computing, as well as for GPU computing. However, hybrid programming models that integrate GPU computing capabilities with explicit message passing have emerged only recently. Aji et al. characterize the performance and productivity of a specific GPU-integrated message passing interface (MPI) framework, MPI-ACC, in two scientific computing applications: an epidemiology simulation and a seismology modeling application.
The primary contribution of their work is a detailed case study of two scientific computing applications that compares a basic, non-integrated MPI+GPU model with the GPU-integrated MPI-ACC framework. The authors describe the performance effects of using each cluster node’s central processing unit (CPU) concurrently with the node’s GPU, rather than using the GPU exclusively. They also evaluate the effects of various optimizations, such as data communication patterns that reduce communication overhead and data partitioning that increases concurrency and maximizes GPU memory bandwidth. They profile these applications using HPCToolkit and find that the GPU-integrated MPI framework generally outperforms the basic MPI+GPU implementations of both applications. The results also show that profiling tools such as HPCToolkit can expose problem areas that, when addressed, lead to significant performance improvements.
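To make the contrast concrete, the following sketch shows the two communication styles the study compares. It does not reproduce MPI-ACC's actual API; the "integrated" path follows the analogous CUDA-aware MPI convention, in which a device pointer is passed directly to an MPI call, and all buffer sizes and function names here are illustrative.

```c
/* Illustrative sketch only (assumes MPI + CUDA runtime are available).
 * Contrasts basic MPI+GPU staging with a GPU-integrated MPI send. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

/* Basic MPI+GPU: the application explicitly stages data through
 * host memory before handing it to MPI. */
void send_basic(const double *d_buf, size_t n, int dest)
{
    double *h_buf = malloc(n * sizeof *h_buf);
    cudaMemcpy(h_buf, d_buf, n * sizeof *h_buf, cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, (int)n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);
    free(h_buf);
}

/* GPU-integrated MPI (CUDA-aware style): the device pointer goes
 * straight to MPI, which can pipeline or stage the transfer
 * internally and overlap it with other work. */
void send_integrated(const double *d_buf, size_t n, int dest)
{
    MPI_Send((void *)d_buf, (int)n, MPI_DOUBLE, dest, /*tag=*/0,
             MPI_COMM_WORLD);
}
```

The integrated form removes the per-message staging copy from application code, which is the source of both the productivity gain and the optimization opportunities (pipelining, overlap) that the authors measure.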
This paper is thorough and well written, and should be of interest to application developers looking for a detailed analysis of how to use a GPU-integrated MPI framework to build and optimize scientific applications. The authors do not go into detail on GPU kernel implementations. Rather, they focus on the interaction between message passing with MPI and the CPU interface to the GPU. The MPI-ACC framework is the only model discussed, so it would be interesting to learn how the techniques applied in this paper translate to other GPU-integrated MPI frameworks.