General-purpose graphics processing units (GPGPUs) are becoming increasingly popular as a means to accelerate various scientific kernels, as evidenced by their adoption in the high-performance computing community and by the integration of GPU cores into mainstream central processing units (CPUs). However, GPU performance tuning has thus far been a niche area due to the lack of tools for determining the factors that contribute to the performance of individual program components. This is in contrast to the CPU domain, where many mature tools support detailed performance analysis. This paper is an excellent first step toward closing that gap.
The authors present a performance analysis framework for GPU kernels that can attribute a kernel's execution time to different contributing factors. For example, it can separate the time spent waiting for memory accesses from the time spent computing results. Readers interested in performance modeling will find this approach instructive and novel.
The paper starts with the construction of a detailed analytic performance model of the GPU. The authors then combine statically determined metrics (such as instruction group sizes within a basic block) with dynamically determined performance metrics (such as instruction mix) to parameterize the model. They can then predict the effect of various optimizations, some based on algorithm changes and some that can be applied automatically (such as using the GPU's available shared memory). They also model parallelism in depth: for example, the authors carefully separate parallelism during memory access from parallelism during computation, taking into account both the characteristics of the application and the parameters of the underlying GPU.
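To convey the flavor of such an analytic model, the following is a minimal, hypothetical sketch, not the authors' actual model. The parameter names, the memory-parallelism cap, and the max-based overlap rule are all assumptions made for illustration; the point is only how per-warp instruction counts (a dynamic metric) and hardware parameters combine to attribute predicted cycles to computation versus memory stalls.

```python
def estimate_cycles(comp_insts, mem_insts, cycles_per_inst,
                    mem_latency, active_warps, max_mem_warps):
    """Toy analytic estimate of kernel cycles on one multiprocessor.

    comp_insts / mem_insts: dynamic per-warp instruction counts
    (the kind of instruction-mix metric measured at run time).
    mem_latency, max_mem_warps: assumed hardware parameters --
    memory access latency and how many warps' accesses the memory
    system can overlap (the memory-level-parallelism limit).
    """
    # Computation stream: every resident warp issues its compute work.
    compute_cycles = comp_insts * cycles_per_inst * active_warps
    # Memory stream: overlapping warps hide latency, up to the limit.
    mem_parallelism = min(active_warps, max_mem_warps)
    memory_cycles = mem_insts * mem_latency * active_warps / mem_parallelism
    # Simple overlap rule (an assumption): the slower stream dominates,
    # and the gap is attributed to memory stalls.
    total = max(compute_cycles, memory_cycles)
    mem_stall = max(0.0, memory_cycles - compute_cycles)
    return {"compute": compute_cycles, "memory": memory_cycles,
            "total": total, "memory_stall": mem_stall}

# Example: a memory-bound configuration -- memory cycles exceed
# compute cycles, so the model attributes the difference to stalls.
est = estimate_cycles(comp_insts=100, mem_insts=10, cycles_per_inst=4,
                      mem_latency=400, active_warps=8, max_mem_warps=4)
```

With these numbers the memory stream needs 8000 cycles against 3200 compute cycles, so the model would predict that raising memory-level parallelism (e.g., via more resident warps) helps more than reducing compute work, which is exactly the kind of optimization guidance the paper's framework provides.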
This performance analysis framework would be useful to anyone interested in optimizing the execution of GPU kernels. The paper is accessible to most readers interested in performance modeling, although the terminology is in places naturally GPU-centric and the evaluation (as presented) is limited to a single GPU.