This paper addresses an important challenge of parallel programming: optimizing a program for the specific characteristics of the target computer architecture. To this end, an auto-tuning approach is implemented and evaluated using matrix multiplication as an example.
Auto-tuning is a technique to automatically perform program optimizations; it is based on two core ideas:
- (1) the target application is implemented in a parameterized form with respect to so-called tuning parameters, such that different parameter values lead to semantically equivalent but differently optimized code variants; and
- (2) the values of the tuning parameters are determined for the specific characteristics of a device by an automated search of the parameter space.
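The first idea can be illustrated with a minimal sketch: a matrix multiplication whose loop structure is parameterized by a tile size. The tile size here is a hypothetical stand-in for the paper's tuning parameters; every value produces the same result but a differently structured computation.

```python
# Sketch of idea (1): a matrix multiplication parameterized by a tuning
# parameter (here, the tile size). Different values of `tile` change the
# loop structure and memory access pattern, not the mathematical result.

def matmul_tiled(A, B, tile=2):
    """Return C = A x B with all three loops blocked by `tile`."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, m, tile):
            for kk in range(0, k, tile):
                # Accumulate one tile of C.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, m)):
                        s = C[i][j]
                        for p in range(kk, min(kk + tile, k)):
                            s += A[i][p] * B[p][j]
                        C[i][j] = s
    return C
```

Since all tile sizes compute the same product, an auto-tuner is free to pick whichever value runs fastest on the target device.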
The authors develop the parameterized implementation of matrix multiplication using the Heterogeneous Programming Library (HPL), a high-level parallel programming model on top of OpenCL. The implementation comprises 14 tuning parameters that capture optimizations such as work granularity, loop unrolling factors, local memory usage, and vectorization. A search engine based on a genetic algorithm determines device-optimized parameter values.
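The genetic-algorithm search can be sketched as follows. The parameter space, the parameter names, and the cost function below are hypothetical toy stand-ins (the paper's engine tunes 14 parameters by measuring kernel runtimes on the device); the sketch only shows the selection, crossover, and mutation loop such a search engine performs.

```python
import random

# Sketch of idea (2): a genetic-algorithm search over a toy
# tuning-parameter space. In a real tuner, cost() would compile the
# parameterized kernel and measure its runtime on the target device.

SPACE = {"tile": [1, 2, 4, 8, 16], "unroll": [1, 2, 4], "vector": [1, 2, 4, 8]}

def cost(cfg):
    # Hypothetical stand-in for a runtime measurement; lower is better.
    return abs(cfg["tile"] - 8) + abs(cfg["unroll"] - 4) + abs(cfg["vector"] - 4)

def random_cfg():
    return {k: random.choice(v) for k, v in SPACE.items()}

def crossover(a, b):
    # Each parameter is inherited from one of the two parents.
    return {k: random.choice((a[k], b[k])) for k in SPACE}

def mutate(cfg, rate=0.2):
    # Occasionally re-randomize a parameter to keep exploring the space.
    return {k: (random.choice(SPACE[k]) if random.random() < rate else v)
            for k, v in cfg.items()}

def tune(generations=30, pop_size=12, seed=0):
    random.seed(seed)
    pop = [random_cfg() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return min(pop, key=cost)
```

Because the best configurations survive each generation unchanged, the search cost never increases, and on small spaces it typically converges in a few dozen generations.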
The performance evaluation covers four device architectures: Intel central processing units (CPUs), NVIDIA and AMD graphics processing units (GPUs), and Intel Xeon Phi co-processors. Two reference implementations, clBLAS and ViennaCL, serve for comparison. The authors show that their auto-tuning system achieves better performance than both reference implementations.