Computing Reviews

Writing a performance-portable matrix multiplication
Fabeiro J., Andrade D., Fraguela B. Parallel Computing 52(C): 65-77, 2016. Type: Article
Date Reviewed: 05/26/16

This paper addresses an important challenge of parallel programming: optimizing a program for the specific characteristics of the target computer architecture. To this end, an auto-tuning approach is implemented and evaluated using matrix multiplication as an example.

Auto-tuning is a technique to automatically perform program optimizations; it is based on two core ideas:

(1) the target application is implemented in a form parameterized by so-called tuning parameters, such that different parameter values lead to semantically equivalent but differently optimized code variants; and

(2) the values of the tuning parameters are determined for the specific characteristics of a target device by performing an automated search over the parameter space (a minimal code sketch of both ideas follows).
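To make these two ideas concrete, the toy sketch below (written in plain C++ for illustration, not taken from the paper's HPL/OpenCL code) treats the tile size of a blocked matrix multiplication as a single tuning parameter and selects the best value by simply timing each candidate; all names and parameter values are illustrative assumptions.

```cpp
// A minimal, self-contained sketch of the two auto-tuning ideas above.
// The tile size TS is the tuning parameter; an exhaustive timing loop
// stands in for the search engine.  Illustrative only.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

// (1) Parameterized code variant: tiled matrix multiplication C = A * B,
//     where the tile size TS changes the optimization but not the result.
template <int TS>
void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, int n) {
  for (int ii = 0; ii < n; ii += TS)
    for (int kk = 0; kk < n; kk += TS)
      for (int jj = 0; jj < n; jj += TS)
        for (int i = ii; i < std::min(ii + TS, n); ++i)
          for (int k = kk; k < std::min(kk + TS, n); ++k)
            for (int j = jj; j < std::min(jj + TS, n); ++j)
              C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// (2) Automated search: time each candidate value and compare.
template <int TS>
double time_variant(int n) {
  std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n, 0.0f);
  auto t0 = std::chrono::steady_clock::now();
  matmul_tiled<TS>(A, B, C, n);
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
  const int n = 512;
  std::cout << "TS=16: " << time_variant<16>(n) << " s\n";
  std::cout << "TS=32: " << time_variant<32>(n) << " s\n";
  std::cout << "TS=64: " << time_variant<64>(n) << " s\n";
}
```

In the paper, the same principle is applied with many more parameters, which makes the search space far too large for the exhaustive timing loop shown here.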

The parameterized implementation of matrix multiplication is developed by the authors using the Heterogeneous Programming Library (HPL), a high-level parallel programming model on top of OpenCL. The implementation comprises 14 tuning parameters that capture optimizations such as granularity, loop unrolling factors, local memory usage, and vectorization. A search engine for determining device-optimized parameter values is implemented based on a genetic algorithm.
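The following compact sketch shows how such a genetic search over the tuning-parameter space might be organized. The parameter count, value ranges, genetic operators, and the stubbed fitness function are assumptions made here for illustration; in the actual system, the fitness of a candidate would be the measured runtime of the corresponding HPL-generated kernel on the target device.

```cpp
// A compact genetic-algorithm sketch: a genome is one value per tuning
// parameter, and lower fitness stands in for a faster measured runtime.
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

using Genome = std::vector<int>;  // one value per tuning parameter

// Stub fitness: placeholder for "build the variant and time it on the device".
double fitness(const Genome& g) {
  double score = 0.0;
  for (size_t i = 0; i < g.size(); ++i) score += (g[i] - 3) * (g[i] - 3);
  return score;  // lower is better, like a runtime
}

int main() {
  std::mt19937 rng(42);
  const int num_params = 14, pop_size = 20, generations = 30, max_val = 8;
  std::uniform_int_distribution<int> val(0, max_val);
  std::uniform_int_distribution<int> pick(0, pop_size / 2 - 1);
  std::uniform_real_distribution<double> coin(0.0, 1.0);

  auto by_fitness = [](const Genome& a, const Genome& b) {
    return fitness(a) < fitness(b);
  };

  // Random initial population of parameter vectors.
  std::vector<Genome> pop(pop_size, Genome(num_params));
  for (auto& g : pop) for (auto& v : g) v = val(rng);

  for (int gen = 0; gen < generations; ++gen) {
    // Rank by fitness (lower = faster variant).
    std::sort(pop.begin(), pop.end(), by_fitness);
    // Keep the best half; refill the rest by crossover and mutation.
    for (int i = pop_size / 2; i < pop_size; ++i) {
      const Genome& pa = pop[pick(rng)];
      const Genome& pb = pop[pick(rng)];
      for (int j = 0; j < num_params; ++j) {
        pop[i][j] = coin(rng) < 0.5 ? pa[j] : pb[j];
        if (coin(rng) < 0.05) pop[i][j] = val(rng);  // mutation
      }
    }
  }
  std::sort(pop.begin(), pop.end(), by_fitness);
  std::cout << "best fitness found: " << fitness(pop[0]) << "\n";
}
```

Keeping the best half of each generation and refilling the rest by crossover and mutation is one simple selection scheme; the paper's search engine may differ in its operators, encoding, and stopping criteria.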

The performance evaluation is conducted on four different device architectures: Intel central processing units (CPUs), NVIDIA and AMD graphics processing units (GPUs), and Intel Xeon Phi co-processors. Two reference implementations are used for comparison: clBLAS and ViennaCL. The authors show that their auto-tuning system achieves better performance than the reference implementations.

Reviewer:  Sergei Gorlatch Review #: CR144455 (1608-0587)
