Sparse matrix-vector multiplication is a critical component of many applications, including physical modeling, linear solvers, and life science analyses. Improved sparse matrix-vector multiplication performance lets applications that rely on these operations run faster and/or handle higher data resolution. Dang and Schmidt offer a new format for sparse matrices, as well as a sparse matrix-vector multiplication implementation, that takes advantage of current graphical processing unit (GPU) architectures for improved performance over existing formats and implementations.
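To ground the discussion, the basic operation can be sketched over the classic coordinate (COO) format, in which a sparse matrix is stored as parallel arrays of row indices, column indices, and values. This is a generic illustration of sparse matrix-vector multiplication, not the authors' code:

```python
def spmv_coo(rows, cols, vals, x, num_rows):
    """Compute y = A @ x for a sparse matrix A stored in COO form:
    rows[i], cols[i], vals[i] describe the i-th nonzero entry."""
    y = [0.0] * num_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]  # accumulate each nonzero's contribution
    return y

# Example: 3x3 matrix with nonzeros A[0][0]=1, A[1][2]=2, A[2][1]=3
rows = [0, 1, 2]
cols = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
x = [1.0, 2.0, 3.0]
print(spmv_coo(rows, cols, vals, x, 3))  # [1.0, 6.0, 6.0]
```

The accumulation into `y[r]` is the step that becomes an atomic update when many GPU threads process nonzeros of the same row concurrently.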
The contributions of their work include the sliced coordinate (SCOO) sparse matrix format and a CUDA-enabled parallel implementation of a sparse matrix-vector multiplication algorithm for NVIDIA GPUs. Their implementation is highly tuned for NVIDIA’s GPU architecture, taking into consideration memory alignment and coalescing, cache hit rates, utilization of texture memory, multi-GPU support, asynchronous stream support, and usage of optimized atomic operations for single-precision data.
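A rough sketch of the sliced idea follows, under the assumption that SCOO partitions the matrix into fixed-height row slices, each stored as COO triples ordered to improve locality of accesses to the dense vector; the paper's exact layout, sort order, and slice sizing may differ:

```python
def to_sliced_coo(rows, cols, vals, slice_height):
    """Partition COO entries into row slices of fixed height.

    Each slice keeps its own (row, col, val) triples, sorted here by
    column index so that reads of the dense vector x stay localized --
    the kind of locality a GPU texture cache can exploit. This is a
    hypothetical sketch of the sliced-coordinate concept only.
    """
    slices = {}
    for r, c, v in zip(rows, cols, vals):
        slices.setdefault(r // slice_height, []).append((r, c, v))
    return {s: sorted(entries, key=lambda t: t[1])
            for s, entries in slices.items()}

def spmv_scoo(sliced, x, num_rows):
    """Compute y = A @ x over the sliced representation. Slices are
    independent, so on a GPU each slice could map to a thread block,
    with atomic adds combining partial sums within a slice."""
    y = [0.0] * num_rows
    for entries in sliced.values():
        for r, c, v in entries:
            y[r] += v * x[c]
    return y
```

A usage example: `spmv_scoo(to_sliced_coo(rows, cols, vals, 2), x, 3)` produces the same result as a plain COO multiplication, since slicing only reorders the accumulation.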
The SCOO format and implementation are tested against several existing formats, using NVIDIA's Cusp library on both Fermi- and Kepler-class NVIDIA GPUs, and Intel's Math Kernel Library (MKL) on a Sandy Bridge central processing unit (CPU). Tests on a set of structured and unstructured sparse matrices show moderate performance improvement: an average speedup of 2.39 to 5.25 over the GPU implementations and a speedup factor of 5.5 to 18 over the MKL CPU implementation. The authors concede an important point: their format and implementation perform considerably better for single-precision floating-point values than for double precision, primarily due to non-native hardware support for 64-bit floating-point values. However, as hardware support for this data type improves, the proposed implementation should accommodate the change well.
Practitioners and domain experts may find this paper too high level to directly incorporate the SCOO format and parallel implementation into their own applications, and there is no mention of packaging this work as a callable library. However, the source code is publicly available, so developers craving the extra performance Dang and Schmidt demonstrate could incorporate this work into their own applications.