Computing Reviews

Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors
Liu W., Vinter B. Parallel Computing 49(C): 179-193, 2015. Type: Article
Date Reviewed: 04/21/16

Sparse matrix-vector multiplication (SpMV) is a key operation in many scientific and graph applications. It is also challenging because of data compression and load balancing issues: data compression is needed to store the matrix elements efficiently, and load balancing is needed to avoid idle processing capacity. Because the sparsity structure varies widely across applications, static solutions generally do not work very well. Moreover, programming languages offer little support for expressing sparse structures, and compiler optimizations are generally very difficult due to the indirect array accesses.

A further challenge is to use graphics processing units (GPUs) for SpMV operations. GPUs are at their best when all operations take the same amount of time, a static workload partition is applied, and little synchronization is needed. All of this is at odds with the requirements of SpMV operations. One solution is to adapt the data compression scheme so that workload partitioning becomes easier. When many iterations are needed, such a conversion might pay off; however, if only a few iterations are performed, it might not be worth the effort.

The data compression technique that is most widely applied is the compressed sparse row (CSR) format. The CSR format does not naturally reveal load balancing information; therefore, adapted formats have been used, but not without flaws: for example, they are not able to detect empty rows.
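For readers unfamiliar with the format, a minimal sequential CSR SpMV kernel looks roughly as follows (my own sketch with generic names row_ptr, col_idx, and val, not the paper's notation). It shows that the only per-row work information is the difference of two row pointers, and that an empty row is visible only as two equal pointers:

```c
/* Minimal sequential CSR SpMV sketch: y = A * x.
 * The work for row i is row_ptr[i + 1] - row_ptr[i] nonzeros;
 * an empty row appears only as row_ptr[i] == row_ptr[i + 1]. */
#include <stddef.h>

void spmv_csr(size_t n_rows,
              const size_t *row_ptr,   /* length n_rows + 1 */
              const size_t *col_idx,   /* length row_ptr[n_rows] */
              const double *val,       /* length row_ptr[n_rows] */
              const double *x,
              double *y)
{
    for (size_t i = 0; i < n_rows; ++i) {
        double sum = 0.0;
        for (size_t j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
            sum += val[j] * x[col_idx[j]];
        y[i] = sum;
    }
}
```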

The authors follow a different approach and exploit the intrinsic differences between central processing units (CPUs) and GPUs; that is, they propose a heterogeneous solution. The core of their approach is to first speculatively execute the SpMV operation on the GPU and then rearrange the resulting vectors on the CPU, because CPUs are better suited than GPUs for the final irregular accesses and global synchronization. The algorithm is given in the appendix of the paper and is clearly illustrated by a simple SpMV example.
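The GPU stage is built on a segmented sum over the flat array of nonzero products. The sketch below shows only the underlying sequential primitive, not the authors' parallel, speculative variant; the names products, seg_head, and y are my own:

```c
/* Sequential segmented sum over the nonzero products of an SpMV,
 * where products[k] = val[k] * x[col_idx[k]] and seg_head[k] is 1
 * if nonzero k starts a new row (segment), 0 otherwise.
 * Assumes nnz > 0 and seg_head[0] == 1. Returns the number of
 * segments written to y (one partial sum per non-empty row). */
#include <stddef.h>

size_t segmented_sum(size_t nnz,
                     const double *products,
                     const int *seg_head,
                     double *y)
{
    size_t seg = 0;
    for (size_t k = 0; k < nnz; ++k) {
        if (seg_head[k]) {        /* start of a new segment (row) */
            y[seg] = 0.0;
            ++seg;
        }
        y[seg - 1] += products[k];
    }
    return seg;
}
```

Note that empty rows produce no segment at all in such a sum, which is exactly the kind of situation the CPU rearrangement step has to repair.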

The paper contains an extensive experimental section, using 20 sparse matrices from the University of Florida Sparse Matrix Collection and three CPU-GPU platforms from Intel, NVIDIA, and AMD, respectively. Speedups of 2.6 on average, and no obvious losses for dense systems, are reported. What remains is the setting of four tuning parameters. The W parameter (workload per thread) in particular can be critical: if it is chosen too high, performance may degrade. The authors do not provide a tuning recipe to prevent this; users have to tune their applications themselves. Last but not least, the heterogeneous processing operates in alternating mode, meaning the GPU and CPU are not used at the same time. The authors hint in their conclusion that this could be investigated further. Doing so requires a workload model to partition the work over the CPUs and GPUs (for example, see reference 55 in the paper). Given the irregular nature of the memory accesses, this is a true challenge.
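As an aside on the W parameter, here is a purely hypothetical illustration (my own, not taken from the paper) of splitting the nonzeros into tiles of W elements per thread; with a very large W there are few tiles and hence little parallelism, which is one plausible reason performance may degrade:

```c
/* Hypothetical nonzero tiling controlled by W (nonzeros per thread):
 * nnz nonzeros are split into ceil(nnz / W) tiles of at most W elements,
 * one tile per thread or work-item. Fewer tiles means less parallelism. */
#include <stddef.h>

size_t num_tiles(size_t nnz, size_t W)
{
    return (nnz + W - 1) / W;         /* ceiling division */
}

void tile_bounds(size_t tile_id, size_t W, size_t nnz,
                 size_t *begin, size_t *end)
{
    *begin = tile_id * W;
    *end   = (*begin + W < nnz) ? *begin + W : nnz;
}
```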

Overall, I liked the paper very much. It is a very valuable contribution to the field of sparse matrix computation. The experiments are extensive and convincing, and I am looking forward to a fully heterogeneous solution.

Reviewer:  H. Sips Review #: CR144345 (1607-0501)
