Computing Reviews
Speculative segmented sum for sparse matrix-vector multiplication on heterogeneous processors
Liu W., Vinter B. Parallel Computing 49(C): 179-193, 2015. Type: Article
Date Reviewed: Apr 21 2016

Sparse matrix-vector multiplication (SpMV) is a key operation in many scientific and graph applications. It is also a challenging one, because of data compression and load balancing issues: data compression is needed to store the matrix elements efficiently, and load balancing is needed to avoid idle processing capacity. Because the sparsity structure varies widely across applications, static solutions generally do not work well. Moreover, programming languages offer little support for expressing sparse structures, and compiler optimization is very difficult because of the indirect array accesses.

Using graphics processing units (GPUs) for SpMV operations poses a completely new set of challenges. GPUs are at their best when all operations take the same amount of time, a static workload partition is applied, and little synchronization is needed; all of this seems contradictory to the requirements of SpMV. One solution is to adapt the data compression scheme so that workload partitioning becomes easier. When many iterations are needed, this conversion can pay off; when only a few iterations are done, it might not be worth the effort.

The most widely applied data compression technique is the compressed sparse row (CSR) format. CSR does not naturally reveal load balancing information, so adapted formats have been used, but not without flaws; for example, some are unable to detect empty rows.
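
To make the load-balancing issue concrete, here is a minimal serial sketch of CSR-based SpMV in C; the function name and signature are illustrative, not taken from the paper. Each row owns a variable-length slice of the nonzero arrays, so the per-row work is uneven, and the read of x through col_idx is the indirect access that frustrates compiler optimization:

#include <stddef.h>

/* Minimal CSR SpMV sketch: y = A * x.
 * row_ptr has n_rows + 1 entries; row i owns the nonzeros in
 * [row_ptr[i], row_ptr[i+1]). An empty row contributes nothing,
 * and the per-row trip count varies arbitrarily, which is exactly
 * the load-balancing problem described above. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,
              const size_t *col_idx,
              const double *val,
              const double *x,
              double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]]; /* indirect access into x */
        y[i] = sum;
    }
}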

The authors follow a different approach and exploit the intrinsic differences between central processing units (CPUs) and GPUs; that is, they propose a heterogeneous solution. The core of their approach is to first speculatively execute the SpMV operation on the GPU and then rearrange the resulting vectors on the CPU, because CPUs are better suited than GPUs to the final irregular accesses and global synchronization. The algorithm is given in the appendix of the paper and is very clearly illustrated by a simple SpMV example.
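
The segmented sum underlying the method can be sketched serially. Assuming the row products have been flattened into one array with a head flag marking the first element of each row segment (a common segmented-sum layout; the authors' GPU kernel differs in its details), each row result is a per-segment sum:

#include <stdbool.h>
#include <stddef.h>

/* Serial segmented-sum sketch. For SpMV, products[k] would be
 * val[k] * x[col_idx[k]], and a head flag marks the first nonzero
 * of each row, so every segment sum is one entry of the result
 * vector. Layout and names are illustrative only. */
void segmented_sum(size_t n,
                   const double *products,
                   const bool *is_head, /* true at each segment's first element */
                   double *out)         /* one slot per segment, in order */
{
    size_t seg = 0;
    double running = 0.0;
    for (size_t k = 0; k < n; k++) {
        if (is_head[k] && k > 0) {
            out[seg++] = running; /* close the previous segment */
            running = 0.0;
        }
        running += products[k];
    }
    if (n > 0)
        out[seg] = running; /* flush the final segment */
}

Partitioning this flat array evenly over threads balances the work regardless of row lengths, but a partition boundary may split a segment, and empty rows produce no segment at all; resolving split segments and scattering the sums back to their row positions is, roughly, the irregular fix-up work the authors assign to the CPU.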

The paper contains an extensive experiment section, using 20 sparse matrix structures from the University of Florida Sparse Matrix Collection and three CPU-GPU platforms from Intel, NVIDIA, and AMD, respectively. Speedups of 2.6 on average, with no obvious losses for dense systems, are reported. What remains is the setting of four tuning parameters. The W parameter (workload per thread) in particular can be critical: if it is chosen too high, performance may degrade. The authors do not provide a tuning recipe to avoid this; users have to tune their applications themselves. Last but not least, the heterogeneous processing operates in alternating mode, meaning the GPU and CPU are not used at the same time. The authors hint in their conclusion that this could be investigated further. Doing so requires a workload model to partition the work between the CPU and GPU (for example, see reference 55 in the paper). Given the irregular nature of the memory accesses, this is a true challenge.
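
To see why W is critical, suppose (as the description above suggests; the paper defines the exact semantics) that W fixes the number of nonzeros each thread processes. The launch width then follows directly, and an overly large W simply leaves too few threads to fill the GPU:

#include <stddef.h>

/* Illustrative sketch only, assuming W >= 1: with W nonzeros per
 * thread, the thread count and each thread's slice of the nonzero
 * array follow directly. A large W shrinks the thread count and
 * can underpopulate the GPU, which is one way performance degrades
 * when W is chosen too high. */
size_t num_threads_for(size_t nnz, size_t W)
{
    return (nnz + W - 1) / W; /* ceiling division */
}

void thread_slice(size_t tid, size_t nnz, size_t W,
                  size_t *begin, size_t *end)
{
    *begin = tid * W;
    *end = (*begin + W < nnz) ? *begin + W : nnz;
}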

Overall, I liked the paper very much. It is a valuable contribution to the field of sparse matrix computation. The experiments are extensive and convincing, and I look forward to a fully heterogeneous solution.

Reviewer: H. Sips | Review #: CR144345 (1607-0501)
Parallel Processors (C.1.2 ... )
 
Other reviews under "Parallel Processors":

Spending your free time
Gelernter D. (ed), Philbin J. BYTE 15(5): 213-ff, 1990. Type: Article (reviewed Apr 1 1992)

Higher speed transputer communication using shared memory
Boianov L., Knowles A. Microprocessors & Microsystems 15(2): 67-72, 1991. Type: Article (reviewed Jun 1 1992)

On stability and performance of parallel processing systems
Bambos N., Walrand J. (ed) Journal of the ACM 38(2): 429-452, 1991. Type: Article (reviewed Sep 1 1992)
