Computing Reviews
CUDA-enabled sparse matrix-vector multiplication on GPUs using atomic operations
Dang H., Schmidt B. Parallel Computing 39(11): 737-750, 2013. Type: Article
Date Reviewed: Aug 8, 2014

Sparse matrix-vector multiplication is a critical component of many applications, including physical modeling, linear solvers, and life science analyses. Faster sparse matrix-vector multiplication lets applications that rely on it run faster and/or handle higher data resolution. Dang and Schmidt offer a new sparse matrix format, together with a sparse matrix-vector multiplication implementation, that exploits current graphics processing unit (GPU) architectures to outperform existing formats and implementations.

The contributions of their work include the sliced coordinate (SCOO) sparse matrix format and a CUDA-enabled parallel implementation of a sparse matrix-vector multiplication algorithm for NVIDIA GPUs. Their implementation is highly tuned for NVIDIA’s GPU architecture, taking into consideration memory alignment and coalescing, cache hit rates, utilization of texture memory, multi-GPU support, asynchronous stream support, and usage of optimized atomic operations for single-precision data.
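
The review does not reproduce the SCOO layout itself, but the atomic-accumulation idea the implementation relies on can be illustrated with a plain coordinate (COO) kernel in which each thread handles one nonzero and adds its contribution to the output vector with atomicAdd. The sketch below is illustrative only: the kernel and variable names are placeholders, not the authors' code, which additionally applies the slicing, texture-memory, and tuning techniques described above.

// Minimal illustrative sketch of COO-style SpMV with atomic accumulation.
// Not the authors' SCOO implementation; names (coo_spmv, nnz, etc.) are
// placeholders chosen for this example.
#include <cuda_runtime.h>

__global__ void coo_spmv(int nnz,
                         const int   *row,   // row index of each nonzero
                         const int   *col,   // column index of each nonzero
                         const float *val,   // nonzero values (single precision)
                         const float *x,     // dense input vector
                         float       *y)     // dense output vector, pre-zeroed
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nnz) {
        // Single-precision atomicAdd is hardware-supported on Fermi-class and
        // later GPUs, which is why the single-precision path is the fast one.
        atomicAdd(&y[row[i]], val[i] * x[col[i]]);
    }
}

// Host-side launch sketch: one thread per nonzero.
// coo_spmv<<<(nnz + 255) / 256, 256>>>(nnz, d_row, d_col, d_val, d_x, d_y);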

The SCOO format and implementation are tested against several existing formats, using NVIDIA's Cusp library on both Fermi- and Kepler-class NVIDIA GPUs and Intel's Math Kernel Library (MKL) on a Sandy Bridge central processing unit (CPU). Tests on a set of structured and unstructured sparse matrices show moderate performance improvements: average speedups of 2.39 to 5.25 over the GPU implementations and speedup factors of 5.5 to 18 over the MKL CPU implementation. The authors concede an important point: their format and implementation perform considerably better for single-precision floating-point values than for double precision, primarily due to non-native hardware support for 64-bit floating-point atomic operations. However, as hardware support for this data type improves, the proposed implementation should accommodate the change well.

Practitioners and domain experts may find this paper too high level to directly incorporate the SCOO format and parallel implementation into their own applications, and there is no mention of packaging this work as a callable library. However, the source code is publicly available, so developers who need the extra performance Dang and Schmidt report could incorporate this work into their own applications.

Reviewer: Chris Lupo | Review #: CR142605 (1411-0993)
Graphics Processors (I.3.1)
Sparse, Structured, And Very Large Systems (Direct And Iterative Methods) (G.1.3)
