Computing Reviews
Scalable relative debugging
Dinh M., Abramson D., Jin C. IEEE Transactions on Parallel and Distributed Systems 25(3): 740-749, 2014. Type: Article
Date Reviewed: Jun 25 2014

This paper addresses the problem of debugging large distributed applications, with a focus on high-performance computing (HPC) environments. The authors build on relative debugging, a technique in which the program state of a correct (reference) execution is compared with the program state of an incorrect execution to narrow down the bug. The programmer supplies assertions about the state that compare the correct run with the incorrect one.

The challenge in such assertion-based debugging is that the program state depends on the input, which can vary between the reference run and the incorrect run. When it does, the division of matrix dimensions across machines can also vary, so the chunks held by each machine in one run need not line up with those in the other. The key contribution of the paper is the observation that splitting the array state according to the greatest common divisor of the chunk sizes (the divisors in each dimension) used in the different runs produces the same number of chunks from the reference and incorrect runs, which can then be compared. The authors present and evaluate two techniques for making the comparison: the first uses peer-to-peer comparisons of the chunks, while the second computes hashes of the chunks and compares these instead.
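The greatest common divisor observation can be illustrated with a small sketch. This is not the authors' implementation: it assumes a one-dimensional array, an even block distribution, and exact (bitwise) equality, and every name in it (split_even, rechunk, chunk_hash, mismatched_chunks) is hypothetical. It runs in a single process, so the "peer-to-peer" variant is only simulated by comparing chunks directly.

```python
import hashlib
import struct
from math import gcd


def split_even(data, parts):
    """Block-distribute `data` over `parts` ranks.

    Assumption for this sketch: len(data) is divisible by `parts`.
    """
    size = len(data) // parts
    return [data[i * size:(i + 1) * size] for i in range(parts)]


def rechunk(per_rank, chunk):
    """Cut every rank's local block into sub-blocks of length `chunk`."""
    out = []
    for local in per_rank:
        for start in range(0, len(local), chunk):
            out.append(local[start:start + chunk])
    return out


def chunk_hash(values):
    """Hash a chunk of floats by packing them as native doubles."""
    packed = struct.pack(f"{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def mismatched_chunks(ref_ranks, sus_ranks, use_hash=True):
    """Return indices of gcd-sized chunks that differ between two runs."""
    # The gcd of the two local block sizes divides both, so both runs
    # can be cut into the same number of equally sized chunks.
    g = gcd(len(ref_ranks[0]), len(sus_ranks[0]))
    ref = rechunk(ref_ranks, g)
    sus = rechunk(sus_ranks, g)
    assert len(ref) == len(sus)
    if use_hash:
        # Hash variant: only the digests would need to be exchanged.
        return [i for i, (a, b) in enumerate(zip(ref, sus))
                if chunk_hash(a) != chunk_hash(b)]
    # Direct variant: compare the chunk contents themselves.
    return [i for i, (a, b) in enumerate(zip(ref, sus)) if a != b]


# Reference run on 4 ranks (blocks of 6), suspect run on 6 ranks (blocks of 4).
reference = [float(i) for i in range(24)]
suspect = list(reference)
suspect[13] = -1.0  # injected fault, for illustration only

ref_ranks = split_even(reference, 4)
sus_ranks = split_even(suspect, 6)

print(mismatched_chunks(ref_ranks, sus_ranks))  # -> [6]
```

Here the local block sizes are 6 and 4, so both runs are cut into 12 chunks of length 2, and the injected fault at element 13 appears in chunk 6 under either comparison.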

Overall, the paper is interesting, but it is heavily HPC-focused; it will probably interest only readers concerned with debugging HPC programs that fail on large datasets. However, some of the presented techniques, such as the greatest common divisor (gcd) observation, might be of interest to general readers.

Reviewer:  Amitabha Roy Review #: CR142440 (1409-0765)
Distributed Debugging (D.2.5 ... )
Assertion Checkers (D.2.4 ... )
Parallelism And Concurrency (F.1.2 ... )