Computing Reviews

Scalable relative debugging
Dinh M., Abramson D., Jin C. IEEE Transactions on Parallel and Distributed Systems 25(3): 740-749, 2014. Type: Article
Date Reviewed: 06/25/14

This paper addresses the problem of debugging large distributed applications, with a focus on high-performance computing (HPC) environments. The authors work with relative debugging, a technique in which the program state of a correct (reference) execution is compared with that of an incorrect execution to narrow down the location of a bug. The programmer supplies assertions about the state that compare the correct run with the incorrect one.

The challenge in such assertion-based debugging is that the program state is a function of the input, which can vary between the reference and the incorrect run. When it does, the way matrix dimensions are divided across machines can differ as well. The key contribution of the paper is the observation that partitioning the array state by the greatest common divisor of the chunk sizes (the divisors in each dimension) used in the different runs yields the same number of chunks from the reference and incorrect runs, which can then be compared. The authors present and evaluate two techniques to make the comparison: the first uses peer-to-peer comparisons of the chunks, while the second computes hashes and compares these instead.
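To make the greatest common divisor observation concrete, here is a minimal one-dimensional sketch in Python. It is not the authors' implementation. The function names (split, block_hashes, mismatched_blocks) are hypothetical, and it assumes that each run divides the array into equal-sized chunks and that exact equality (via SHA-256 over a textual representation) is an acceptable comparison. A real tool would work in several dimensions, run the comparison across processes, and likely allow a floating-point tolerance.

```python
import hashlib
from math import gcd


def split(data, chunk_size):
    """Partition a flat list into consecutive chunks of chunk_size elements."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def block_hashes(local_chunks, block_size):
    """Cut each process-local chunk into blocks of block_size and hash each one."""
    hashes = []
    for chunk in local_chunks:
        for block in split(chunk, block_size):
            hashes.append(hashlib.sha256(repr(block).encode()).hexdigest())
    return hashes


def mismatched_blocks(ref_chunks, sus_chunks):
    """Return indices of gcd-sized blocks whose hashes differ between two runs.

    Assumption: every chunk within a run has the same length, so the first
    chunk's length stands in for the run's chunk size.
    """
    block_size = gcd(len(ref_chunks[0]), len(sus_chunks[0]))
    ref = block_hashes(ref_chunks, block_size)
    sus = block_hashes(sus_chunks, block_size)
    if len(ref) != len(sus):
        raise ValueError("runs did not produce the same number of blocks")
    return [i for i, (a, b) in enumerate(zip(ref, sus)) if a != b]


# Illustrative data: 24 elements, with one injected fault in the suspect run.
reference = list(range(24))
suspect = list(reference)
suspect[13] = -1

ref_chunks = split(reference, 6)  # reference run: 4 processes, 6 elements each
sus_chunks = split(suspect, 4)    # suspect run: 6 processes, 4 elements each

# gcd(6, 4) == 2, so both runs yield 12 aligned blocks of 2 elements.
bad = mismatched_blocks(ref_chunks, sus_chunks)
print(bad)
```

Because the block size divides both chunk sizes, no block straddles a process boundary in either run, so each block can be hashed locally and only the digests need to be exchanged. This sketch corresponds loosely to the hash-based variant described in the review; the peer-to-peer variant would exchange the blocks themselves instead.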

Overall, the paper is interesting, but it is heavily HPC-focused; it will probably interest only readers concerned with debugging HPC programs that fail over large datasets. However, some of the presented techniques, such as the greatest common divisor (gcd) observation, might be of interest to general readers.

Reviewer: Amitabha Roy
Review #: CR142440 (1409-0765)

Reproduction in whole or in part without permission is prohibited.   Copyright 2024 ComputingReviews.com™