There is much debate in the software community about whether fault tolerance, security, privacy, and other resilience qualities should be built into the software as it is designed or added to a fully developed system. Because the faults against which one must provide tolerance are not necessarily known in advance, it is important to be able to add fault tolerance to existing correct “intolerant” software. In this paper, the authors explore how to add fault tolerance to a provably correct multiprocess system. They establish that under certain conditions--for example, high atomicity (that is, each process reads and writes all program variables in a single step)--fault tolerance can be added in a stepwise fashion. The key parameter is not so much how many different faults as it is the number and types of tolerance desired for these faults (failsafe, masking, and nonmasking). Depending on the combination of types of tolerance desired, the process is either polynomial or nondeterministic polynomial-time (NP) complete.
This paper is an extension of the authors’ prior work. It presents theoretical results, with practical implications and direct applications. The paper is self-contained, well organized, and quite readable. However, I found it to be a bit too long; the same material could have been presented more succinctly. Other than that, the paper presents a timely and important contribution.