Distributed processing creates special problems in reconciling the work performed by the computational modules. A fault in one module can produce computational errors that propagate within that module and to other modules. Once such an error is detected, the system must recover the results that are still good and redo the portions that have been contaminated. Lin and Shin examine this problem under conditions in which fault and error detection may be neither instantaneous nor complete. The paper has two major parts: a formal analysis of damage assessment and an evaluation of the optimal rollback points.
Damage assessment is examined in three cases: the error is detected and the faulty module is identified; the error is detected but the faulty module is unidentified; and a fault is uncovered by periodic diagnostics. For each case, the authors derive sets of equations for the conditional probability that a computational node’s contamination time is no later than some time t. The derivations are detailed and lengthy. Fortunately, the authors provide a list of symbols in this section to assist readers.
The evaluation of optimal rollback points is a problem in nonlinear integer programming. The authors develop two algorithms: a rollback algorithm that minimizes the mean recovery overhead, and a branch-and-bound algorithm that supports it by extracting and examining subsets of rollback points. A small simulation is provided as a demonstration. A list of symbols should have been attached to this section as well.
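To make the idea of minimizing mean recovery overhead concrete, the following is a minimal sketch under a deliberately simplified cost model. It is not the authors' algorithm and involves no branch-and-bound search. It assumes that each checkpoint carries an estimated probability of being uncontaminated (the kind of quantity damage assessment would supply), that rolling back costs the recomputation time since the checkpoint, and that landing on a contaminated checkpoint incurs a fixed penalty. The names `Checkpoint`, `p_clean`, `retry_penalty`, `expected_overhead`, and `best_rollback_point` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    time: float     # when the checkpoint was taken
    p_clean: float  # assumed probability that the saved state is uncontaminated


def expected_overhead(cp: Checkpoint, detect_time: float, retry_penalty: float) -> float:
    """Toy cost: recomputation time plus the expected penalty for a contaminated checkpoint."""
    recompute = detect_time - cp.time
    return recompute + (1.0 - cp.p_clean) * retry_penalty


def best_rollback_point(checkpoints, detect_time: float, retry_penalty: float) -> Checkpoint:
    """Pick the checkpoint with the smallest expected overhead by plain enumeration."""
    candidates = [cp for cp in checkpoints if cp.time <= detect_time]
    if not candidates:
        raise ValueError("no checkpoint precedes the detection time")
    return min(candidates, key=lambda cp: expected_overhead(cp, detect_time, retry_penalty))


if __name__ == "__main__":
    # Illustrative values only: older checkpoints are more likely to be clean.
    cps = [Checkpoint(0.0, 1.0), Checkpoint(4.0, 0.9), Checkpoint(8.0, 0.3)]
    print(best_rollback_point(cps, detect_time=10.0, retry_penalty=10.0))
```

With these illustrative numbers the middle checkpoint wins: the most recent one is cheap to recompute from but probably contaminated, while the oldest is certainly clean but discards the most work. Raising or lowering `retry_penalty` shifts the choice toward older or newer checkpoints, which is the trade-off the review describes.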
In the last section, the authors discuss the integration of damage assessment with optimistic message logging and checkpointing schemes. They note that a significant problem in integrating damage assessment with an existing rollback scheme is the amount of stable storage required to log messages.