The authors contend that large-scale multiprocessors are plagued by failures in hardware and software that frequently bring down the entire system, requiring that the machine be rebooted. They propose a scheme for fault containment, then attempt to show its effectiveness by simulation.
I had great difficulty following this paper. Much of the work leading to this project is described in specialized conference proceedings and symposia, which may be difficult for the reader to locate even though they are appropriately cited. The average CACM reader will not be familiar with this background, which should have been summarized where applicable.
Experiments and simulations were run on a model called Hive to show the effectiveness of the approach. Hive is not explained clearly with respect to the problem at hand.
The results are summarized in a table and a figure. The table shows only the errors injected into the system, not the effectiveness of the technique. The figure shows the time to completion of five multiprocessor combinations running three programs, but does not demonstrate the advantages, if any, of the technique. The accompanying text does not clarify the figure; it only confused me further.
This paper should not have appeared in CACM; it would have been better suited to a more specialized journal.