Computing Reviews
On Fault-Tolerant Structure, Distributed Fault-Diagnosis, Reconfiguration, and Recovery of the Array Processors
Hosseini S. IEEE Transactions on Computers 38(7): 932-942, 1989. Type: Article
Date Reviewed: Oct 1 1990

As the number of processors and interconnecting links in a multiprocessor grows, so does the risk of hardware failure. Hosseini’s paper addresses fault tolerance for parallel processing architectures. It presents a new interconnection method for array processors that is sparse enough to be inexpensive yet redundant enough to support fault tolerance through error detection, system reconfiguration, and system recovery. The paper is excellent; the ideas are new, simple, useful, and presented clearly. Where possible, the overhead cost is computed; otherwise, a plausible estimate is given.

The introduction reviews earlier work on fault tolerance for parallel processors. Previous schemes (1) rely on a certain amount of unfailing parallel hardware, (2) require a central host that performs fault diagnosis and reconfiguration for subordinate processors, or (3) consume considerable computer time. Approach (1) is unrealistic; even simple hardware components fail eventually. Scheme (2) is undesirable as the host represents a bottleneck and is itself subject to failure. Method (3) is costly and defeats the advantage multiprocessing is intended to bring. Hosseini’s scheme has none of these drawbacks. Obviously, the scheme has a cost, but the cost in terms of time and silicon area is contained and quantified as part of the paper.

Section 2 explains the new method for a linear array of M processors, grouped into m clusters of n processors each (so M = mn). A processor works either in normal mode or in check mode. Check mode saves a checkpoint and looks for faults. If check mode finds no faults, the new checkpoint replaces the previous one. Otherwise, if some links or processors have failed, the system is reconfigured using only provably working resources. After reconfiguration, the last saved checkpoint is reinstated.
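The alternating normal/check cycle described above can be modeled in a few lines. The following Python sketch is purely illustrative; the function and parameter names (`step`, `diagnose`, and so on) are this reviewer's stand-ins, not the paper's notation.

```python
def processing_cycle(state, checkpoint, step, diagnose):
    """One toy iteration: a normal-mode burst followed by a check-mode pass.

    `step` advances the computation (normal mode); `diagnose` reports
    whether any fault was found (check mode). Both are hypothetical
    stand-ins for the mechanisms in the paper.
    """
    state = step(state)          # normal mode: do useful work
    if not diagnose(state):      # check mode found no faults
        checkpoint = state       # the new checkpoint replaces the old one
    else:
        state = checkpoint       # fault: reconfigure, then reinstate the
                                 # last saved checkpoint
    return state, checkpoint
```

With a fault-free `diagnose`, the checkpoint simply tracks the state; when `diagnose` reports a fault, the state rolls back to the last saved checkpoint.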

Each processor in the array performs the checking task at some time; there is no master or central host. The array has a limited number of I/O processors (IOPs), always more than one so that the failure of an IOP can be tolerated. Section 2 also estimates the silicon area overhead, which turns out to be about 13 percent.

In case of failure, the node detecting the error broadcasts a FAIL signal and informs all connected working processors about the failing element. Hosseini explains the method in detail and refines it further for IOPs and for inter- and intracluster links. Section 2, the paper’s core, presents the detailed algorithms for diagnosis and for reconfiguration. Interestingly, the reconfiguration algorithm uses backtracking.
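The distributed diagnosis step can be pictured as flooding the FAIL signal over the surviving interconnect. The sketch below is a generic flood over a link graph, assumed for illustration only; it is not the paper's algorithm, and all names are hypothetical.

```python
def broadcast_fail(links, working, start, failed_unit):
    """Flood a FAIL(failed_unit) message from `start` to every working
    processor reachable over the link graph (illustrative model only).

    `links` maps each node to its neighbors; `working` is the set of
    nodes still believed healthy. Returns the set of informed nodes.
    """
    informed, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        if node in informed or node not in working:
            continue                          # skip failed or already-informed nodes
        informed.add(node)                    # node now knows `failed_unit` failed
        frontier.extend(links.get(node, ()))  # forward FAIL to connected processors
    return informed
```

In this toy model, a failed node (one absent from `working`) simply never forwards the message, so only provably working processors take part, mirroring the requirement that reconfiguration use only working resources.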

Section 3 generalizes the linear array to the two-dimensional case. Extending the ideas to the next dimension is relatively simple: the linear topology is applied one row at a time to the two-dimensional array. Section 3 then computes the silicon area overhead again, this time including the overhead for processor links.
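The row-at-a-time reuse of the linear scheme can be sketched as follows. Here `reconfigure_row` is a trivial stand-in (drop the known-faulty processors from a row) assumed purely for illustration; the paper's actual per-row algorithm is the one from Section 2.

```python
def reconfigure_grid(grid, faulty):
    """Apply a linear-array-style pass independently to each row of a
    two-dimensional processor grid (illustrative stand-in only)."""
    def reconfigure_row(row):
        # keep only the provably working processors in this row
        return [p for p in row if p not in faulty]
    return [reconfigure_row(row) for row in grid]
```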

Section 4, “Performance,” explains the degree of graceful degradation: for a large number of processors M, the system is guaranteed to remain functional with only half the original hardware resources. Generally, the system remains working even if more than half of the total hardware fails. Section 5, “Synchronization,” describes a method of synchronizing the individual processor clocks.

The paper does not clarify whether failures are “allowed” to occur at any time or only in normal mode. Obviously, faults can also strike in check mode, and while the system is reconfiguring itself to recover from previous faults.

The paper and the title seem to address array processors, but the author does not clearly differentiate between the various multiprocessing architectures. Not only is the distinction between array (SIMD) and general (MIMD) multiprocessors blurred, but the author confuses linear systolic arrays such as CMU’s wrap (sic; Warp was probably meant) with array processors. These minor gaps are the only blurs in a generally sharp picture and do not reduce the valuable contribution toward fault tolerance in multiprocessing. The paper presents astonishing material. I was fascinated to read of this relatively straightforward multiprocessor topology and recovery mechanism that allows one processor after another, and one interconnection after another, to be knocked out, while a smart recovery method lets the system start over each time. Hosseini’s method appears to provide for an invincible architecture. With this method, computer architecture can gain both execution speed and reliability.

Anybody designing a new parallel processing architecture must seriously consider the approach proposed by Hosseini. Future multiprocessors should be at least as good; Hosseini’s method defines a clear point of reference with new and good ideas.

Reviewer: Herbert G. Mayer. Review #: CR114262
Categories:
Array And Vector Processors (C.1.2)
Redundant Design (B.1.3)
Redundant Design (B.2.3)
Reliability, Availability, And Serviceability (C.4)
Control Structure Reliability, Testing, And Fault-Tolerance (B.1.3)
Reliability, Testing, And Fault-Tolerance (B.2.3)
Other reviews under "Array And Vector Processors":
- Array processing machines: an abstract model. van Leeuwen J., Wiedermann J. BIT 27(1): 25-43, 1987. Type: Article. Reviewed: Mar 1 1988
- A unified approach to a class of data movements on an array processor. Flanders P. IEEE Transactions on Computers 31(9): 809-819, 1982. Type: Article. Reviewed: May 1 1985
- Gracefully degradable processor arrays. Fortes J., Raghavendra C. IEEE Transactions on Computers 34(11): 1033-1044, 1985. Type: Article. Reviewed: Nov 1 1986

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®