A fault-tolerant implementation for the parallel execution of data- and computation-intensive programs that require persistent storage is described. The well-known "bag of tasks" structuring technique is used together with remote procedure calls (RPCs) to execute tasks in parallel on a network of workstations. The bag of tasks balances load across an arbitrary collection of slave processes. A single bag of tasks, however, is insufficient to express more general structures in which a task depends on more than one other task; such cases require an additional synchronization mechanism. The derivation of upper and lower bounds on the computation time required for the parallel processing of tasks is outlined. The derived estimates are demonstrated on three applications: the publicly available ray tracing package rayshade, matrix multiplication, and Cholesky factorization.
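The bag-of-tasks structure described above can be illustrated with a minimal sketch. This is not the paper's RPC-based, fault-tolerant implementation; it is a thread-based analogue (all names are hypothetical) showing only the core idea: workers repeatedly draw tasks from a shared bag, so faster or less loaded workers naturally take on more work.

```python
import queue
import threading

def run_bag_of_tasks(tasks, num_workers=4):
    """Execute independent callables using the bag-of-tasks pattern."""
    bag = queue.Queue()
    for t in tasks:
        bag.put(t)

    results = []
    lock = threading.Lock()

    def worker():
        # Each slave loops, drawing the next task from the shared bag,
        # until the bag is empty. This yields dynamic load balancing:
        # no task-to-worker assignment is fixed in advance.
        while True:
            try:
                task = bag.get_nowait()
            except queue.Empty:
                return  # bag exhausted; worker terminates
            r = task()
            with lock:
                results.append(r)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

# Example: square each number. Completion order is nondeterministic,
# which is why results are sorted before printing.
out = run_bag_of_tasks([(lambda i=i: i * i) for i in range(10)])
print(sorted(out))
```

In the paper's setting the workers are slave processes on remote workstations invoked via RPC, and tasks with dependencies on more than one predecessor would additionally need the synchronization mechanism the review mentions; this sketch covers only the independent-task case.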
The authors assess performance using these three examples. Among the most interesting conclusions in this section are, first, that increasing task size improves performance; second, that the difference between fault-tolerant and non-fault-tolerant execution times is not significant; third, that for sufficiently large granularity the cost of fault tolerance is small, while fault tolerance yields a significant improvement in the event of a failure; and finally, that doubling the computing speed of the workstations would not affect the achievable execution rate, but would reduce the number of machines required.
With the growing number of workstation networks and the increasing demand for data- and computation-intensive applications, the implementation outlined and the conclusions reached in the paper are likely to see extensive use in the future.