A fault-tolerant implementation for the parallel execution of data- and computation-intensive programs that require persistent storage is described. The well-known "bag of tasks" structuring technique is used together with remote procedure calls (RPCs) to execute tasks in parallel on a network of workstations. The bag of tasks balances load across an arbitrary collection of slave processes. A single bag of tasks, however, is insufficient to express more general structures in which a task depends on more than one other task; such cases require an additional synchronization mechanism. The derivation of upper and lower bounds on the computation time required for the parallel processing of tasks is outlined. The derived estimates are demonstrated on three applications: the publicly available ray tracing package rayshade, matrix multiplication, and Cholesky factorization.
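The bag-of-tasks structure described above can be illustrated with a minimal sketch. This is not the paper's RPC-based, fault-tolerant implementation; it is a thread-based analogue (all names are hypothetical) showing only the core idea: workers repeatedly draw tasks from a shared bag, so faster or less loaded workers naturally take on more work.

```python
import queue
import threading

def run_bag_of_tasks(tasks, num_workers=4):
    """Execute independent callables using the bag-of-tasks pattern."""
    bag = queue.Queue()
    for t in tasks:
        bag.put(t)

    results = []
    lock = threading.Lock()

    def worker():
        # Each slave loops, drawing the next task from the shared bag,
        # until the bag is empty. This yields dynamic load balancing:
        # no task-to-worker assignment is fixed in advance.
        while True:
            try:
                task = bag.get_nowait()
            except queue.Empty:
                return  # bag exhausted; worker terminates
            r = task()
            with lock:
                results.append(r)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

# Example: square each number. Completion order is nondeterministic,
# which is why results are sorted before printing.
out = run_bag_of_tasks([(lambda i=i: i * i) for i in range(10)])
print(sorted(out))
```

In the paper's setting the workers are slave processes on remote workstations invoked via RPC, and tasks with dependencies on more than one predecessor would additionally need the synchronization mechanism the review mentions; this sketch covers only the independent-task case.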
The authors assess performance using these three examples. Among the most interesting conclusions in this section are, first, that increasing task size improves performance; second, that the difference between fault-tolerant and non-fault-tolerant execution times is not significant; third, that for sufficiently large granularity the cost of fault tolerance is small, while fault tolerance yields a significant improvement in the event of a failure; and finally, that doubling the computing speed of the workstations would not affect the achievable execution rate, but would reduce the number of machines required.
With the growing number of workstation networks and the increasing demand for data- and computation-intensive applications, the implementation outlined and the conclusions reached in the paper are likely to see extensive use in the future.