Three algorithms for assigning jobs to hosts in a network of arbitrarily chosen topology were studied using discrete-event simulation techniques (GPSS). Jobs arrive at the five nodes of a network at independent arrival rates varying from “light,” to “moderate,” to “heavy.” Each host periodically evaluates the balance of jobs across the network and transmits part of its queue of jobs to other, less busy nodes if the algorithm being used and its parameters indicates that this should help reduce overall response time in the network. All three algorithms studied perform “considerably better” in terms of response time than the situation in which no job movement across hosts is allowed.
Some of the assumptions made in the model restrict its generality, although it is not clear to what extent. The first such assumption is that a job which arrives at an individual host may be executed equivalently by any other host in the network. This assumption implies, for example, that there is a network-wide file system which provides equivalent service to all hosts, even though the distance between jobs and hosts along the communication network used for moving jobs varies between 0 and 2. A second assumption ignores process scheduling by modeling FCFS service at each host with no multiprogramming. This assumption avoids the problem of moving partially executed jobs through the network, and assures a maximum-length waiting queue at each host from which to draw candidates for movement. A more realistic assumption of multiprogramming would either reduce the population of job candidates for movement or require extra overhead in the network for transmitting the state of a job’s processes and i/o operations. As a result, the findings would apply to a network of identical processors providing computebound service to a population of geographically distributed users, but the applicability beyond this case is simply unknown.
What the paper does do is provide a good first step in dealing with the problem of distributed algorithms for job scheduling. Variables investigated, in addition to system-wide load, include traffic density in the network, the scheduling interval, cost of the scheduling algorithm, degraded state information, and elimination of large jobs from consideration for movement. The algorithms themselves are straightforward, involving comparison of a host’s own busyness with its estimate of the busyness of the other hosts in the system. To improve stability of the system, the algorithms limit the rate at which a host will move jobs in various ways. Although the algorithms need to be tuned to provide stable performance (and must be turned off under heavy system loads), they do in general provide improved system performance at moderate cost.