According to the authors, “Byzantine Agreement (BA) is the problem of making a set of processors, some of which may fail in arbitrary ways, agree on a common `value.’” The authors discuss ways in which BA can contribute to reliable processing in database systems. In BA, no one is trusted. If the general tells each of three colonels to “advance,” they ask each other, “what did the general tell you?”; then they ask, “what did the other colonel tell you the general said?” If every answer to every question is “advance,” they advance. Otherwise, they do whatever their standard operating procedure calls for. This corresponds to a transaction processing system in which identical nodes receive a transaction from a single source and then check with each other to see if they all (or a specified number) have the same transaction before they process it and update their (identical) databases. All this takes time and resources. The number of messages exchanged grows rapidly as the number of nodes increases. The number of nodes is large; if the system goal is to be reliable when m nodes have failed, 2m +:9- T1 nodes are required for BA. Since these nodes are not only identical but are required to work in lockstep, the large resource cost does not contribute to throughput. It makes you wonder if BA is worth the effort. The authors do not try very hard to sell it. They recommend BA only for distributing transactions among processing nodes. BA is probably of most interest in critical realtime environments where transactions are short, fixed-length, with a predictable arrival rate, and logically independent of one another. The network connecting the nodes would preferably be a true broadcast medium, in which all nodes hear every message at virtually the same time, thus speeding up the BA. Lots of high-value applications of this type exist in weapons systems, automatic control systems, and other systems in which irrevocable actions can be made by fallible machines.
This is a hard paper to read. BA is complicated, and the authors have not supplied much help in their choice of illustrations and notation. (A couple of misplaced subscripts compound the problem.) No quantitative examples are given; thus, you cannot compare BA to other methods nor can you judge what will happen when transactions arrive too fast or network delivery times vary. A discussion of the recovery process when an insane node is repaired does not show how the lengthy asynchronous reloading of the database in that node affects the synchronous BA procedures that continue in the perfect parts of the system. Overall, the paper does not equip the reader to evaluate BA as a solution to reliability problems.