The devil is in the details. Twenty-five million Europeans died from the bubonic plague, the “Black Death,” in the Middle Ages. The European outbreaks have been traced to a single Italian merchant ship that picked up goods from China during an outbreak there; it was unknown in Europe beforehand. Mighty Athens fell to the same plague centuries earlier, when its citizens crowded together in the city during a Spartan siege. What triggered these outbreaks? Fleas, which transmitted the plague from rat hosts to humans.
We now have computer systems reaching around the globe, spreading themselves farther and thinner (as in cell phones, and other handhelds). Is there some small detail that could bring them crashing down?
The integrity of every transactional system, whether it’s a small database, or the largest business-to-business (B2B) network, depends on clocks. If the order of transactions cannot be determined, a transactional system can’t function. Imagine transactions being sent to a bank account. They can produce very different real-world results if you vary the order in which you apply them. Synchronized clocks are the heartbeat of distributed systems.
This paper was published in the Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing, in August 1995, as “Self-stabilizing clock synchronization with Byzantine faults.” I could only find a one-page summary when I looked it up in the ACM Digital Library today, but I used the entire paper a few years ago to design some middleware.
Dolev and Welch build on the work of Lamport and Melliar-Smith [1]. They make several distinct contributions. First, they expand the robustness of the recovery protocol, to function with up to one-third of the processors known or suspected to be faulty. Two, they also remove the specification of unbounded clocks that was present in previous work, which makes their protocols usable on real-world systems. Third, they make the protocol self-stabilizing, so that no initial state has to be globally asserted, and to allow resynchronization (recovery) after transient faults, like more than a third of the processors becoming temporarily faulty. Finally, they present two protocols, one of which is dependent on a shared pulse.
The article is 19 pages long, very readable, and, as mentioned above, was used to create real software.