Server rooms are often populated with powerful hardware running hypervisors, which in turn host many guest virtual machines. Here, a soft hardware error may propagate in surprising ways, affecting the hypervisor itself or one or more guests. In contrast, if a single operating system (OS) runs directly on hardware, a soft error affects only that single machine. This paper experimentally investigates soft error propagation in a virtualized server room. A soft error happens when central processing unit (CPU) registers are afflicted by random bit flips from error sources that are possible and survivable, though unwanted and unplanned. A soft error is different from, for instance, a hard failure caused by overheating.
This work is divided into two major parts: an extensive study of the propagation of soft errors, and a shorter discussion of options for fault tolerance. The propagation study injects faults using a simulation environment, and draws measurements and observations from instruction traces, fault locations, and crash analysis. The discussion considers existing fault tolerance techniques in light of those experimental results.
This work is a valuable contribution toward understanding how virtualization behaves in the presence of soft errors. However, it does not examine the real causes of random bit flipping; instead, it takes their occurrence as a given and emulates them by deterministic fault injection. Not ascertaining how likely a failure mode is does not compromise the study of propagation, but it can be a shortcoming in the assessment of fault tolerance.
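
To make the idea of deterministic fault injection concrete, here is a minimal Python sketch of a single-bit flip in a simulated register file. It is an illustration only, not the paper's simulation environment: the 64-bit register width, the register names, and the helpers `flip_bit` and `inject_fault` are all assumptions introduced here.

```python
import random

REGISTER_WIDTH = 64  # assumed register width in bits (not taken from the paper)


def flip_bit(value, bit):
    """Return value with one bit inverted, as a single-bit soft error would."""
    if not 0 <= bit < REGISTER_WIDTH:
        raise ValueError("bit index out of range")
    return value ^ (1 << bit)


def inject_fault(registers, seed):
    """Flip one bit in one register, chosen reproducibly from the seed.

    registers is a dict mapping hypothetical register names to integers.
    Returns (faulty_registers, register_name, bit); the input is not modified.
    """
    rng = random.Random(seed)              # seeded, so every run is repeatable
    name = rng.choice(sorted(registers))   # sorted for a stable choice order
    bit = rng.randrange(REGISTER_WIDTH)
    faulty = dict(registers)
    faulty[name] = flip_bit(registers[name], bit)
    return faulty, name, bit


if __name__ == "__main__":
    regs = {"rax": 0x1234, "rbx": 0x0, "rsp": 0x7FFF_0000}  # hypothetical state
    faulty, name, bit = inject_fault(regs, seed=42)
    print(f"flipped bit {bit} of {name}: {regs[name]:#x} -> {faulty[name]:#x}")
```

Because the fault location comes from a seeded generator, the same seed always reproduces the same fault, which is what allows a crash to be traced back to a known location. The sketch also shows the limitation noted above: it says nothing about how often such a flip would occur in real hardware.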