Blue screens of death and cryptic error messages are all too common in contemporary systems, while graceful recovery and restart are all too rare. This paper documents a first step on the thousand-mile journey to self-healing operating systems.
Shapiro provides a clear description of the total lack of self-healing in modern software, and then describes the architectural approach he used in Sun Solaris 10 to address the problem. The paper is easy to read and understand, doesn’t waste time on accusation or self-promotion, and provides a small bibliography that will get an interested reader started quickly on topical research. Sun, IBM, HP, and Microsoft are all touting “self-healing,” “adaptive,” “dynamic,” and/or “crashless” system initiatives, but this is the first public revelation of a vendor’s actual work that I am aware of.
The paper is nine pages long but reads like five, since many graphics and figures take up page real estate. I liked the paper so much that I would like to write more praise, but it’s so short that verbosity seems inappropriate. Read it; it’s good.