Computing Reviews
Bits and bugs : a scientific and historical review of software failures in computational science
Huckle T., Neckel T., Society for Industrial and Applied Mathematics, Philadelphia, PA, 2019. 251 pp. Type: Book (978-1-611975-55-0)
Date Reviewed: Nov 20 2019

Niels Bohr, in confronting the subtleties and paradoxes of quantum theory, said to one of his many famous students and acolytes, “These issues are so serious that one can only joke about them” [1]. The vernacular main title of this excellent and very important book brought Bohr’s words to mind (though the title is more lighthearted than humorous). My message here: do not be put off or misled by the main title. This book is not a popularization (zany, sensationalistic, or otherwise) of mishaps and disasters engendered by the use of computers as controlling elements. On the contrary, it is exactly what its subtitle states: a scientific and historical review of software failures in computational science. Hardware and firmware errors are also treated, as with the notorious floating-point division bug discovered and confirmed in 1994 (see Section 8.3).

The book’s writing and organization are excellent. One would hardly guess that the authors’ native language is not English. A minor nitpick on usage: the failures lie not in “computational science” itself, but in designers’ and programmers’ limited awareness of the very broad range of pitfalls in bringing digital computation and control to bear on the real world. That awareness remains dangerously limited, even after all this time. Though the book’s length is moderate, it is a virtual encyclopedia of failure modes and effects analysis (FMEA).

The authors “analyze all aspects of the selected bugs in breadth ... as well as in depth (getting the complete and scientifically sound picture of a problem).” This claim (from the introduction) is, in my opinion, fully justified. Furthermore, the interested layperson is a beneficiary as well, by means of 24 highlighted text “excursions,” such as “Exception Handling and Protection of Operations,” “Finite Element Method, Priority, and Semaphores in Concurrent Computing,” “Basic Aspects of Radiation Therapy Complexity,” and “The Traveling Salesman Problem.” (It was difficult to choose these from the equally interesting and pithy complement of excursions, so I threw a virtual dart a few times.)

As mentioned previously, the book is characterized by depth, thoroughness, and breadth, treating the entire gamut of errors engendered by computers in the loop. It is also as up to date as a book can be and treats, for example, cryptocurrency, a present-day rage. In my opinion, it is most positive and essential that this work adheres strictly to its mission, namely, to expound the problem, not the technical solution. (An example of a solution is the use of formal methods, which is my professional interest but appropriately out of the book’s scope.) Failure to stick to the knitting would have diluted, if not vitiated, exposition of both problems and solutions.

For reasons of time, space, and readers’ patience, I’ll mention only three examples that are treated in depth and breadth in this outstanding book: the Therac-25 radiation therapy machine’s fatal design and software errors (mid-1980s); the Ariane 5’s in-flight aborted launch (1996); and prevention at design time of loss of numerical significance in train-control software (2015). The first two were (also) exemplars in my in-house formal methods courses, which accompanied the use of formal methods in train-control projects, while the third was part of an actually implemented train-control system design. (I assert that these remain freshly relevant today.)

The radiation therapy section builds on, and gives full attribution to, the work of Nancy Leveson [2] (a book I highly recommend) and stands on its own as indispensable guidance on hazard elimination and mishap prevention. Excursion 22 is a clear exposition on race conditions, from which novice through expert could (re)learn. “Software Problems,” Parts A and B, covers almost all software problems and relates these to consequent serious injuries and deaths. The reliability/safety distinction, which seems to need perennial repetition, is asserted: “When a program is 98 [percent] reliable, it is, however, not 100 [percent] safe.”
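
For readers who have not met a race condition before, the following minimal sketch may help. It is an illustration only, not code from the book or from the Therac-25, and the class and function names are hypothetical. Two or more threads update shared state; without mutual exclusion, a read-then-write sequence in one thread may be interleaved with another thread's write, so an update may be lost.

```python
import threading


class Counter:
    """Shared state updated by several threads (hypothetical example)."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def unsafe_increment(self):
        # Read, then write: another thread may write between these two steps,
        # in which case its update is overwritten (a lost update).
        current = self.value
        self.value = current + 1

    def safe_increment(self):
        # The lock makes the read-modify-write sequence mutually exclusive.
        with self._lock:
            self.value += 1


def run_workers(method_name, n_threads=4, n_increments=10_000):
    """Run n_threads threads, each calling the named method n_increments times."""
    counter = Counter()

    def work():
        step = getattr(counter, method_name)
        for _ in range(n_increments):
            step()

    workers = [threading.Thread(target=work) for _ in range(n_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


if __name__ == "__main__":
    print("with lock:   ", run_workers("safe_increment"))
    print("without lock:", run_workers("unsafe_increment"))
```

The locked version always reaches the full count. The unlocked version may or may not fall short on a given run, which is part of what makes such defects hard to find by testing alone.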

Integer overflow was the proximate cause of the commanded abort of Ariane 5. This quintessentially preventable failure cost $500 million. Excursions 1–4 explain “machine numbers” and overflow very well, as well as the related exception handling. The discussion also describes, in my words, the software upgrade from Ariane 4 to Ariane 5 and the hazards of incomplete casting of numbers (such as from floating point to integer). Though I had studied Ariane 5, this book taught me significantly more.
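
To make the hazard concrete, here is a minimal sketch of narrowing a floating-point value to a signed 16-bit integer. It is an illustration under stated assumptions, not code from the book or from the flight software, and all function names are hypothetical. An unguarded narrowing can wrap silently; a guarded one turns the out-of-range case into an explicit design decision (here, either raising an error or saturating).

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_to_int16(value: float) -> int:
    """Unguarded narrowing: truncate, keep the low 16 bits, reinterpret as signed."""
    raw = int(value) & 0xFFFF
    return raw - 0x10000 if raw >= 0x8000 else raw


def checked_to_int16(value: float) -> int:
    """Guarded narrowing: refuse values outside the signed 16-bit range."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a signed 16-bit integer")
    return n


def saturate_to_int16(value: float) -> int:
    """Guarded narrowing: clamp out-of-range values to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    reading = 40000.0  # hypothetical sensor-derived value, larger than INT16_MAX
    print(wrap_to_int16(reading))      # prints -25536 (silently wrong)
    print(saturate_to_int16(reading))  # prints 32767
    try:
        checked_to_int16(reading)
    except OverflowError as err:
        print("rejected:", err)
```

Which policy is appropriate (raise, saturate, or something else) depends on the system. The point is only that the range assumption is stated and checked rather than inherited unexamined from an earlier design.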

Both of the previous examples also reveal the very serious hazards associated with unquestioned reuse of software. (Percentage reuse is an enthusiastically touted metric in proposals and in budget-and-schedule meetings. Negative comments are, to say the least, unwelcome.)

The problem of loss of numerical significance, which has been and can be a deadly hazard, precedes digital computers; however, in my experience, it has been a second-class citizen in both design and implementation. This book, which has several excursions on this subject, is most welcome in its clarity and authoritativeness.
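
A small illustration of loss of significance follows. It assumes IEEE 754 double precision, which Python's float uses on common platforms, and it is not taken from the book; the variable names are hypothetical. Adding a small value to a very large one discards the small value's contribution, and a later subtraction cannot bring it back.

```python
import math


def naive_sum(values):
    """Left-to-right floating-point summation with no error compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical data: one small term sandwiched between two huge, cancelling terms.
readings = [1e16, 1.0, -1e16]

naive = naive_sum(readings)   # 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost
exact = math.fsum(readings)   # fsum tracks the low-order parts and recovers it

print(naive, exact)  # prints 0.0 1.0
```

The mathematically exact answer is 1. The naive loop returns 0.0 without any warning, which is why this class of error is best caught at design time rather than discovered in operation.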

I recommend this book most highly and will keep it at hand for a long time to come.

Reviewer: George Hacken
Review #: CR146787 (2001-0002)
[1] Casimir, H. B. G. Haphazard reality: half a century of science. Harper & Row, New York, NY, 1983.
[2] Leveson, N. G. SafeWare: system safety and computers. Addison-Wesley, Reading, MA, 1995.
Testing And Debugging (D.2.5)
 