Computing Reviews
Abstracting the geniuses away from failure testing
Alvaro P., Tymon S. Communications of the ACM 61(1): 54-61, 2018. Type: Article
Date Reviewed: May 9 2018

Large-scale distributed systems are difficult to test using traditional failure-testing or fault-injection techniques. Even recent approaches such as chaos engineering rely on experienced experts who can observe the system, form hypotheses about its behavior, and design experiments that vary conditions to test those hypotheses. The process assumes the availability of human expertise, a formal specification, and the source code.

This article presents a lineage-driven fault injection (LDFI) approach that automates the process, starting with successful outcomes and reasoning backward through call-graph traces and data provenance. It was successfully applied at Netflix. I strongly recommend this excellent introductory article if you are new to chaos engineering. It gives enlightening ideas to novices. The writing is smooth and interesting.
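The backward-reasoning idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the lineage of a successful outcome has already been reduced to a list of alternative support paths (each a set of component names), and the function name `minimal_fault_sets`, the `max_faults` bound, and the component names are hypothetical. A fault hypothesis worth injecting is then a smallest set of components that intersects every support path.

```python
from itertools import combinations

def minimal_fault_sets(support_paths, max_faults=2):
    """Return minimal sets of components whose failure cuts every support path.

    support_paths: iterable of frozensets; each one alone suffices for the
    successful outcome (a hypothetical, simplified view of lineage).
    """
    support_paths = list(support_paths)
    components = sorted(set().union(*support_paths))
    found = []
    for size in range(1, max_faults + 1):
        for candidate in combinations(components, size):
            cand = frozenset(candidate)
            # Skip supersets of a hypothesis already found: they are not minimal.
            if any(prev <= cand for prev in found):
                continue
            # A candidate qualifies only if it breaks every known path to success.
            if all(cand & path for path in support_paths):
                found.append(cand)
    return found

# Hypothetical lineage: a request succeeds through either of two replicas,
# and both routes pass through the same gateway.
paths = [
    frozenset({"gateway", "replica_a"}),
    frozenset({"gateway", "replica_b"}),
]
hypotheses = minimal_fault_sets(paths)
```

Under these assumptions the sketch proposes two experiments: fail the gateway alone, or fail both replicas together. Failing a single replica is never proposed, because the other support path would still deliver the outcome.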

If you are a practicing software tester, however, you may want more than just bedtime reading. For example, in order to apply LDFI, we still need an executable specification and a correctness specification, including invariant definitions. The invariants are based on homeostatic states, which are often mistaken for steady states in the chaos engineering literature. The former refers to a relatively stable state of equilibrium such as our body temperature of 37 degrees Celsius under normal circumstances, whereas the latter refers to an unvarying condition such as our bodies at room temperature after death. Furthermore, we need to work around nonreplayability and nondeterminism in real-life distributed systems. I suggest that readers refer to Rosenthal et al. [1] for precise details of chaos engineering and Alvaro et al. [2] for technical assumptions and consequences of LDFI.
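The distinction between the two kinds of state can be made concrete with a small sketch. The function names, the tolerance value, and the sample readings are all hypothetical, chosen only to mirror the body-temperature analogy: a homeostatic invariant tolerates fluctuation around a set point, whereas a steady-state check in the strict sense demands no variation at all.

```python
def within_homeostatic_band(samples, setpoint, tolerance):
    """True if every sample stays within +/- tolerance of the set point."""
    return all(abs(s - setpoint) <= tolerance for s in samples)

def is_unvarying(samples):
    """True only if all samples are identical (a strict steady state)."""
    return len(set(samples)) <= 1

# Hypothetical readings in degrees Celsius.
body_temp = [36.8, 37.1, 37.0, 36.9]
homeostatic = within_homeostatic_band(body_temp, setpoint=37.0, tolerance=0.5)
steady = is_unvarying(body_temp)
```

With these readings the homeostatic check passes while the strict steady-state check fails, which is why an invariant written as "the metric never changes" would reject a healthy system.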

Reviewer: T.H. Tse
Review #: CR146024 (1807-0384)
1) Rosenthal, C.; Hochstein, L.; Blohowiak, A.; Jones, N.; Basiri, A. Chaos engineering: building confidence in system behavior through experiments. O'Reilly, Sebastopol, CA, 2017.
2) Alvaro, P.; Andrus, K.; Sanden, C.; Rosenthal, C.; Basiri, A.; Hochstein, L. Automating failure testing research at Internet scale. In Proc. of the 7th ACM Symposium on Cloud Computing (SoCC '16). ACM, New York, NY, 2016, 17–28.
Testing And Debugging (D.2.5)