A soft failure is the unsuccessful, premature termination of a data-parallel program, as opposed to, for instance, a hardware failure. This paper, the first of its kind, undertakes a systematic evaluation of soft failures in a big data system. The study examines a random sample of 250 soft failures and provides a classification of root causes, as well as some insight into debugging and fixes.
This work is interesting for at least two reasons. First, it establishes a peer-reviewed benchmark on soft failures that is valuable for comparison with internal investigations of similar scope. Second, it provides material from the trenches that can serve as initial criteria for validating coding and software life cycle management practices in a rising discipline (big data) where there is much confusion and no established history. For instance, programmer error in the form of a misspelled column name was one of the prominent sources of production soft failures. This may be surprising from the perspective of a traditional relational database management system (RDBMS) environment, where production schemas are static and the number of columns is relatively small. In a big data system, however, there may be thousands of column names, and the schema constraints are more dynamic. Combined with undocumented schema churn, the chain of events leading to this type of programmer error becomes easier to understand. Once it is understood, one can put safeguards in place against recurrence.
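As an illustration of the kind of safeguard meant here, the following is a minimal sketch of a pre-submission check that compares the column names a job references against the current schema and suggests likely corrections. It is not taken from the paper: the function name check_columns, the example schema, and the similarity cutoff of 0.8 are all hypothetical, chosen only to make the idea concrete.

```python
import difflib


def check_columns(referenced, schema_columns, cutoff=0.8):
    """Map each referenced column that is missing from the schema
    to a list of similarly spelled columns that do exist.

    All names here are hypothetical; a real system would obtain
    schema_columns from its own metadata store.
    """
    known = set(schema_columns)
    problems = {}
    for name in referenced:
        if name not in known:
            # Up to three close matches above the similarity cutoff.
            problems[name] = difflib.get_close_matches(
                name, schema_columns, n=3, cutoff=cutoff
            )
    return problems


# Hypothetical schema and the columns a job script refers to.
schema = ["user_id", "session_start", "click_count"]
script_columns = ["user_id", "sesion_start", "clickcount"]

problems = check_columns(script_columns, schema)
for bad_name, suggestions in problems.items():
    print(f"unknown column {bad_name!r}; did you mean {suggestions}?")
```

Running such a check before a job is submitted would turn a late production failure into an early, cheap one, although it only helps if the schema it consults is kept up to date.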
There are no groundbreaking results or findings from this work, but its novelty and incremental contributions are surely welcome.