Computing Reviews, the leading online review service for computing literature.

A characteristic study on failures of production distributed data-parallel programs
Li S., Zhou H., Lin H., Xiao T., Lin H., Lin W., Xie T. ICSE 2013 (Proceedings of the 2013 International Conference on Software Engineering, San Francisco, CA, May 18-26, 2013), 963-972. 2013. Type: Proceedings

Date Reviewed: Aug 27 2014

A soft failure is the unsuccessful, premature termination of a data-parallel program, as opposed to, for instance, a hardware failure. This paper, the first of its kind, undertakes a systematic evaluation of soft failures in a big data system. The study examines a random sample of 250 soft failures and provides a classification of root causes, as well as some insight into debugging and fixes.

This work is interesting for at least two reasons. First, it establishes a peer-reviewed benchmark on soft failures that is valuable for comparison with internal investigations of similar scope. Second, it provides material from the trenches for initial criteria to validate coding and software life cycle management practices in a rising discipline (big data) where there is much confusion and no established history.

For instance, programmer error in misspelling a column name was one of the prominent sources of production soft failures. This may be surprising from the perspective of a traditional relational database management system (RDBMS) environment, where production schemas are static and the number of columns is relatively small. However, in a big data system, there may be thousands of column names, and the schema constraints are more dynamic. Combined with undocumented schema churn, the chain of events leading to this type of programmer error becomes easier to understand. Once it is understood, safeguards can be put in place against recurrence.

There are no groundbreaking results or findings in this work, but its novelty and incremental contributions are surely welcome.
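As an illustration of the kind of safeguard against misspelled column names that the review alludes to, the following is a minimal sketch, not something taken from the paper. It assumes the column names a job references can be extracted before submission and checked against the current schema; the function `check_columns`, the schema, and the column names are all hypothetical.

```python
import difflib


def check_columns(referenced, schema):
    """Return {unknown_name: [similar known names]} for names not in the schema."""
    known = set(schema)
    problems = {}
    for name in referenced:
        if name not in known:
            # Suggest up to three similar known names for the likely typo.
            problems[name] = difflib.get_close_matches(name, schema, n=3)
    return problems


# Hypothetical schema and column references, for illustration only.
schema = ["user_id", "session_start", "query_text"]
referenced = ["user_id", "sesion_start"]
problems = check_columns(referenced, schema)

for bad_name, suggestions in problems.items():
    print(f"unknown column {bad_name!r}; did you mean one of {suggestions}?")
```

A check of this sort would run before the job is submitted, so that a misspelled name is reported early instead of terminating the job in production. It does nothing about schema churn itself, which would also require keeping the schema used for the check up to date.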

Reviewer: A. Squassabia	Review #: CR142667 (1412-1056)

Metrics (D.2.8 )

Parallelism And Concurrency (F.1.2 ... )

Software Development (K.6.3 ... )

Modes Of Computation (F.1.2 )

Software Management (K.6.3 )

Other reviews under "Metrics":	Date

A comparison of time domains for software reliability models Musa J., Okumoto K. Journal of Systems and Software 4(4): 277-287, 1984. Type: Article	May 1 1985

On software equations Král J. Information Processing Letters 19(4): 191-196, 1984. Type: Article	Jun 1 1985

Software metrics: establishing a company-wide program Grady R., Caswell D., Prentice-Hall, Inc., Upper Saddle River, NJ, 1987. Type: Book (9789780138218447)	Apr 1 1988
