Large, production quality distributed systems still fail periodically, and do so sometimes catastrophically, where most or all users experience an outage or data loss. We present the result of a comprehensive study investigating 198 randomly selected, user-reported failures that occurred on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop MapReduce, and Redis, with the goal of understanding how one or multiple faults eventually evolve into a user-visible failure. We found that from a testing point of view, almost all failures require only 3 or fewer nodes to reproduce, which is good news considering that these services typically run on a very large number of nodes. However, multiple inputs are needed to trigger the failures with the order between them being important. Finally, we found the error logs of these systems typically contain sufficient data on both the errors and the input events that triggered the failure, enabling the diagnose and the reproduction of the production failures. We found the majority of catastrophic failures could easily have been prevented by performing simple testing on error handling code – the last line of defense – even without an understanding of the software design. We extracted three simple rules from the bugs that have lead to some of the catastrophic failures, and developed a static checker, Aspirator, capable of locating these bugs. Over 30% of the catastrophic failures would have been prevented had Aspirator been used and the identified bugs fixed. Running Aspirator on the code of 9 distributed systems located 143 bugs and bad practices that have been fixed or confirmed by the developers.
This is a wonderful study that explores the reasons why distributed systems fail in production by analyzing the root causes of around 200 confirmed system failures. The failures were reported against open-source projects that are used to implement distributed systems, including Hadoop and Redis.
The trends in failure modes among these systems are fascinating and actionable. The authors report many findings, but here are some key takeaways for practitioners:
[A]lmost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.
[I]n 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code.
A majority (77%) of the failures require more than one input event to manifest, but most of the failures (90%) require no more than 3.
For a majority (84%) of the failures, all of their triggering events are logged.
Almost all catastrophic failures (92%) are the result of incorrect handling of non-fatal errors explicitly signaled in software.
The lessons for software engineers are clear: put more effort into writing and validating your error handling code, run your integration tests with a small number of nodes (but greater than one), and examine your application logs to diagnose failures.
This study also provides evidence for what Daniel Jackson calls the "small-scope hypothesis": you can identify most defects through tests with a small number of entities.
The authors developed a tool called Aspirator to help identify inadequate error-handling in Java applications. But even if you don't write software in Java, if you build distributed systems, you should read this paper.Comments powered by Disqus