BigDebug

Reviewed by Greg Wilson / 2016-06-05
Keywords: Debugging

Gulzar2016 Muhammad Ali Gulzar, Matteo Interlandi, Seunghyun Yoo, Sai Deep Tetali, Tyson Condie, Todd Millstein, and Miryung Kim: "BigDebug: Debugging Primitives for Interactive Big Data Processing in Spark". Proceedings of the 38th International Conference on Software Engineering, 10.1145/2884781.2884813.

Developers use cloud computing platforms to process a large quantity of data in parallel when developing big data analytics. Debugging the massive parallel computations that run in today's data-centers is time consuming and error-prone. To address this challenge, we design a set of interactive, real-time debugging primitives for big data processing in Apache Spark, the next generation data-intensive scalable cloud computing platform. This requires rethinking the notion of step-through debugging in a traditional debugger such as gdb, because pausing the entire computation across distributed worker nodes causes significant delay and naively inspecting millions of records using a watchpoint is too time consuming for an end user.

First, BIGDEBUG's simulated breakpoints and on-demand watchpoints allow users to selectively examine distributed, intermediate data on the cloud with little overhead. Second, a user can also pinpoint a crash-inducing record and selectively resume relevant sub-computations after a quick fix. Third, a user can determine the root causes of errors (or delays) at the level of individual records through a fine-grained data provenance capability. Our evaluation shows that BIGDEBUG scales to terabytes and its record-level tracing incurs less than 25% overhead on average. It determines crash culprits orders of magnitude more accurately and provides up to 100% time saving compared to the baseline replay debugger. The results show that BIGDEBUG supports debugging at interactive speeds with minimal performance impact.
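The second primitive in the abstract (pinpoint the crash-inducing record, then resume only the affected sub-computation after a fix) is easier to picture with a toy example. What follows is a minimal sketch in plain Python. It is not BigDebug's or Spark's API, and every name in it (`map_with_crash_capture`, `resume_culprits`, the sample records) is hypothetical. It only illustrates the general idea of setting failing records aside instead of aborting the whole job, while keeping each output tied to its input position as a crude form of record-level provenance.

```python
from typing import Callable, Iterable, List, Tuple


def map_with_crash_capture(
    records: Iterable[str], fn: Callable[[str], int]
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, str, Exception]]]:
    """Apply fn to every record; set aside records that raise instead of aborting."""
    results = []   # (input index, output): the index acts as simple provenance
    culprits = []  # (input index, offending record, exception raised)
    for index, record in enumerate(records):
        try:
            results.append((index, fn(record)))
        except Exception as error:  # deliberately broad: any crash is of interest
            culprits.append((index, record, error))
    return results, culprits


def resume_culprits(
    culprits: List[Tuple[int, str, Exception]], fixed_fn: Callable[[str], int]
) -> List[Tuple[int, int]]:
    """Re-run only the records that crashed, using a corrected function."""
    return [(index, fixed_fn(record)) for index, record, _ in culprits]


def parse_age(line: str) -> int:
    """Hypothetical user function that crashes on malformed input."""
    _, age = line.split(",")
    return int(age)


def parse_age_fixed(line: str) -> int:
    """Hypothetical quick fix: map unparseable ages to -1 instead of crashing."""
    _, age = line.split(",")
    return int(age) if age.isdigit() else -1


lines = ["ann,31", "bob,forty", "cy,27"]

# First pass: the bad record is isolated, the other two complete normally.
results, culprits = map_with_crash_capture(lines, parse_age)

# After the fix, only the single failed record is recomputed.
patched = resume_culprits(culprits, parse_age_fixed)
merged = sorted(results + patched)
```

In this toy run, `culprits` holds exactly one entry, the record `"bob,forty"` at index 1, and only that record is reprocessed after the fix. The real system has to do the equivalent across distributed workers and at far larger scale, which this sketch makes no attempt to model.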

I stumbled across the journal Software: Practice & Experience in the summer of 1985, while I was working at Bell Northern Research in Ottawa. It was a revelation: here, at last, were the in-depth discussions of the design of real tools that I had always known had to be out there somewhere, but had never been able to find. The journal isn't nearly what it used to be, but this paper would have fit right in during its heyday. Its authors describe a pragmatic solution to a real-world problem, one that actually grapples with the issues practitioners face every day. Like the original paper on MapReduce, it should be required reading both in its own right and as an example of how to think when building tools.