Frequency Distribution of Error Messages

Reviewed by Greg Wilson / 2016-06-12
Keywords: Programming Languages

Pritchard2015 David Pritchard: "Frequency distribution of error messages". Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, 10.1145/2846680.2846681.

Which programming error messages are the most common? We investigate this question, motivated by writing error explanations for novices. We consider large data sets in Python and Java that include both syntax and run-time errors. In both data sets, after grouping essentially identical messages, the error message frequencies empirically resemble Zipf-Mandelbrot distributions. We use a maximum-likelihood approach to fit the distribution parameters. This gives one possible way to contrast languages or compilers quantitatively.

Based on a large corpus of error messages, the 5 most common errors in Python programs are:

179624 SyntaxError: invalid syntax
97186 NameError: name 'NAME' is not defined
76026 EOFError: EOF when reading a line
26097 SyntaxError: unexpected EOF while parsing
20758 IndentationError: unindent does not match any outer indentation level

and the 5 most common in Java are:

702102 cannot find symbol - variable NAME
407776 ';' expected
280874 cannot find symbol - method NAME
197213 cannot find symbol - class NAME
183908 incompatible types
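A quick way to see how steeply these counts fall off with rank is to regress log(count) on log(rank). The sketch below does this for the two top-five lists above. It is not the paper's method (the paper fits Zipf-Mandelbrot parameters by maximum likelihood over the full data sets), and the function name `loglog_slope` is made up for illustration. With only five points per language, the slope is a rough indication at best.

```python
import math

# Top-five counts copied from the lists above, rank 1 first.
python_counts = [179624, 97186, 76026, 26097, 20758]
java_counts = [702102, 407776, 280874, 197213, 183908]


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    A pure power law count = C / rank**s gives a slope of exactly -s.
    This is a crude estimate, not a maximum-likelihood fit.
    """
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(c) for c in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


for name, counts in [("Python", python_counts), ("Java", java_counts)]:
    print(name, round(loglog_slope(counts), 2))
```

A more negative slope means the counts drop faster from one rank to the next, so a few messages account for more of the total.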

What's more, the frequencies follow a power-law (or "long tail") distribution, which suggests that improving the reporting for just a handful of errors would have a disproportionate effect on usability. But my favorite part of this paper comes toward the end of Section 3.2:

Can this [relationship] be plausible: is the total number of possible errors infinite? We will accept this as a reasonable hypothesis…

to which I can only say, "Amen."