Frequency Distribution of Error Messages
Reviewed by Greg Wilson / 2016-06-12
Keywords: Programming Languages
Pritchard2015 David Pritchard: "Frequency distribution of error messages". Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, doi:10.1145/2846680.2846681.
Which programming error messages are the most common? We investigate this question, motivated by writing error explanations for novices. We consider large data sets in Python and Java that include both syntax and run-time errors. In both data sets, after grouping essentially identical messages, the error message frequencies empirically resemble Zipf-Mandelbrot distributions. We use a maximum-likelihood approach to fit the distribution parameters. This gives one possible way to contrast languages or compilers quantitatively.
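To make the fitting step in the abstract concrete, here is a minimal sketch of a maximum-likelihood fit of a Zipf-Mandelbrot distribution, in which the probability of the message at rank k is proportional to (k + q)^(-s). It is not the paper's procedure: the finite number of ranks, the coarse grid search, the function names `zm_log_likelihood` and `fit_zm`, and the synthetic counts are all assumptions made for illustration.

```python
import math

def zm_log_likelihood(counts, s, q):
    """Log-likelihood of rank-ordered counts under p(k) proportional to (k + q)**-s."""
    n_ranks = len(counts)
    # Normalising constant over the finite set of ranks 1..N (an assumption here).
    log_z = math.log(sum((k + q) ** -s for k in range(1, n_ranks + 1)))
    return sum(c * (-s * math.log(k + q) - log_z)
               for k, c in enumerate(counts, start=1))

def fit_zm(counts, s_grid, q_grid):
    """Return the (s, q) pair on the grid with the highest log-likelihood."""
    return max(((s, q) for s in s_grid for q in q_grid),
               key=lambda sq: zm_log_likelihood(counts, *sq))

# Synthetic counts generated from known parameters, NOT data from the paper.
true_s, true_q, n_ranks, total = 1.5, 2.0, 200, 100_000
weights = [(k + true_q) ** -true_s for k in range(1, n_ranks + 1)]
z = sum(weights)
counts = [round(total * w / z) for w in weights]

s_grid = [1.0 + i / 10 for i in range(11)]  # candidate exponents 1.0 .. 2.0
q_grid = [0.5 * i for i in range(9)]        # candidate shifts 0.0 .. 4.0
s_hat, q_hat = fit_zm(counts, s_grid, q_grid)
print(s_hat, q_hat)  # should land on or near the generating parameters
```

A grid search keeps the sketch dependency-free; a numerical optimizer would be the more usual choice for real data.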
Based on a large corpus of error messages, the 5 most common errors in Python programs are:
179624 | SyntaxError: invalid syntax
97186 | NameError: name 'NAME' is not defined
76026 | EOFError: EOF when reading a line
26097 | SyntaxError: unexpected EOF while parsing
20758 | IndentationError: unindent does not match any outer indentation level
and the 5 most common in Java are:
702102 | cannot find symbol - variable NAME
407776 | ';' expected
280874 | cannot find symbol - method NAME
197213 | cannot find symbol - class NAME
183908 | incompatible types
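The NAME placeholder in these lists reflects the grouping of "essentially identical messages" mentioned in the abstract. A minimal sketch of that kind of grouping is below; the regular expression, the helper names `normalize` and `most_common_errors`, and the sample messages are hypothetical, and the paper's actual grouping rules may well differ.

```python
import re
from collections import Counter

# Hypothetical rule: collapse any quoted identifier after "name" into NAME.
QUOTED_NAME = re.compile(r"name '[^']*'")

def normalize(message):
    """Replace the specific identifier in a message with the placeholder NAME."""
    return QUOTED_NAME.sub("name 'NAME'", message)

def most_common_errors(messages, n=5):
    """Count normalized messages and return the n most frequent as (message, count)."""
    return Counter(normalize(m) for m in messages).most_common(n)

# Made-up sample messages, not taken from the paper's corpus.
sample = [
    "NameError: name 'x' is not defined",
    "NameError: name 'total' is not defined",
    "SyntaxError: invalid syntax",
]
top = most_common_errors(sample)
```

With this rule the two NameError messages fall into a single group, which is what lets a ranking like the ones above be computed at all.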
What's more, the frequencies of these messages follow a power-law (or "long tail") distribution, which suggests that improving the reporting of just a handful of errors would have a disproportionate effect on usability. But my favorite part of this paper comes toward the end of Section 3.2:
Can this [relationship] be plausible: is the total number of possible errors infinite? We will accept this as a reasonable hypothesis…
to which I can only say, "Amen."