Frequency Distribution of Error Messages

Reviewed by Greg Wilson / 2016-06-12
Keywords: Programming Languages

Pritchard2015 David Pritchard: "Frequency distribution of error messages". Proceedings of the 6th Workshop on Evaluation and Usability of Programming Languages and Tools, 10.1145/2846680.2846681.

Which programming error messages are the most common? We investigate this question, motivated by writing error explanations for novices. We consider large data sets in Python and Java that include both syntax and run-time errors. In both data sets, after grouping essentially identical messages, the error message frequencies empirically resemble Zipf-Mandelbrot distributions. We use a maximum-likelihood approach to fit the distribution parameters. This gives one possible way to contrast languages or compilers quantitatively.
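To make the abstract's fitting step concrete, here is a minimal sketch of a Zipf-Mandelbrot fit by maximum likelihood. It assumes the usual form of the distribution, in which the probability of the error at rank k is proportional to (k + q)^(-s), and it swaps in a coarse grid search for whatever optimizer the paper actually used. The function names and the grid ranges are illustrative, and fitting only five ranks (the Python counts listed in this review) is a toy setup, not a reproduction of the paper's results.

```python
import math


def zipf_mandelbrot_pmf(n_ranks, s, q):
    """Probabilities for ranks 1..n_ranks, with p(k) proportional to (k + q) ** -s."""
    weights = [(k + q) ** -s for k in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def log_likelihood(counts, s, q):
    """Log-likelihood of rank-ordered counts under the truncated distribution."""
    pmf = zipf_mandelbrot_pmf(len(counts), s, q)
    return sum(c * math.log(p) for c, p in zip(counts, pmf))


def fit_grid(counts, s_values, q_values):
    """Return the (s, q) pair on the grid with the highest log-likelihood."""
    candidates = ((s, q) for s in s_values for q in q_values)
    return max(candidates, key=lambda sq: log_likelihood(counts, sq[0], sq[1]))


# Top-five Python counts from this review, most frequent first.
python_top5 = [179624, 97186, 76026, 26097, 20758]

# Hypothetical search grid: s in 0.1..5.0 and q in 0.0..5.0, step 0.1.
s_grid = [i / 10 for i in range(1, 51)]
q_grid = [i / 10 for i in range(0, 51)]

s_hat, q_hat = fit_grid(python_top5, s_grid, q_grid)
print(f"s = {s_hat}, q = {q_hat}")
```

A real fit would use the full rank-frequency table rather than five ranks, and a continuous optimizer rather than a grid.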

Based on a large corpus of error messages, the 5 most common errors in Python programs are:

179624	SyntaxError: invalid syntax
97186	NameError: name 'NAME' is not defined
76026	EOFError: EOF when reading a line
26097	SyntaxError: unexpected EOF while parsing
20758	IndentationError: unindent does not match any outer indentation level

and the 5 most common in Java are:

702102	cannot find symbol - variable NAME
407776	';' expected
280874	cannot find symbol - method NAME
197213	cannot find symbol - class NAME
183908	incompatible types
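The NAME placeholder in both lists reflects the grouping of "essentially identical" messages that differ only in an identifier. The paper's actual grouping rules are not described here, so the following is only a sketch of the idea: it assumes that replacing any single-quoted token with 'NAME' is enough, and the helper names and sample messages are made up for illustration.

```python
import re
from collections import Counter

# Assumption: identifiers appear inside single quotes, as in Python's NameError.
QUOTED = re.compile(r"'[^']*'")


def normalize(message):
    """Collapse messages that differ only in a quoted identifier."""
    return QUOTED.sub("'NAME'", message)


def most_common_errors(messages, n=5):
    """Count normalized messages and return the n most frequent."""
    return Counter(normalize(m) for m in messages).most_common(n)


# Hypothetical sample; a real corpus would hold many thousands of messages.
sample = [
    "NameError: name 'x' is not defined",
    "NameError: name 'total' is not defined",
    "SyntaxError: invalid syntax",
]

ranked = most_common_errors(sample)
for message, count in ranked:
    print(count, message, sep="\t")
```

A rule this naive would also rewrite Java's `';' expected` as `'NAME' expected`, so a real grouping scheme needs language-specific patterns.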

What's more, the error frequencies follow a power-law (or "long-tail") distribution, which suggests that improving reporting for just a handful of errors would have a disproportionate effect on usability. But my favorite part of this paper comes toward the end of Section 3.2:

Can this [relationship] be plausible: is the total number of possible errors infinite? We will accept this as a reasonable hypothesis…

to which I can only say, "Amen."
