The Relevance of Classic Fuzz Testing

Reviewed by Greg Wilson / 2021-10-01
Keywords: Fuzz Testing, Software Quality

As a class exercise in 1988, Prof. Barton Miller had students throw randomly generated inputs at standard Unix command-line utilities, and found that an astonishing 25–33% of those widely used programs crashed. Thirty years later, with "fuzz testing" a well-established testing technique, Miller and colleagues repeated the experiment on Linux, FreeBSD, and MacOS. There had been little improvement:

9/74 (12%) of the programs tested crashed or hung on Linux, 15/78 (19%) on FreeBSD, and 12/76 (15%) on MacOS.
The causes were the same as they always had been: pointer errors, array bounds errors, and unchecked return codes from system calls.
Tools written in a more modern language (Rust) were no more reliable than those written in C.

There was some good news: several categories of errors that showed up in previous studies didn't show up this time, including problems related to end-of-file errors and division by zero. Overall, though, this paper is depressing reading, especially since open source fuzz-testing libraries are available in every major programming language and only take a few minutes to set up.
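To give a sense of how little machinery the classic technique needs, here is a minimal sketch in Python. It is not the tooling used in the paper; the function names (`random_bytes`, `fuzz_once`, `fuzz`), the 1000-byte input limit, and the five-second timeout are all assumptions made for illustration. It feeds random bytes to a command on standard input and applies the same failure criterion the study describes: did the program crash or hang?

```python
import random
import subprocess
import sys


def random_bytes(rng, max_len=1000):
    """Return a random-length string of random bytes (NULs included)."""
    length = rng.randint(0, max_len)
    return bytes(rng.randrange(256) for _ in range(length))


def fuzz_once(cmd, data, timeout=5.0):
    """Feed `data` to `cmd` on stdin and classify the outcome.

    "hang"  : the process did not finish within `timeout` seconds
    "crash" : the process was killed by a signal (on POSIX, subprocess
              reports this as a negative return code)
    "ok"    : anything else, including ordinary non-zero exit codes
    """
    try:
        result = subprocess.run(
            cmd,
            input=data,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "hang"
    return "crash" if result.returncode < 0 else "ok"


def fuzz(cmd, trials=20, seed=0, timeout=5.0):
    """Run `trials` random inputs against `cmd`; return the failing cases."""
    rng = random.Random(seed)  # seeded so that failures can be reproduced
    failures = []
    for _ in range(trials):
        data = random_bytes(rng)
        outcome = fuzz_once(cmd, data, timeout)
        if outcome != "ok":
            failures.append((outcome, data))
    return failures
```

Calling `fuzz(["sort"])`, for example, returns a list of `(outcome, input)` pairs for any runs that crashed or hung. The sketch only covers programs that read standard input, and the negative-return-code check is specific to POSIX systems; utilities that take file arguments or need a terminal would require more setup.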

Miller2020 Barton Miller, Mengxiao Zhang, and Elisa Heymann: "The Relevance of Classic Fuzz Testing: Have We Solved This One?". IEEE Transactions on Software Engineering, 2020, 10.1109/tse.2020.3047766.

As fuzz testing has passed its 30th anniversary, and in the face of the incredible progress in fuzz testing techniques and tools, the question arises if the classic, basic fuzz technique is still useful and applicable? In that tradition, we have updated the basic fuzz tools and testing scripts and applied them to a large collection of Unix utilities on Linux, FreeBSD, and MacOS. As before, our failure criterion was whether the program crashed or hung. We found that 9 crash or hang out of 74 utilities on Linux, 15 out of 78 utilities on FreeBSD, and 12 out of 76 utilities on MacOS. A total of 24 different utilities failed across the three platforms. We note that these failure rates are somewhat higher than in our previous 1995, 2000, and 2006 studies of the reliability of command line utilities. In the basic fuzz tradition, we debugged each failed utility and categorized the causes of the failures. Classic categories of failures, such as pointer and array errors and not checking return codes, were still broadly present in the current results. In addition, we found a couple of new categories of failures appearing. We present examples of these failures to illustrate the programming practices that allowed them to happen. As a side note, we tested the limited number of utilities available in a modern programming language (Rust) and found them to be of no better reliability than the standard ones.
