My talk is going to be a little different. It is about the surprising efficiency and the exponential cost of fuzzing. How can something be surprisingly efficient and also come with an exponential cost? The actual purpose of this talk is to introduce you to the weirdness of randomness and probability theory. Given the name of this event I won't talk much about the theory, but I want to talk about intuition. One of the key messages I want you to take away from this talk is that we might have strong intuitions about solving a problem, but without a deep understanding of the problem our intuitions can sometimes lead us astray.

So what is fuzzing? Fuzzing is a testing process. Here we are given a program, and that program takes four characters.

One approach to testing this program is called white box fuzzing: white box because the analysis can actually see the internals of the program. The program has four branches, and we could try to explore every path through the program simply by capturing the conditions of each of these branches. Say one input exercises the path where the first character is not a 'b', another input exercises the path where the first character is a 'b' but the second is not an 'a', and so on. So we could explore all paths, and after exploring the fifth path we would see that the program actually crashes. In this sense the approach called white box fuzzing is the most effective. Why most effective? Because it can actually prove the absence of an error, in this case an assertion violation, simply by enumerating all paths through the program. We are making some assumptions, but in principle white box fuzzing would be able to prove the absence of errors, and in this sense it is the most effective.

It is also quite efficient. There are five paths in that program; if we sample a path at random without replacement, we would expect to need about three inputs to find the crashing bug, with the inputs generated by enumerating these paths using white box fuzzing.

On the other side of the spectrum there is black box fuzzing. Black box fuzzing doesn't know anything about the program; it simply generates random inputs for it. On most machines a character can take one of 256 values, so suppose we just generate these values by sampling uniformly at random, with replacement.
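As a rough illustration, here is a minimal sketch of this generational black box setup. The program under test is a hypothetical reconstruction of the four-character example from the talk, assumed to crash on the input "bad!"; the code is illustrative rather than the speaker's.

```python
import random

def program_under_test(s):
    # Hypothetical stand-in for the example program: four nested branches,
    # and the path reached by "bad!" violates an assertion.
    if len(s) == 4 and s[0] == 'b':
        if s[1] == 'a':
            if s[2] == 'd':
                if s[3] == '!':
                    raise AssertionError("crash reached")

def generational_blackbox_fuzz(max_trials):
    # Generate whole inputs uniformly at random, with replacement.
    for trial in range(1, max_trials + 1):
        candidate = ''.join(chr(random.randrange(256)) for _ in range(4))
        try:
            program_under_test(candidate)
        except AssertionError:
            return trial  # number of inputs needed to hit the crash
    return None

# Each input is the crashing value with probability (1/256)**4, so the
# number of inputs until the crash is geometrically distributed with an
# expectation of 256**4, roughly 4.3 billion.
```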
Of course, black box fuzzing can never prove the absence of errors. In fact there is this great quote by Dijkstra: program testing can be used to show the presence of bugs, but never to show their absence. More recently we have looked into the problem of giving testing some kind of statistical guarantee, so that we can compute what is called the residual risk after a certain amount of testing, but the fact remains: black box fuzzing is not the most effective technique, because it cannot prove the absence of errors.

Okay, so let's look at this from a statistical perspective. I said earlier that white box fuzzing would require about three inputs in expectation to find that bug. For black box fuzzing, if we sample the value of each character uniformly at random, we would expect to generate about four billion inputs before discovering the bug. That sounds like a lot. So white box fuzzing should be better, right? It is the most effective, and it is also quite efficient. And here is where our intuition goes wrong for the first time: at least not always. Sometimes black box fuzzing wins. This was discovered 30 to 40 years ago, when people started running experiments with these two techniques, the hypothetically most effective technique and simple random sampling. They found that, given some limited time, this very simple, very dumb technique wins and finds more bugs than the most effective technique.

We looked at this very recently from a statistical perspective and found that if a white box fuzzer takes too long per input, a black box fuzzer outperforms it. If generating one input takes too long, the black box fuzzer wins, because the black box fuzzer generates inputs very fast. On my machine it takes about 6.3 seconds to generate four billion inputs, and if I had 100 machines it would take 63 milliseconds. So we can easily scale bug finding, because black box fuzzing is essentially embarrassingly parallel. In this paper we actually give a probabilistic bound on this maximum time per input.

So is black box fuzzing the best that we can do? Again the answer is no; we can do even better than that. What I presented so far was a generational black box fuzzer, meaning that I generate values for each of the characters from scratch. Instead of generating inputs from scratch, which is the generational approach, maybe we can reuse existing inputs; this is called mutational black box fuzzing. So suppose we have a seed input, which is "bad?", and the bug-triggering input is "bad!".
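Here is a minimal sketch of such a mutational fuzzer for the same example. The mutate function and the direct check against "bad!" (standing in for running the program and observing the crash) are illustrative assumptions.

```python
import random

def mutate(seed):
    # Pick one of the seed's character positions at random and replace it
    # with a uniformly random byte value.
    pos = random.randrange(len(seed))
    return seed[:pos] + chr(random.randrange(256)) + seed[pos + 1:]

def mutational_blackbox_fuzz(seed, max_trials):
    for trial in range(1, max_trials + 1):
        candidate = mutate(seed)
        if candidate == "bad!":  # stand-in for observing the crash
            return trial
    return None

# Starting from the seed "bad?", a single mutation produces "bad!" with
# probability (1/4) * (1/256) = 1/1024 per generated input.
```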
A mutational fuzzer like this simply selects one of the characters at random and then chooses a value for it at random, so each generated input turns "bad?" into "bad!" with probability 1/4 times 1/256, which is 1/1024. In expectation that is about a thousand inputs, which is much less than four billion inputs.

But I cheated a little bit here: we chose a really good seed to start with. So can we do even better? Can we somehow automatically discover this seed input? This is where the third approach, gray box fuzzing, comes in. Can we take the advantage of black box fuzzing and the advantage of white box fuzzing and somehow combine them? In gray box fuzzing we don't actually analyze the program, but we take some feedback from its execution, called coverage feedback, and we add generated inputs that increase coverage to the corpus.

So suppose we just start with a random input. The probability that we generate the first coverage-increasing input is roughly one in a thousand, more precisely 1/1024, which means that in expectation we require about 1,024 inputs to generate the first coverage-increasing input. We add it to the corpus, and then we select the next seed uniformly at random; generating the next coverage-increasing input, by choosing this seed, this character, and 'a' as the value, requires about 2,000 inputs in expectation. We go on like that, and starting from a random input we can find the bug using only about 10,000 inputs. On my machine that takes just 150 microseconds. This is much faster than a symbolic execution tool, which requires all this machinery, constraint solving and encoding and so on, for its three inputs.

In the work that we presented at CCS '16 we also boosted gray box fuzzing, simply by choosing the seed which exercises the lowest-probability path, and that takes us down further to about four thousand inputs and 55 microseconds.
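Here is a minimal sketch of the coverage-guided gray box loop just described. The per-character coverage signal and all names are illustrative assumptions; a real gray box fuzzer obtains branch coverage from lightweight instrumentation of the program.

```python
import random

def coverage_of(s):
    # Stand-in coverage signal: how many of the nested branches of the
    # example program ('b', 'a', 'd', '!') this input gets past.
    depth = 0
    for got, want in zip(s, "bad!"):
        if got != want:
            break
        depth += 1
    return depth

def greybox_fuzz(max_trials):
    # Start from a single random seed and keep every coverage-increasing input.
    corpus = [''.join(chr(random.randrange(256)) for _ in range(4))]
    best = max(coverage_of(s) for s in corpus)
    for trial in range(1, max_trials + 1):
        seed = random.choice(corpus)              # pick a seed uniformly at random
        pos = random.randrange(4)                 # mutate one character
        candidate = seed[:pos] + chr(random.randrange(256)) + seed[pos + 1:]
        cov = coverage_of(candidate)
        if cov == 4:                              # "bad!" reached: the crash
            return trial
        if cov > best:                            # coverage increased: keep it
            corpus.append(candidate)
            best = cov
    return None
```

Across many runs this loop needs on the order of ten thousand candidates before it reaches the crash, which matches the rough calculation above.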
Okay, so the insight is: if you have a really efficient fuzzer, just throw more machines at the problem. Remember, on my machine it takes 6.3 seconds; on 100 machines it takes 63 milliseconds to find the same bug.

So then you might think: if I have X times more machines, I can find X times more bugs, right? And again we are wrong. This is the empirical cost. We looked at a lot of data; on one axis is an exponentially increasing number of machines (it is a log scale), and on the other is the number of additional vulnerabilities discovered within the same time. We see that a linear increase in the number of new vulnerabilities discovered requires an exponential increase in the number of machines. This is the exponential cost.

Here is a little bit of an explanation, just the intuition. Think about just one bug while we increase the number of machines exponentially. For a long time, in terms of the number of machines, we won't see that bug; then there is an almost linear increase around a certain number of machines; and once we have discovered the bug the curve continues flat. If instead of just one vulnerability we have ten vulnerabilities, we get ten of these curves, and together they almost look like a straight line; and as we increase the number of vulnerabilities, or bugs, or whatever you want to measure, the overall curve gets more and more linear. The reason is that at any point some bugs are effectively never seen, the ones yet to be discovered, and some are always seen, the ones already discovered, and in between we have these almost linear increases. This is how we get a linear increase in discoveries for an exponential increase in the number of machines. So intuitively, each new vulnerability requires more resources than the previous vulnerability, and a constant rate of vulnerability discovery requires an exponentially increasing amount of resources.

So, what I showed you: white box fuzzing is a technique we really love to work with. It is really smart, it analyzes the program, and it is in fact the most effective. But in practice it is easily outperformed by a very dumb technique called black box fuzzing, which simply generates inputs at random. In fact black box fuzzing is so efficient that we can easily scale it across a lot of machines and find the same number of bugs X times faster. I also talked about a technique, gray box fuzzing, that leverages the efficiency of black box fuzzing but works its way through the program's paths a little like white box fuzzing. And then I explained how even these techniques, black box fuzzing and gray box fuzzing, are affected by an exponential cost. Thank you so much.

All right, thank you very much, Marcel. It's great to see this kind of quantitative approach to this problem. A question coming in from one of the viewers is: how much of the statistics do I need to understand to be able to apply these sorts of techniques?

Oh, to apply it there is no need to understand the statistics. Understanding the statistics is more of a help for you. For instance, there is the insight that you cannot just throw more machines at the problem and then hope to find more bugs; that doesn't work.
At some point you run out of machines, because the cost of finding more vulnerabilities is exponential. It is interesting to understand why this is the case, and we use statistics and probability theory to explain why. But when you apply fuzzing you don't need to understand the statistics.

Maybe one other thing that we explored: we are often interested in the probability that we find a bug given that we have not found one yet. Suppose you have run a campaign for 24 hours and you have not found any bug. You still want to know the probability that you find a bug with just one more generated input, and this is where you can use statistics.
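To illustrate that last point, here is one very simple estimate, the Laplace rule of succession, for the probability that the very next input finds a bug after n inputs found none. This is a stand-in chosen for illustration and not necessarily the estimator used in the residual risk work mentioned earlier.

```python
def laplace_estimate(n_inputs_without_bug):
    # Rule-of-succession estimate of the probability that the next input
    # triggers a bug, given that n inputs so far triggered none.
    return 1.0 / (n_inputs_without_bug + 2)

# Example: after a 24-hour campaign that generated one billion inputs
# without finding a bug, the estimated chance that the next input finds
# one is about 1e-9.
print(laplace_estimate(1_000_000_000))
```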