0:00:00.060,0:00:04.560
My talk is going to be a little different. My talk is going to be on the surprising
0:00:04.560,0:00:09.960
efficiency and exponential cost of fuzzing. So how can something be surprisingly efficient
0:00:09.960,0:00:14.220
and also come with an exponential cost? So the actual purpose of this talk is
0:00:14.220,0:00:20.640
to introduce you to the weirdness of randomness and probability theory.
0:00:20.640,0:00:26.040
So given the name of this event I won't talk much about the theory
0:00:26.040,0:00:34.380
but I want to talk about intuition. And one of the key messages that I want
0:00:34.380,0:00:39.120
you to take away from this talk is that we might have strong intuitions about solving a problem
0:00:39.120,0:00:44.580
but without a deep understanding of the problem sometimes our intuitions might lead us astray.
0:00:46.260,0:00:51.960
So what is fuzzing? Fuzzing is a testing process.
0:00:51.960,0:00:54.960
So here we are given a program and that program takes four characters.
0:00:56.520,0:01:01.740
One approach to test this program is called white box fuzzing - white box because
0:01:01.740,0:01:05.940
the analysis can actually see the internals of that program.
0:01:07.500,0:01:13.380
The program has four branches and we could now try to explore every path in
0:01:13.380,0:01:17.700
that program simply by trying to capture the conditions of each of these branches.
0:01:17.700,0:01:24.540
Say one input would exercise the path where the first character is not a 'b',
0:01:24.540,0:01:30.300
another input would exercise the path where the first character is a 'b' but the second
0:01:30.300,0:01:32.280
character is not an 'a', and so on.
0:01:32.280,0:01:37.680
So we could explore all paths and at the end after exploring the fifth path - number
0:01:37.680,0:01:40.500
five - we would see that the program actually crashes.
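To make the running example concrete, here is a minimal sketch (in Python, my own illustration, not the speaker's code) of the kind of program under test being described, assuming the crashing input is exactly "bad!", as the seed example later in the talk suggests:

```python
def program_under_test(s: str) -> None:
    # Assumes s always has exactly four characters.
    # The four nested branches give five paths in total: four early
    # exits plus the crashing path, reached only by the input "bad!".
    if s[0] == 'b':
        if s[1] == 'a':
            if s[2] == 'd':
                if s[3] == '!':
                    assert False  # the bug: an assertion violation
```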
0:01:40.500,0:01:45.420
And so in this sense this approach called white box fuzzing is actually most effective.
0:01:45.420,0:01:47.760
Why most effective? Because it can
0:01:47.760,0:01:52.920
actually prove the absence of an error, in this case of an assertion violation,
0:01:52.920,0:01:55.260
simply by enumerating all paths through the program.
0:01:55.260,0:01:59.520
We are making some assumptions, but in principle we could say that white box fuzzing
0:01:59.520,0:02:05.160
would be able to prove the absence of errors, and in this sense be most effective.
0:02:06.420,0:02:07.740
And also it's quite efficient,
0:02:07.740,0:02:10.650
so suppose
0:02:10.650,0:02:12.840
there are five paths in that program.
0:02:12.840,0:02:20.640
If we sample a path at random without replacement we would expect to take about three inputs to find
0:02:20.640,0:02:22.500
that crashing bug, right,
0:02:22.500,0:02:27.180
with inputs generated by enumerating all of these paths using this technique called white box fuzzing.
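As a quick sanity check of that number (my gloss, not from the talk): if exactly one of the n = 5 paths crashes and we try the paths in a uniformly random order without replacement, the crashing path is equally likely to come up at each of the five positions, so the expected number of tries is (n + 1)/2 = (5 + 1)/2 = 3.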
0:02:28.980,0:02:33.240
Now on the other side of the spectrum there's something called black box fuzzing.
0:02:33.240,0:02:34.800
Black box fuzzing doesn't know anything about the
0:02:34.800,0:02:37.500
program - it simply generates random
0:02:37.500,0:02:42.360
inputs for that program. So on most machines a
0:02:42.360,0:02:47.520
character can take one of 256 values, and suppose we just randomly generate
0:02:48.600,0:02:54.120
these values by sampling uniformly at random with replacement.
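A minimal sketch of such a generational black box fuzzer (my illustration in Python; the names are hypothetical):

```python
import random

def generate_input() -> bytes:
    # Generational black box fuzzing: sample each of the four
    # characters uniformly at random from its 256 possible values,
    # independently and with replacement.
    return bytes(random.randrange(256) for _ in range(4))
```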
0:02:56.400,0:02:59.580
Of course it can never prove the absence of errors,
0:02:59.580,0:03:05.880
in fact there's this great quote by Dijkstra saying program testing can
0:03:05.880,0:03:08.700
be used to show the presence of bugs, but never to show their absence.
0:03:09.300,0:03:14.340
More recently we've looked into this problem of giving
0:03:14.340,0:03:20.760
testing some kind of statistical guarantees so we can somehow compute what is called a
0:03:20.760,0:03:26.220
residual risk after a certain amount of testing, but the fact remains:
0:03:26.220,0:03:31.740
black box fuzzing is not most effective - it cannot prove the absence of errors.
0:03:31.740,0:03:37.200
Okay so then if we look at this from a statistical perspective,
0:03:37.800,0:03:42.000
I said earlier that white box fuzzing would require about three inputs in
0:03:42.000,0:03:44.880
expectation to find that bug. Now black box fuzzing,
0:03:44.880,0:03:49.320
if we sample each value for each character uniformly at random,
0:03:49.320,0:03:52.140
we would expect to generate about four billion inputs before
0:03:52.140,0:03:54.360
discovering this bug. That sounds like a lot.
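Where the four billion comes from: each character matches with probability 1/256 independently, so a single random input crashes with probability p = (1/256)^4, and the expected number of inputs until the first crash is 1/p = 256^4 ≈ 4.3 billion.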
0:03:55.200,0:03:58.680
So then white box fuzzing should be better, right? It's the most effective,
0:03:58.680,0:04:02.340
it's also quite efficient. And here is where our intuition goes
0:04:02.340,0:04:05.280
wrong for the first time - at least, not always.
0:04:06.240,0:04:13.920
Sometimes black box fuzzing wins. And this was discovered 30-40
0:04:13.920,0:04:18.840
years ago when people started doing experiments with these two techniques
0:04:20.940,0:04:25.500
with the hypothetically most effective technique and with just simple random sampling.
0:04:25.500,0:04:34.020
Somehow they found that given some limited time, this very simple, very dumb technique wins and
0:04:34.020,0:04:41.040
finds more bugs than the most effective technique. And so we looked at this very recently
0:04:43.140,0:04:48.180
from a statistical perspective and found that if a white box
0:04:48.180,0:04:52.980
fuzzer takes too long per input, the black box fuzzer outperforms the white box fuzzer.
0:04:52.980,0:04:55.560
Right, if the generation of one input takes
0:04:55.560,0:04:59.760
too long, then the black box fuzzer outperforms, because the black box fuzzer generates
0:05:00.300,0:05:04.440
inputs very fast. So on my machine it takes about 6.3
0:05:04.440,0:05:08.820
seconds to generate 4 billion inputs and if I had 100 machines
0:05:09.540,0:05:15.840
it would take 63 milliseconds, so we can easily scale bug finding,
0:05:15.840,0:05:19.740
because black box fuzzing is essentially embarrassingly parallel.
0:05:21.900,0:05:27.540
And so in this paper we actually give a probabilistic bound on this maximum time.
0:05:29.520,0:05:33.360
So then is black box fuzzing the best that we can do?
0:05:35.940,0:05:39.780
And the answer here is no - we can do even better than that.
0:05:40.440,0:05:44.340
So what I presented here was a generational black box fuzzer.
0:05:44.340,0:05:45.540
What it means is, I generate
0:05:46.980,0:05:51.540
values for each of these characters. But instead of trying
0:05:51.540,0:05:56.040
to generate inputs from scratch, which is the generational approach,
0:05:56.040,0:06:01.440
maybe we can reuse existing inputs - and this is called mutational black box fuzzing.
0:06:02.160,0:06:09.480
So suppose we have a seed input, which is "bad?". The bug-triggering input is "bad!".
0:06:10.260,0:06:14.400
So if we had a mutational fuzzer which
0:06:14.400,0:06:18.540
simply selects one of the four characters at random and then chooses a value for it at random,
0:06:18.540,0:06:24.540
then with a probability of one over four times one over 256 we generate the crashing input -
0:06:24.540,0:06:31.740
an expectation of about 1000 inputs, which is much less than four billion inputs.
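A minimal sketch of such a single-character mutational fuzzer (again my own illustration): starting from the seed "bad?", the crash probability per mutant is 1/4 × 1/256 = 1/1024, hence the expectation of roughly a thousand inputs.

```python
import random

def mutate(seed: bytes) -> bytes:
    # Mutational black box fuzzing: pick one of the four positions
    # uniformly at random (probability 1/4), then pick a uniformly
    # random replacement value for that character (probability 1/256).
    pos = random.randrange(len(seed))
    out = bytearray(seed)
    out[pos] = random.randrange(256)
    return bytes(out)
```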
0:06:32.340,0:06:38.340
But I cheated a little bit here. We chose a really good seed to start with,
0:06:38.340,0:06:42.720
so can we do even better? Can we somehow automatically
0:06:42.720,0:06:46.800
discover this seed input? And this is where the
0:06:46.800,0:06:52.380
third approach, gray box fuzzing, comes in. Can we take the advantage of black box
0:06:52.380,0:06:55.740
fuzzing and the advantage of white box fuzzing - can we kind of combine them?
0:06:56.280,0:06:59.100
And it's called gray box fuzzing because we don't actually analyze the program,
0:06:59.100,0:07:03.300
but we take some feedback, called coverage feedback, from the program's execution,
0:07:03.300,0:07:05.760
and we add generated inputs which increase coverage to the corpus.
0:07:06.780,0:07:12.240
So suppose we just start with a random input. The probability that we generate the first
0:07:12.240,0:07:16.080
coverage-increasing input is about one in a thousand (1/1024),
0:07:16.080,0:07:20.880
which means in expectation we require about 1024 inputs to
0:07:20.880,0:07:26.640
generate the first coverage-increasing input. We add it to the corpus, then we select the
0:07:28.800,0:07:33.540
next seed uniformly at random - in this case, this seed - and
0:07:33.540,0:07:37.980
the probability that we choose this seed, this character, and generate 'a' as the value
0:07:37.980,0:07:42.420
is about one in 2000, so we require about 2000 inputs. So we go on like that,
0:07:42.420,0:07:48.300
and easily, starting from a random input, we can
0:07:48.300,0:07:53.640
find the bug using only about 10,000 inputs. And now my machine takes just 150 microseconds.
0:07:53.640,0:07:58.620
So this is much faster than a symbolic execution tool, which requires
0:07:58.620,0:08:03.420
all this machinery - constraint solving, encoding, and so on - for its three inputs.
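The gray box loop just described could look roughly like this minimal sketch (my illustration; run_with_coverage is a hypothetical helper returning whether the input crashed and the set of branches it reached - real tools obtain this via instrumentation):

```python
import random

def greybox_fuzz(initial_seed: bytes, budget: int):
    corpus = [initial_seed]
    covered = set()
    for _ in range(budget):
        seed = random.choice(corpus)       # select a seed uniformly at random
        candidate = mutate(seed)           # one-character mutation, as above
        crashed, branches = run_with_coverage(candidate)  # hypothetical helper
        if crashed:
            return candidate               # found the bug
        if not branches <= covered:        # coverage-increasing input?
            covered |= branches
            corpus.append(candidate)       # keep it as a new seed
    return None
```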
0:08:05.280,0:08:09.120
And in the work that we presented at CCS'16 we also boosted gray
0:08:09.120,0:08:13.860
box fuzzing simply by choosing the seed which exercises the lowest-probability path,
0:08:13.860,0:08:18.060
and we go down further to four thousand inputs and to 55 microseconds.
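The talk doesn't show how that boosted choice is implemented; as a rough sketch in the same spirit (path_of and path_counts are hypothetical helpers mapping a seed to its path identifier and counting how often each path has been exercised):

```python
def choose_seed(corpus, path_counts):
    # Boosted gray box fuzzing: instead of picking a seed uniformly
    # at random, prefer the seed exercising the lowest-probability
    # (i.e., least often exercised) path.
    return min(corpus, key=lambda seed: path_counts[path_of(seed)])
```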
0:08:19.560,0:08:24.960
Okay so the insight is, if you have a really efficient fuzzer
0:08:24.960,0:08:30.060
let's just throw more machines at the problem. You remember on my machine it takes 6.3 seconds,
0:08:30.060,0:08:34.620
on 100 machines it takes 63 milliseconds to find the same bug.
0:08:36.540,0:08:40.920
So then you might think, well, if I
0:08:40.920,0:08:44.340
take X times as many machines, I can find X times more bugs, right?
0:08:45.120,0:08:50.520
And again we are wrong. So this is the empirical cost.
0:08:50.520,0:08:56.460
We looked at a lot of data. On the x-axis is an exponentially increasing number of machines -
0:08:56.460,0:08:59.400
it's a log scale here - and on the y-axis is the number of additional
0:08:59.400,0:09:04.620
vulnerabilities discovered within the same time. And we see that a linear increase in the number
0:09:04.620,0:09:08.880
of new vulnerabilities discovered requires an exponential increase in the number of machines.
0:09:08.880,0:09:13.980
And this is the exponential cost. And a little bit of an explanation here,
0:09:13.980,0:09:17.580
this is just the intuition. If you think about just one bug
0:09:18.360,0:09:22.980
and we increase the number of machines exponentially, then for a long time -
0:09:22.980,0:09:27.540
where 'a long time' is measured in the number of machines - we won't see that bug,
0:09:27.540,0:09:32.280
then there's an almost linear increase at a certain number of machines,
0:09:32.280,0:09:37.080
and then at some point we have discovered this bug, and so
0:09:37.080,0:09:41.160
we kind of go on with a straight line. If we now - instead of having just one
0:09:41.160,0:09:45.360
vulnerability we have 10 vulnerabilities, we get ten of these curves, and
0:09:45.360,0:09:48.660
together they almost look like a straight line; and as we increase the
0:09:48.660,0:09:53.340
number of vulnerabilities, the number of bugs or whatever you want to measure,
0:09:53.340,0:09:58.920
the result gets more and more linear. And the reason for this is that we have these
0:09:58.920,0:10:04.260
constant stretches - we either never see the one still to be discovered, or we
0:10:04.260,0:10:07.800
always see the one already discovered - and in between we have these
0:10:07.800,0:10:13.020
almost linear increases. This is how we get this linear
0:10:13.020,0:10:19.080
increase for an exponential number of machines. Okay so intuitively each new vulnerability
0:10:19.080,0:10:22.200
requires some more resources than the previous vulnerability
0:10:24.180,0:10:28.980
so a constant rate of vulnerability discovery requires an exponential amount of resources.
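One way to put this intuition in formulas (my gloss, not from the talk): if bug i is found by a single random input with probability p_i, and n machines together generate c·n inputs in the given time budget, then the expected number of bugs found is S(n) = Σ_i (1 − (1 − p_i)^(c·n)). Each term is a step-like curve that rises from near 0 to near 1 around n ≈ 1/(c·p_i); when the p_i are spread over many orders of magnitude, those steps are spread evenly along a log-scaled machine axis, so S(n) grows roughly linearly in log n.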
0:10:31.560,0:10:35.700
So what I showed you is white box fuzzing - we have a technique which
0:10:35.700,0:10:38.520
we really love to work with, which is really smart,
0:10:38.520,0:10:41.160
which analyzes the program, which in fact is the most effective,
0:10:41.160,0:10:45.540
but in practice it is easily outperformed by a very dumb
0:10:45.540,0:10:48.780
technique called black box fuzzing, which simply generates random inputs.
0:10:48.780,0:10:53.880
And in fact this is so efficient that we can easily scale it across a lot
0:10:53.880,0:10:58.620
of machines and find the same number of bugs X times faster.
0:11:00.060,0:11:03.420
I also talked about a technique, gray box fuzzing, that leverages the efficiency
0:11:03.420,0:11:07.800
of black box fuzzing but enumerates paths somewhat like white box fuzzing.
0:11:07.800,0:11:12.780
But then I explained how both black box fuzzing
0:11:12.780,0:11:17.160
and gray box fuzzing are somehow affected by an exponential cost.
0:11:18.000,0:11:23.820
Thank you so much. All right, thank you very much, Marcel.
0:11:25.200,0:11:28.620
It's great to see this kind of quantitative approach to this problem.
0:11:29.700,0:11:36.180
A question coming in from one of the viewers is, how much of the statistics do I need to understand
0:11:36.180,0:11:43.560
to be able to apply these sorts of techniques? Oh, so to apply it there is no
0:11:43.560,0:11:48.960
need to understand statistics. Understanding
0:11:48.960,0:11:55.860
the statistics is more of a help for you. For instance, the insight that you cannot just
0:11:55.860,0:12:00.480
throw more machines at the problem and hope to find more bugs - that doesn't work.
0:12:00.480,0:12:06.540
At some point you kind of run out of machines because the cost for
0:12:06.540,0:12:13.560
finding more vulnerabilities is exponential. It's interesting to understand why this is the
0:12:13.560,0:12:17.460
case and we use statistics and probability theory to explain why this is the case.
0:12:17.460,0:12:21.720
But when you apply fuzzing you don't need to understand the statistics.
0:12:21.720,0:12:29.580
Maybe one other thing that we explored: we are often interested in the
0:12:29.580,0:12:32.100
probability that we find a bug, given that we have not found one yet.
0:12:32.100,0:12:37.500
Suppose you have run a campaign for 24 hours and you have not found any bug.
0:12:37.500,0:12:40.560
You still want to know what is the probability that you find a
0:12:40.560,0:12:45.900
bug with just one more generated input, and this is where you can use statistics.
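As one concrete illustration of such an estimate (a standard textbook estimator, not necessarily the one used in the residual-risk work mentioned earlier): by Laplace's rule of succession, after n inputs with no crash, the probability that the very next input crashes can be estimated as 1/(n + 2); the residual-risk work develops more refined estimators in this spirit.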