So yes - we've been working on naturalness and bimodality, thanks for the intro, Greg, for many years now. These days every software engineer is excited about large language models and their use in code. Some people derisively call large language models "stochastic parrots", with some justification, so what I'm going to talk about for the next nine minutes or so is what's going on between these stochastic parrots that write code and the programmers who use them.

I just want to acknowledge support from DARPA - IARPA, actually - the National Science Foundation, Sandia National Labs, and the Humboldt Foundation.

So, a reality check: the fact is that Codex, GPT-x, and so on are now widely used to generate code. How much are people using this generated code, and does it actually help? And how good is this code? These two questions are what I'll be talking about for the rest of this talk.

I'm going to present four papers. I encourage you to go look at them - they're all very interesting, very good, and they use a wide range of methodologies; all of them are human-subject studies. The first is from CMU - Carnegie Mellon: a survey of 410 developers, mostly GitHub developers. The second is from Harvard: a controlled study of human subjects with a small sample, which is the nature of controlled studies. The control is the completion engine built into Visual Studio, called IntelliCode. All the subjects were university students; it is what it is.

Now, the results. In the CMU survey, the GitHub developers reported that 30% of their code was generated - that was the number they put out. They said it helps productivity, and 74% of them said they do a quick check of the code produced by Copilot before they actually use it: they get the completion, do a quick check, and then use it. They also complained that it wasn't very good at dealing with non-functional requirements like security and performance, and that it was hard to control the code being generated.

As for the students in the Harvard study: they said Copilot didn't help them much, that it produced many defects, and that the code it produced was hard to understand - but regardless of all that, the student subjects liked it anyway.

Those are the two papers from universities; now a couple from companies. The first is from Google - this one is not about Copilot; it uses their own completion engine.
The sample size is about 10,000, and they did the study using telemetry - in other words, they gathered data remotely to see how the generated code was actually being used. The second study, from GitHub, was more conventional in some ways: it used triangulation, a combination of survey and telemetry, so the results from one could be confirmed with the other.

The results: the Google study found that 3% of the code actually entered was generated. Cycle time - the time between the recurring things programmers do - was reduced by about 6%, and about 30% of the suggestions made by the completion engine were accepted by users. So it's a very large sample, but, as is typical of telemetry studies, you don't get much insight. The second study gives some other insight: 23% to 28% of the suggestions produced by Copilot were accepted by developers, and acceptance rates correlate very well with self-reported productivity according to the survey.

This is all quite interesting, and I encourage you to look at those papers; they're all out there. The Google study was not peer-reviewed - it was a blog post - but the others are.

My personal take on these large code language models: developers like them and use them, according to these surveys. It's not clear that they fully understand the code they're using - for me this is confirmed both by the studies and by personal, anecdotal conversations. We don't know what the personal software process looks like when people use these tools; I think that's still an open question - I don't know what's going on.

This probably won't surprise you, but in a surprisingly short amount of time every computer everywhere - laptops, mobile phones, toasters, microwaves, air traffic control, nuclear power plants, cruise missiles, you name it - will be running code generated by these language models. At this point that's kind of inevitable.

All right, this is the scariest slide in the talk. AI-generated code will be running everywhere, and the question, at the very basic level, is: do these large language models generate buggy code? Because if they do, this damn buggy code is going to be everywhere, right? That's the question we were interested in.

We have a paper coming up at MSR - Mining Software Repositories - which will be held in Melbourne.
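To make the telemetry metrics concrete, here is a minimal sketch of how the two headline numbers might be computed from completion-event logs. The record format and field names are assumptions for illustration; neither study publishes its schema in the talk.

```python
# Hypothetical telemetry records, one per suggestion shown to a developer.
# The field names are invented for illustration only.
events = [
    {"accepted": True,  "chars": 38},   # characters in the suggested snippet
    {"accepted": False, "chars": 12},
    {"accepted": True,  "chars": 55},
]
total_entered_chars = 2500              # all code the developer entered

# Fraction of shown suggestions the developer accepted (~30% at Google,
# 23-28% for Copilot in the GitHub study).
acceptance_rate = sum(e["accepted"] for e in events) / len(events)

# Fraction of entered code that came from accepted suggestions (~3% at Google).
generated_fraction = sum(e["chars"] for e in events if e["accepted"]) / total_entered_chars

print(f"acceptance rate: {acceptance_rate:.0%}")
print(f"code from model: {generated_fraction:.1%}")
```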
My student Kevin Jesse will be presenting it; he's also graduating with a PhD in language models applied to code, so please hire him. You can scan the QR code for the paper.

Here's what we did. We took a fairly ubiquitous dataset of one-line bug fixes from a thousand projects - about 17,000 samples after data cleaning. We can go back in the version-control history and find when each of these bugs was injected by a human. And since we know when they were introduced, we can prompt Copilot with the prefix of the code as it stood at the moment of introduction and see whether Copilot produces the buggy code or the fixed code. There are some problems with this approach, which I'll get to in a minute, but it does give you some insight.

One thing I should say: every sample in the dataset had been fixed by the time Copilot was trained - the dataset is now three years old, maybe four - so Copilot saw the fixed code. We are seeing what a model trained on the fixed code does.

The result is as follows: in about 13% of the cases, Copilot (Codex) reproduces the fixed code, and about twice as often it regurgitates the buggy code. Remember, it was trained on the fixed code, yet it reproduces the buggy code - maybe the buggy code feels more natural to the model, so that's what it produces.

There is also a lot of dark matter: output that matches neither the buggy code nor the fixed code exactly. We took 400 of these samples and examined them manually. In about 90% of the cases the output is gibberish - neither the bug nor the fix; it probably won't even compile, it's just random stuff. In about 5% of the cases it is code we could recognize as either the bug or the fix, just written in some other form. In some cases we simply couldn't tell.

Okay, I have about three minutes left, so let me mention a couple of other things we looked at. When Copilot generates these simple, stupid bugs, we asked whether they are stickier. We know how long each bug stayed in the version-control history, so a bug that stays longer can be thought of as sticky - in some sense people don't actually see it. So we looked at that as well.
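As a rough illustration of the replay methodology described above, here is a minimal sketch. The `complete` function is a stand-in for the completion model being probed, the record fields are my own names for the dataset's buggy/fixed line pairs, and the normalization is deliberately crude - the actual study's data cleaning and matching are more careful.

```python
def complete(prefix: str) -> str:
    """Stand-in for the completion model under study (e.g. Codex).
    Copilot itself exposes no public batch API, so this is a placeholder."""
    raise NotImplementedError

def normalize(line: str) -> str:
    # Crude whitespace normalization; the real study cleans data more carefully.
    return " ".join(line.split())

def classify(sample: dict) -> str:
    """Replay one one-line-bug sample: prompt with the code prefix as it
    stood when the bug was introduced, then compare the model's next line
    against the known buggy and fixed versions."""
    lines = complete(sample["prefix"]).splitlines() or [""]
    suggestion = normalize(lines[0])
    if suggestion == normalize(sample["buggy_line"]):
        return "buggy"    # the model regurgitates the human error (~2x the fix rate)
    if suggestion == normalize(sample["fixed_line"]):
        return "fixed"    # the model produces the repair (~13% of cases)
    return "neither"      # the "dark matter" that needs manual inspection

# Usage over records shaped like {"prefix": ..., "buggy_line": ..., "fixed_line": ...}:
# from collections import Counter
# print(Counter(classify(s) for s in samples))
```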
The other thing we looked at concerns the models themselves. These large language models have enormous prior spaces - the prefix conditions the model's prior over completions within this enormous space - and so they can be pushed around in that space. One way to push them around - metaphorically, to nudge them toward behaving like a good programmer - is to put comments in the code. So we added comments and checked whether Copilot still repeats those errors when the comments are there.
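To make the comment intervention concrete, here is a minimal sketch of the kind of paired-prompt comparison it implies. The prompt wording, the example function, and the `complete` stand-in (as before) are illustrative assumptions, not the paper's exact setup.

```python
# Probe whether a descriptive comment steers the model away from a known bug.
BARE_PREFIX = (
    "def find_user(users, name):\n"
    "    for u in users:\n"
)

COMMENTED_PREFIX = (
    "def find_user(users, name):\n"
    "    # Return the first user whose name matches exactly; otherwise None.\n"
    "    for u in users:\n"
)

def comment_steers_away(complete, buggy_line: str) -> bool:
    """Compare completions with and without the guiding comment."""
    bare = complete(BARE_PREFIX)
    commented = complete(COMMENTED_PREFIX)
    # The intervention helps if the comment makes the buggy suggestion disappear.
    return buggy_line in bare and buggy_line not in commented
```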
All right, this is my last slide. The takeaways: programmers love these plugins, for good or ill. These large language models often recapitulate human errors, and when they do, those errors appear to stick around longer - maybe because large language models produce code that looks natural, which makes the errors harder for human eyes to see. The good news is that we can improve the models' performance by adding comments. More details are in the paper.

My main take continues to be that developers will use these large language models, that the mistakes these models make seem to somehow survive human review, and that these errors may well be stickier. With that I'll stop and take questions.

All right, thank you very much, Prem. Obviously there's a lot of interest in this topic, and that's an understatement. One of the comments that has come in: do you believe that as we come to rely more and more on LLMs for code generation, it will become harder for new languages to gain a following? They won't have a corpus of code to train the LLMs on, so the tools programmers are used to simply won't work as well, and we'll be stuck with present-day languages. How do you feel about that?

Yeah, interesting question. My sense is that new languages are a passion project: the people using them in the beginning, the ones generating the first large corpora in these new languages, are not going to be affected by whether they have Copilot or not. So maybe what will happen is that once there's enough data, you take one of these large language models and fine-tune it. They are very quick at learning new things, because they've been trained on so much of what human expression looks like across languages. If the new language is something completely different - like APL or Haskell - maybe they'll have a bit of a tough time, but in most cases I think it'll be okay. I might be wrong; it's just a prediction.

Okay, and we have a second question that's come in, one that I've been wondering about as well. If an LLM generates buggy code and it goes into production, who do we blame for faults? Do we blame the programmer who shipped the code, or whoever trained the model? Where does responsibility lie?

Great question. I would love for the Canadian Parliament or the European Union to pass a law on this, because it ain't going to happen in America.