0:00:05.440,0:00:12.400 So today, I am Katie, and we're going to be  talking about my research which is in search, 0:00:12.400,0:00:18.320 but as we are developers we are focusing  this on not just any search but code search. 0:00:18.320,0:00:23.680 So code search is a term that I'm using to  describe the process of using search in an IDE, 0:00:23.680,0:00:28.240 in a browser, in another bespoke interface,  to locate some code related artifact. 0:00:28.960,0:00:35.440 And there are many ways - many  interfaces that exist for code search. 0:00:35.440,0:00:40.160 The most common by and large is  generic Google, and we've all done it. 0:00:40.160,0:00:44.960 We want some example code, we want to know how to  use an API, we want more information on something 0:00:44.960,0:00:49.680 code related like an error message, so we go  straight to Google, we issue a textual query, 0:00:49.680,0:00:55.040 and we hopefully find the result within  the first few results - first few queries. 0:00:55.760,0:00:59.360 There are other types of interfaces  that are increasing with sophistication. 0:00:59.360,0:01:06.640 One example is out of the GitHub ecosystem, we  have a textual query here with various ways - you 0:01:06.640,0:01:10.960 can filter on what you want, if you want a commit,  if you want an issue, if you want a discussion, 0:01:11.520,0:01:16.400 and then also ways of filtering based  on language APIs, things like that. 0:01:16.960,0:01:22.800 Also out of GitHub in the code search space moving  away from the textual query, is getting into 0:01:22.800,0:01:27.600 comment to code query, or  this is GitHub's Copilot, 0:01:27.600,0:01:33.680 or a little bit of code to code search here. And then there's also a lot of interfaces 0:01:33.680,0:01:37.280 that are within IDEs within specific  companies this is one that is similar to, 0:01:38.240,0:01:43.040 say, Google search - code search  tool that existed several years back. 0:01:43.040,0:01:48.240 So code search it turns out is happening all  the time, and we did a study of developers 0:01:48.240,0:01:53.200 at Google, and it turns out that it was  happening 12 times per developer per day. 0:01:53.200,0:01:57.520 And this may not sound like a lot, except when  I'm talking about 12 times a day, I mean search 0:01:57.520,0:02:02.160 sessions, and each - not individual query. So a search session involves 0:02:02.160,0:02:07.840 an average of three to four queries, a bunch  of clicks, some navigation, many minutes, 0:02:08.400,0:02:13.680 it's more than just a single search, and  so this means it's taking a lot of time. 0:02:14.320,0:02:18.480 And then when we looked at the  Google-specific tools, so, you know, 0:02:18.480,0:02:26.320 generic Google information search, and segmented  search logs based on non-code search and code 0:02:26.320,0:02:30.800 search, it turned out code search queries  took more time, more clicks, and required 0:02:30.800,0:02:36.000 more reformulations, which means more effort. And when we're looking at things that are 0:02:36.000,0:02:40.160 taking more effort, that means there's  an opportunity for improved support. 0:02:41.840,0:02:47.360 In our code search research we found that  there are four distinct needs that cropped 0:02:47.360,0:02:50.080 up as being the most common. So I want to talk about these. 0:02:50.960,0:02:56.000 One of them - the most common reason  developers were searching - were how questions. 0:02:56.000,0:03:00.800 How do I use this API, may I have some example  code that shows me how to do something. 0:03:00.800,0:03:04.720 This is about a third of the searches. And we got this information by surveying 0:03:04.720,0:03:07.600 developers while they were  using code search tools. 0:03:09.120,0:03:15.120 The second most common type of question was  a what question: what is this code doing? 0:03:15.120,0:03:18.240 Kind of a comprehension question. And this is about a quarter of the time. 0:03:18.960,0:03:21.440 Sometimes it can be answered  with comments, sometimes not. 0:03:22.960,0:03:26.880 Where in the code base something is  located, this is a localization question, 0:03:26.880,0:03:29.520 I got this error message I want  to know where it's coming from. 0:03:30.160,0:03:36.400 That's about 16 percent of the of the searches. And then why questions: why is something 0:03:36.400,0:03:42.880 happening was another about 16 percent. So I want to focus here in the beginning on 0:03:42.880,0:03:46.240 these how questions, because this  is the majority of questions. 0:03:48.240,0:03:50.480 So when you have a how  question, typically you have 0:03:50.480,0:03:56.080 some information and you want some information. With our current tools, Copilot notwithstanding, 0:03:56.080,0:04:01.040 we typically have a textual query or we  formulate our question as a textual query, 0:04:01.040,0:04:06.880 but what we really might want is example code. Another - and this is either - this is supported 0:04:06.880,0:04:08.640 well today. Another 0:04:10.080,0:04:19.200 method of looking for example code could  be to have a function as your query and 0:04:19.200,0:04:24.560 you want a function as your result. This would be a code to code search. 0:04:24.560,0:04:27.520 This can be useful in education  if you're looking for alternate 0:04:27.520,0:04:32.480 implementations of the same algorithm. It can also be useful for learning a 0:04:32.480,0:04:37.760 new programming language or doing translation. Here we have an example of translation: we have 0:04:37.760,0:04:45.120 a query in one language and a result in another. And so in order to facilitate code to code search, 0:04:45.120,0:04:48.480 it require - especially across  languages - it requires some language 0:04:48.480,0:04:53.840 agnostic - language independent analysis. So ideally there exists a mystery box - I 0:04:53.840,0:04:58.960 would love for this mystery box to exist  - where input comes in and it's a query, 0:04:59.600,0:05:05.200 and what comes out is code that behaves the same,  in the same language, maybe in another language, 0:05:05.200,0:05:09.840 and actually even more ideally it's not  just one result, but it's many results. 0:05:10.400,0:05:16.080 So this is what code to code search could look  like in an ideal environment, where we have 0:05:16.800,0:05:22.240 multi-language analysis and we are able to say,  this behavior is the same as this other behavior. 0:05:22.960,0:05:28.000 But to do this precisely we must have guarantees -  we must guarantee that for any two chunks of code 0:05:28.000,0:05:34.080 they behave the same and therefore they terminate. And so we run into the Halting Problem. 0:05:35.600,0:05:41.360 So in fact if we want code to code search to be  precise it'll never work in theory, but there is 0:05:41.360,0:05:46.320 some evidence that it can work in practice. So this is what we've been up to. 0:05:48.160,0:05:52.320 In code to code search, we have  lots of source code at our disposal. 0:05:52.960,0:05:56.640 We've had a lot of talks today on  mining GitHub, that's where we go to. 0:05:57.440,0:06:01.600 And let's say we have a bunch of snippets,  and these snippets all exist on GitHub, 0:06:01.600,0:06:07.520 and they have some - they have some similarities  - so there are - so there are three languages that 0:06:07.520,0:06:12.480 are represented here, some of these have similar  - have the same behavior, others have similar 0:06:12.480,0:06:17.040 behaviors, and under some inputs they behave the  same, under other inputs they behave differently, 0:06:18.000,0:06:23.680 and then we have other, like, depending on how  you chunk it, the behavior can also be the same. 0:06:24.880,0:06:30.400 And then we also have similarity across structure. So with all these levels of similarity, 0:06:31.600,0:06:37.120 we are able to draw parallels between different  areas of the source code and exploit these 0:06:37.120,0:06:43.280 similarities to create effective search. What we did was, we took source code and we 0:06:43.280,0:06:49.200 indexed it using three different dimensions to  create a multi-dimensional similarity analysis. 0:06:49.200,0:06:54.640 We use tokens or context because people can  write code with any variable name that they want, 0:06:54.640,0:06:58.320 but it turns out they tend to do it in a pretty  natural way, and we can use that information. 0:06:59.040,0:07:04.480 We took behavior using fuzzing, and we  took structure using a language agnostic 0:07:04.480,0:07:09.040 AST - abstract - abstract syntax tree. And we combined those all together 0:07:09.040,0:07:13.680 in a multi-dimensional analysis using  non-dominated sorting, so given a query 0:07:13.680,0:07:19.600 we got a bunch of similar code, and it was good. The results were very high precision, it worked 0:07:19.600,0:07:24.880 exactly as we wanted except in one dimension: it was dreadfully slow. 0:07:26.320,0:07:31.040 And - and so we needed to go back to the drawing  board, because it was so slow it wasn't practical. 0:07:31.760,0:07:35.280 Well, it turns out because people write code in  natural way we can take advantage of patterns. 0:07:36.000,0:07:39.120 So when we replace our slow  algorithm with something really 0:07:39.120,0:07:44.800 super fast for machine learning and embeddings, we have to remove the behavior to behavior piece, 0:07:44.800,0:07:48.640 but it turns out it doesn't matter,  because what we end up here with 0:07:49.280,0:07:56.000 is a fast approach to code to code search without  a loss of precision, and that's because of the 0:07:56.000,0:08:00.160 patterns that exist in source code - source code. So code code search is coming 0:08:01.360,0:08:06.800 and it helps us answer one of these needs  - it helps us answer questions about how. 0:08:07.840,0:08:11.840 But there are other questions that developers  are asking, and they are using search to 0:08:12.480,0:08:16.240 explain: what questions, where  questions, and why questions. 0:08:18.240,0:08:21.200 And code search isn't great  for all of these needs. 0:08:21.200,0:08:25.840 Code search is a hammer and not every  one of these questions is a nail. 0:08:26.720,0:08:30.960 So let's look at them one by one. So with example code, wanting to know 0:08:30.960,0:08:35.360 how questions, this can be done in practice with  search - you've done it yourself with GitHub, 0:08:35.360,0:08:38.320 with Google, I've shown you how  it can be done with code to code 0:08:38.320,0:08:43.280 search - a new kind of modality of searching. If you want to do what, if you want to ask answer 0:08:43.280,0:08:47.120 what questions, a quarter of the time these are  code comprehension questions, this isn't search. 0:08:47.760,0:08:52.240 And in fact if you are asking your  search engine what your code is doing, 0:08:52.880,0:08:56.800 you're asking it to tell you something about  the code - tell you a story about the code that 0:08:56.800,0:09:02.240 might not be true - and so you're assuming some  risk with that, and you are allowing yourself to 0:09:02.240,0:09:06.400 potentially misunderstand, so that's probably not  the best question to turn to a search engine for. 0:09:06.960,0:09:10.240 Where questions in the IDE,  talking about code navigation, 0:09:10.240,0:09:14.160 this is naturally supported, but why questions: 0:09:14.160,0:09:18.160 when we're talking about why is the code  behaving this way, we're talking about causality, 0:09:18.160,0:09:23.520 we're talking about impact analysis, and maybe the  on causal testing by Brittany Johnson might be a 0:09:23.520,0:09:30.800 better approach for those types of questions. So because these situations of code search, 0:09:30.800,0:09:35.200 where code search is not necessarily the right  tool for the question, occurs in 40 percent of 0:09:35.200,0:09:40.400 the questions based on the data we studied, it's  really important to know why you're searching. 0:09:41.680,0:09:44.960 Code search happens 12 times  a day, it takes a lot of time, 0:09:44.960,0:09:50.560 it takes a lot of queries, so if you're asking  how or where questions code search is your friend. 0:09:51.440,0:09:56.320 If you want to ask a what question - what is  the code doing - you're probably not - it's 0:09:56.320,0:09:58.960 probably not in your best interest  to start with the search engine. 0:09:58.960,0:10:02.320 You may be better off phoning a  friend, looking for pair programming. 0:10:03.760,0:10:08.160 Why questions are better rephrased starting  with the where, which is more naturally 0:10:08.160,0:10:11.840 answered with the search engine. So code search i not going away. 0:10:11.840,0:10:17.200 There are many - there may be many ways in the  future to answer these what and why questions with 0:10:17.200,0:10:27.040 more research, so stay tuned for that. Thank you for listening.