0:00:00.000,0:00:06.000 Next up we have Christian Newman for the last - for the hour here - who's going to talk about how 0:00:06.000,0:00:10.440 we can craft strong identifier names. Christian - educate us! 0:00:11.100,0:00:12.540 All right - can you all hear me? 0:00:14.280,0:00:15.540 Perfect. Great. 0:00:15.540,0:00:19.380 So my name is Christian I'm the head researcher at the SCANL laboratory, 0:00:19.380,0:00:23.460 so if you enjoy what we talk about today please feel free to visit our website which you can find 0:00:23.460,0:00:27.660 at the bottom of this slide scanl.org. So today I wanted to talk to 0:00:27.660,0:00:31.080 you about identifier names, and so on this slide you see several 0:00:31.080,0:00:33.540 different types of identifiers, some of them have one or 0:00:33.540,0:00:35.940 two different words in them, some of them have prepositions, 0:00:35.940,0:00:39.300 but a lot of them probably seem a little familiar to you 0:00:39.300,0:00:42.300 because you've probably seen identifiers that look like these 0:00:42.300,0:00:45.360 or perhaps even identifiers that look exactly like these 0:00:45.360,0:00:47.460 and I want you to keep that in the back of your mind as we're talking 0:00:47.460,0:00:52.260 because we're going to discuss why some of these look familiar to you as we go through. 0:00:53.040,0:00:56.880 So one question I wanted to ask about these is, well, are they good identifiers? 0:00:56.880,0:00:59.820 And if we kind of scan back here you'll probably look at this list, 0:00:59.820,0:01:01.800 you'll say, ah maybe, you know it depends, 0:01:01.800,0:01:02.880 like, you look at the code and 0:01:02.880,0:01:06.000 you might come to some conclusion, but the point here is that 0:01:06.000,0:01:09.000 we would have to talk about it. As developers we'd have to perform code reviews, 0:01:09.000,0:01:11.880 we'd have to look at the code, we'd have to come to some collective 0:01:11.880,0:01:15.120 conclusion about the quality of these. 
And so I think identifier naming is 0:01:15.120,0:01:17.460 at least partially subjective - it depends on the person that's 0:01:17.460,0:01:21.120 reading it and on the code context and what the person is trying to convey. 0:01:21.660,0:01:26.340 There's no objective quality metric for what makes a high-quality identifier. 0:01:26.340,0:01:30.660 But we've all seen it, right, so everybody's looked at some 0:01:30.660,0:01:33.480 code and they've been like, ah, this - I don't - I don't know what this is doing, 0:01:33.480,0:01:36.360 why did they do it that way, what does PCL mean, 0:01:36.360,0:01:39.240 why is this a single letter, I don't understand what this is, 0:01:39.240,0:01:41.160 I hate Hungarian, why are they using Hungarian, 0:01:41.160,0:01:43.140 why is there an alphabet in front of my identifier. 0:01:43.140,0:01:48.360 So we know a bad identifier when we see it, even if we don't always know exactly why it's 0:01:48.360,0:01:53.040 bad or what makes it a bad identifier, we've all seen something that just made 0:01:53.040,0:01:57.600 us kind of, you know, cringe. So a metric or some way of being 0:01:57.600,0:02:00.540 able to understand the quality of these would be very useful. 0:02:00.540,0:02:04.200 It could tell us - instead of us having to talk about it we could look at the metric 0:02:04.200,0:02:07.020 and the metric says, hey, you know, this is good, and it 0:02:07.020,0:02:10.200 could give us some reasons for why it's good, and we might, you know, look at those and say, 0:02:10.200,0:02:13.320 okay, I understand why it's named that way now, it makes sense, okay, great. 0:02:14.220,0:02:17.280 Unfortunately, making a metric for this is very difficult, right. 0:02:17.280,0:02:21.120 There are different - and one of the big reasons for this is that there are different perspectives 0:02:21.120,0:02:25.560 that each identifier has to satisfy.
You could be talking to a student or an 0:02:25.560,0:02:28.500 expert or a senior developer and each of these groups 0:02:28.500,0:02:34.560 needs a different kind of identifier. A senior understands the idioms and the - and the 0:02:36.000,0:02:39.240 terminology and the different grammatical structure of 0:02:39.240,0:02:42.540 identifiers that software engineers use, whereas a student has never seen it before. 0:02:42.540,0:02:47.400 And I want you to pay attention to that because students don't understand the 0:02:47.400,0:02:50.040 language of software development whereas a senior does, 0:02:50.040,0:02:55.200 and there's something important there, because what I'm saying here is that we've 0:02:55.200,0:02:58.500 created a language for ourselves internal to the software development community, 0:02:58.500,0:03:02.100 which is how we convey the meaning of software to one another. 0:03:02.100,0:03:06.840 And people outside of that community don't understand what we're talking about using this, 0:03:06.840,0:03:09.480 even though they might speak the same human languages we do so - 0:03:09.480,0:03:12.000 they might all speak English but they don't understand what we're talking about. 0:03:12.000,0:03:17.460 But program comprehension is critically important. It's the thing that we have to 0:03:17.460,0:03:19.980 do before you do anything else. If you're going to add a feature, 0:03:19.980,0:03:22.860 if you're going to fix a bug, you need to understand the software. 0:03:22.860,0:03:27.060 So we actually really should be paying attention to this - to this phenomenon. 0:03:27.060,0:03:32.220 Okay, so what makes this hard? 
Well, over the course of decades, 0:03:32.220,0:03:35.760 we as developers have created this sub-language and the sub-language, 0:03:35.760,0:03:39.300 which is based on a human language - so for a lot of us that would be English 0:03:39.300,0:03:44.880 but it can be other human languages as well - and this sub-language specializes English into 0:03:44.880,0:03:48.960 certain grammatical phrases and idioms that we use to describe software - 0:03:48.960,0:03:52.380 to describe the behavior of software. On top of that, 0:03:52.380,0:03:56.040 this is 70% of the code, so if you look at the characters in the code, 0:03:56.040,0:03:58.620 70% of all characters in the code belong to identifiers, 0:03:58.620,0:04:02.280 and so if you don't understand that sub-language that we're using then you 0:04:02.280,0:04:04.020 don't understand the code, you can't read it, 0:04:04.020,0:04:07.920 it's a foreign language too, essentially. And we - and the other thing to realize 0:04:07.920,0:04:11.400 here is that we didn't design this language but we did create it, right, 0:04:11.400,0:04:17.400 so as developers we all evolved this language over time to deal with historical contexts. 0:04:17.400,0:04:19.320 So for example, old languages used 0:04:19.320,0:04:23.100 to limit the number of characters that you could use to - just to create a variable. 0:04:23.100,0:04:28.860 This forced us to have to use much shorter speech to say the same amount of text that you would 0:04:28.860,0:04:31.740 say in a sentence, right. So we created this but 0:04:31.740,0:04:34.560 we didn't design it up front - we didn't come up with its rules, 0:04:34.560,0:04:39.060 we had to evolve it to deal with our jobs and our - and our education.
0:04:40.140,0:04:44.700 So the thing that my lab is trying to do is study this sub-language, 0:04:44.700,0:04:48.840 to understand what are the rules that we have collectively curated over 0:04:48.840,0:04:52.560 time so that we can make the rules explicit, so that everybody understands what they are, 0:04:52.560,0:04:56.940 find ways to measure and describe the effect that these rules have on 0:04:56.940,0:05:01.980 comprehension for different types of readers, so for novices versus experts, for example, 0:05:01.980,0:05:08.340 and use this understanding to create better name creation, appraisal, and maintenance approaches. 0:05:08.340,0:05:12.000 So I wanted to talk to you today about one particular approach 0:05:12.000,0:05:14.700 and that's an approach that my lab has been studying a lot lately, 0:05:14.700,0:05:16.980 which is grammar patterns. The way that you get a grammar 0:05:16.980,0:05:20.850 pattern is you take an identifier - so you can see two on the screen here - 0:05:20.850,0:05:23.820 you apply a splitter - the splitter splits these 0:05:23.820,0:05:27.780 identifiers into their constituent words - and then you apply a part-of-speech tagger that 0:05:27.780,0:05:30.360 is specialized for the context of software, because remember, 0:05:30.360,0:05:33.360 because software follows a different grammatical structure, 0:05:33.360,0:05:35.400 you can't just take an off-the-shelf part-of-speech tagger, 0:05:35.400,0:05:39.660 you need to use specialized NLP techniques to deal with these things. 0:05:40.380,0:05:43.560 And what we did was, we did a study where we took a lot 0:05:43.560,0:05:48.960 of code and we looked through how this code - or we looked at the identifier name 0:05:48.960,0:05:50.940 structure for all this code, so we collected a bunch of 0:05:50.940,0:05:54.420 grammar patterns and we looked at how they were being used in the software, 0:05:54.420,0:05:58.380 and so I want to take a quick look at this.
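The split-then-tag steps described above can be sketched roughly as follows. This is a toy illustration, not the lab's tooling: the function names, the tiny word lists, and the tag abbreviations (N for head noun, NM for noun modifier, P for preposition, V for verb) are all assumptions made for the example, and a real software-specific tagger would be far more capable than a lookup table.

```python
import re

# Hypothetical mini-lexicons; a real tagger would be trained on software text.
PREPOSITIONS = {"to", "on", "from", "of", "in", "with", "for", "by"}
VERBS = {"get", "set", "is", "has", "convert", "parse", "create"}


def split_identifier(name):
    """Split an identifier on underscores and camelCase boundaries."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def tag_words(words):
    """Assign a coarse part-of-speech tag to each word (toy rules only)."""
    tags = []
    for i, word in enumerate(words):
        if word in PREPOSITIONS:
            tags.append("P")
        elif i == 0 and word in VERBS:
            tags.append("V")
        elif i == len(words) - 1:
            tags.append("N")   # assume the last word is the head noun
        else:
            tags.append("NM")  # assume earlier words modify the head noun
    return tags


def grammar_pattern(name):
    """Return the identifier's grammar pattern as a space-separated tag string."""
    return " ".join(tag_words(split_identifier(name)))


print(grammar_pattern("selectionWidthLabel"))  # NM NM N
print(grammar_pattern("toString"))             # P N
```

The positional fallback (last word is the head noun, everything before it is a modifier) is exactly the sort of shortcut that breaks on real code, which is why the talk stresses a tagger specialized for software.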
If you scan down in this little catalog, 0:05:58.380,0:06:02.100 what we did was we basically took the taxonomy and we 0:06:02.100,0:06:05.940 listed out every single pattern that we found. So these are grammatical structures. 0:06:06.540,0:06:09.840 So the first one here is a noun phrase. What a noun phrase is, is a sequence of 0:06:09.840,0:06:14.700 noun modifiers followed by a head noun. And so we can see examples of that down 0:06:14.700,0:06:19.560 here where we have the head noun "label". The head noun typically represents what 0:06:19.560,0:06:23.940 the - what the entity - what the variable actually is trying to convey in the code, 0:06:23.940,0:06:28.200 and then the words that come to the left, the noun modifiers, are descriptive, 0:06:28.200,0:06:32.280 so they function as adjectives. In fact noun modifiers are what 0:06:32.280,0:06:34.260 we call noun adjuncts, which are effectively 0:06:34.260,0:06:39.120 nouns that are posing as adjectives. And so in this case what we have is a label 0:06:39.120,0:06:44.040 that is specifically a selection width label, so this is descriptive of the type 0:06:44.040,0:06:48.120 of label that we're talking about. If we go down we'll see prepositional phrases. 0:06:48.660,0:06:54.840 Prepositional phrases tend to be identifiers that deal with, for example, conversion, 0:06:54.840,0:06:58.440 so converting to a string or performing some kind of an event, 0:06:58.440,0:07:01.440 so, like, "on click", "on button press", that kind of stuff, 0:07:01.440,0:07:04.860 and so what I want you to kind of get, since we don't have enough time 0:07:04.860,0:07:07.920 to go through all of these, is that different patterns correlate 0:07:07.920,0:07:13.260 to different types of behavior in the software. And so really what we're looking at here is a 0:07:13.260,0:07:16.860 glimpse of this sub-language that I've been kind of talking about up to this point.
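As a rough illustration of the two structures just described, the sketch below classifies an already-tagged identifier as a noun phrase (zero or more noun modifiers followed by a head noun) or a prepositional phrase. The tag abbreviations, the regular expressions, and the function names are assumptions for this example and cover only the simplest shapes, not the full catalog.

```python
import re

# Assumed tag abbreviations: NM = noun modifier, N = head noun, P = preposition.
NOUN_PHRASE = re.compile(r"^(NM )*N$")     # e.g. "NM NM N"
PREP_PHRASE = re.compile(r"^P( NM)* N$")   # e.g. "P N", "P NM N"


def classify(pattern):
    """Name the phrase structure of a space-separated tag pattern."""
    if NOUN_PHRASE.match(pattern):
        return "noun phrase"
    if PREP_PHRASE.match(pattern):
        return "prepositional phrase"
    return "other"


def head_and_modifiers(words, tags):
    """For a noun phrase, return (head noun, list of noun modifiers)."""
    if classify(" ".join(tags)) != "noun phrase":
        raise ValueError("not a noun phrase")
    return words[-1], words[:-1]


print(classify("NM NM N"))  # noun phrase
print(classify("P N"))      # prepositional phrase
print(head_and_modifiers(["selection", "width", "label"], ["NM", "NM", "N"]))
# ('label', ['selection', 'width'])
```

Here "label" comes out as the head noun and "selection" and "width" as its modifiers, matching the reading given in the talk.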
0:07:17.940,0:07:23.520 We've created these patterns of speech that allow us to convey behavior very quickly to one another 0:07:23.520,0:07:28.800 without having to be very explicit about what - what nuances there are under the 0:07:28.800,0:07:31.920 text - there's a lot of subtext here. And so these are some of the basic 0:07:31.920,0:07:35.400 natural language phrasal structures that we as developers have created over the 0:07:35.400,0:07:41.280 last 50-100 years or whatever. So what we did with these is, 0:07:41.280,0:07:45.600 we created a little bit of a tool - this is just kind of an initial way to 0:07:45.600,0:07:48.180 start trying to address the problem - and what this tool does is, 0:07:48.180,0:07:50.100 it looks at the grammar patterns in your code 0:07:50.760,0:07:52.860 and it looks at some of the surrounding code 0:07:52.860,0:07:56.880 and it gives you a recommendation. So in this case we have the variable "characters". 0:07:57.960,0:08:01.680 Our tool sees that the type of this is singular, so it's character, 0:08:02.400,0:08:06.840 and it's recommending that you not use a plural, but instead use a singular noun, 0:08:06.840,0:08:10.440 so basically make this "character" singular instead of "characters". 0:08:11.220,0:08:13.680 And so that's just a quick example of what this tool does. 0:08:13.680,0:08:18.600 And note here that it gives examples of what it's talking about when it gives these patterns, 0:08:18.600,0:08:22.080 and also an explanation, which are two things that we think are 0:08:22.080,0:08:30.060 very, very important for these types of tools. So our goal for this work in the future is to 0:08:30.060,0:08:33.060 fully explore the diversity of grammar patterns, so we - there are more of these, 0:08:33.060,0:08:37.320 we want to find out how many of these exist so that we can again make them explicit.
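The "characters" example can be approximated with a small check like the one below. It is only a sketch of the idea, not the tool shown in the talk: the function names are hypothetical, the plural test is a naive suffix heuristic, and whether the declared type holds many values is passed in by hand rather than read from the code.

```python
def looks_plural(word):
    """Naive heuristic: ends in 's' but not in a common singular ending."""
    w = word.lower()
    return w.endswith("s") and not w.endswith(("ss", "us", "is"))


def naive_singular(word):
    """Very rough singularization; wrong for many English nouns."""
    if word.lower().endswith("ies"):
        return word[:-3] + "y"
    return word[:-1]


def recommend(name, type_holds_many):
    """Return a suggestion with an explanation, or None if nothing to flag."""
    if looks_plural(name) and not type_holds_many:
        return {
            "suggestion": naive_singular(name),
            "explanation": (
                f"'{name}' is a plural noun, but its type holds a single "
                "value; a singular head noun matches the type."
            ),
        }
    return None


# A variable named "characters" whose type is a single character.
print(recommend("characters", type_holds_many=False)["suggestion"])  # character
```

Returning the explanation alongside the suggestion mirrors the point made in the talk: a recommendation is hard to act on if the reader cannot see why it was made.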
0:08:37.320,0:08:43.200 We want to create data-driven naming guidelines, so these are sets of measurements that can look at 0:08:43.200,0:08:46.500 different aspects of the identifier name - does it contain abbreviations, 0:08:46.500,0:08:52.020 does it contain domain terminology, etc. And we want to make these so that we can 0:08:52.020,0:08:56.880 understand how they affect the human reading it, because if - because again, 0:08:56.880,0:09:01.620 if you don't understand who is trying to read the code then you can't give a metric that tells 0:09:01.620,0:09:05.040 you how easy it is for them to read it. If I don't know who's looking at this 0:09:05.040,0:09:07.620 then I can't tell you if this is a good identifier for them or not. 0:09:08.220,0:09:10.980 And then we want to create a framework that optimizes, 0:09:10.980,0:09:16.380 or that helps us optimize names - that prioritizes the reader and explainability. 0:09:16.380,0:09:22.740 So another core tenet of this is that if you can't explain to people why this is 0:09:22.740,0:09:27.180 good or bad or what is good or bad about it then they can't make good decisions about 0:09:27.180,0:09:29.880 whether this is a recommendation that they should be taking. 0:09:29.880,0:09:34.680 So if we just give them a black box that tells them what the best thing is they won't be able 0:09:34.680,0:09:37.440 to tell me if this is good or bad because they don't know why the box 0:09:37.440,0:09:40.740 is recommending that - that particular pattern to them. 0:09:40.740,0:09:45.780 And so then our last goal is obviously educating developers at all levels about this language 0:09:45.780,0:09:49.170 so that they can correctly express themselves when they're - 0:09:49.170,0:09:52.920 so they can express themselves as optimally as possible to others when they're writing code. 0:09:52.920,0:09:56.760 And so that's the end of my presentation.
If you're, again, if you're interested, 0:09:56.760,0:10:00.780 you can visit our website at scanl.org, the little QR code takes you to 0:10:00.780,0:10:06.120 our - the name structure catalog which also contains links to papers in our webpage 0:10:06.120,0:10:10.020 if you're - if you're looking to find those. And that's it - thank you.