So good morning everyone, my name is Kai, and I'm speaking to you from cold and rainy Maine this morning. Like Ethel from our last section, my focus is in computer science education as well, so I'm presenting a recent project I did as I was finishing up my PhD last year.

The motivation behind this project is that teaming is a core component of professional software engineering. Pretty much all software engineers work in teams, so it is essential for undergraduate computer science programs to teach students how to work effectively in a team and how to make valuable contributions despite the difficulty of splitting up a problem. That being said, there is some evidence that students may be inclined to free ride off of their peers and then receive a grade that does not correspond to their contributions - they get credit for work their teammates have done. I posit that much of the reason for this is that it is difficult to accurately identify the contributions students make to a team project, particularly as the project gets bigger and as the team gets bigger. And if teaching assistants struggle to identify what students have done, they will struggle to give students helpful feedback on their contributions and to encourage students to contribute fully to the project.

My central question - I'll get into the details in a moment - is this: if auto-generated summaries of what the various members of a team have been up to are presented to teaching assistants, can the TAs use them to get a better feel for what each student has done, and then give feedback and grades accordingly?

The context I studied was a sophomore-level Java programming course at NC State University, where I did my PhD. The course features a lecture section with four several-week projects, and an associated lab section. The lab is where students really learn the collaborative work - the teamwork - as they work in teams of three or four students on lab activities. The way the course is set up, the lab grading is already mostly automated through a collection of scripts and continuous integration platforms. The only things the TAs grade manually are student code contributions and whether the Javadoc accurately describes the code in question.
For the lab assignments, students work in teams of three or four, for three or four weeks at a stretch, at which point the teams are scrambled, and this entire process repeats three times as the students complete a total of 11 labs over the semester.

Looking at what I studied in more detail, my research questions were, first, can automated contribution summaries help TAs grade assignments more quickly, get through the process faster? This one I found basically no. Next, I considered whether the contribution summaries can help TAs grade assignments more consistently - more consistently identify what students have done than if they were unassisted. Third, I considered whether TAs prefer the grading process when they have this to assist them, and finally whether it can help provide students with better, more actionable feedback that they can use to figure out when they are sufficiently contributing to the team effort and when they are not.

To answer these research questions, I designed an algorithm that presents high-level summaries of what each person on a team has done. The way it works is, first it pulls metadata about a repository - things like commit hashes, timestamps, authors, and files changed - and stores it in a database for later use. Next, two copies of the repository are cloned, and for each of the changed files on each commit, an abstract syntax tree is built representing the file before the commit and after the commit. I know that ASTs are by definition abstract, so here is an example to make it a little more concrete. If we have a Java class that looks like this, it boils down to the abstract syntax tree on the right side. If we then make some changes to the class, adding a new field and a new method, we see corresponding changes to the abstract syntax tree. So my algorithm builds these ASTs from both revisions of the file and then tree-differences them to figure out what was added, changed, or removed between the two versions. This is repeated for every file changed on a commit and for all commits within a time period, at which point the changes are binned, or summarized, to get a high-level view of what each person has been up to.
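To make that before-and-after comparison a little more concrete, here is a minimal sketch in Java. It is not the tool described in the talk: the talk does not name a parsing library, so JavaParser and the toy Queue class are assumptions for illustration, and the sketch only compares declared fields and methods rather than performing full tree differencing of the two ASTs.

    import com.github.javaparser.StaticJavaParser;
    import com.github.javaparser.ast.CompilationUnit;
    import com.github.javaparser.ast.body.FieldDeclaration;
    import com.github.javaparser.ast.body.MethodDeclaration;

    import java.util.Set;
    import java.util.TreeSet;

    public class MemberDiff {

        // Collect a coarse set of declared members from one revision of a Java file.
        static Set<String> members(String source) {
            CompilationUnit cu = StaticJavaParser.parse(source);
            Set<String> names = new TreeSet<>();
            cu.findAll(MethodDeclaration.class)
              .forEach(m -> names.add("method " + m.getNameAsString()));
            cu.findAll(FieldDeclaration.class)
              .forEach(f -> f.getVariables()
                             .forEach(v -> names.add("field " + v.getNameAsString())));
            return names;
        }

        public static void main(String[] args) {
            // Toy "before" and "after" revisions standing in for the slide example:
            // the second revision adds a field and a method.
            String before = "class Queue { private int size; public int size() { return size; } }";
            String after  = "class Queue { private int size; private int capacity; "
                          + "public int size() { return size; } "
                          + "public boolean isFull() { return size == capacity; } }";

            Set<String> old = members(before);
            Set<String> now = members(after);

            // Anything in the new revision but not the old one was added by this commit...
            now.stream().filter(m -> !old.contains(m))
               .forEach(m -> System.out.println("added:   " + m));
            // ...and anything that disappeared was removed.
            old.stream().filter(m -> !now.contains(m))
               .forEach(m -> System.out.println("removed: " + m));
        }
    }

A full tree diff of the kind described in the talk would also catch edits inside method bodies; the per-author, per-time-window binning would then aggregate these added/changed/removed items into the summaries the TAs see.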
To evaluate this algorithm and figure out whether it helps TAs do a better job, I recruited 13 former or current computer science TAs from my peers in the PhD program, 12 of whom had prior experience grading team-based projects. In my study I tasked them with, first, grading some projects, then evaluating and reflecting on some feedback from their peers in the study, and finally reflecting on the experience - on the study and on the contribution summaries they were provided.

In the first part of the study I tasked them with grading a set of student assignments, and I used a Google Sheets spreadsheet for this to mimic the typical experience they are familiar with. In the spreadsheet they had rows corresponding to each of the repositories - each of the projects they were to grade - and columns with information about what to do: links to the automated summaries for about half of the repositories, no automated summaries for the other half, links to the GitHub repositories so they could see all of the code and project history they were to look at, and then columns for them to fill in grades and feedback for the students whose projects they were grading. They filled this out for each of the three students on the team; I've trimmed off students B and C so we only have A here, just so we can actually read things. Then in part two of the study, I asked TAs to reflect on some of the comments from their peers, choosing between pairs of comments their peers had provided based on which one they felt was more actionable, which one they could do more with.

To summarize what we learned at this point: I found that TAs grade projects much more consistently when they have the automated contribution summaries to assist them, compared to doing it entirely unassisted. Yet at the same time the consistency - the inter-rater reliability - is pretty low. I used Krippendorff's alpha as the statistical measure here, and Krippendorff argues for alpha values above 0.8 where possible; even with the contribution summaries to help them out, TAs didn't quite hit this mark. So there is clearly future work still to be done, which I'll talk about momentarily.
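For readers unfamiliar with Krippendorff's alpha, here is a small self-contained sketch of the nominal-data version of the coefficient, which is one minus the ratio of observed to expected disagreement. This is only an illustration of the measure, not the analysis script from the study, and the ratings in main are made up; it also assumes every rater rated every unit, which real studies often relax.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeSet;

    public class NominalAlpha {

        // Krippendorff's alpha for nominal categories; ratings[rater][unit], no missing data.
        static double alpha(int[][] ratings) {
            int raters = ratings.length;
            int units = ratings[0].length;

            // Index the categories that actually occur.
            Set<Integer> cats = new TreeSet<>();
            for (int[] row : ratings)
                for (int v : row) cats.add(v);
            List<Integer> catList = new ArrayList<>(cats);
            int c = catList.size();

            // Coincidence matrix: each ordered pair of ratings by different raters
            // within a unit contributes 1 / (raters - 1).
            double[][] o = new double[c][c];
            for (int u = 0; u < units; u++)
                for (int i = 0; i < raters; i++)
                    for (int j = 0; j < raters; j++)
                        if (i != j)
                            o[catList.indexOf(ratings[i][u])][catList.indexOf(ratings[j][u])]
                                    += 1.0 / (raters - 1);

            // Marginals and grand total of the coincidence matrix.
            double n = 0;
            double[] nc = new double[c];
            for (int a = 0; a < c; a++)
                for (int b = 0; b < c; b++) { nc[a] += o[a][b]; n += o[a][b]; }

            // Observed vs. expected disagreement; for nominal data any mismatch counts as 1.
            double dObs = 0, dExp = 0;
            for (int a = 0; a < c; a++)
                for (int b = 0; b < c; b++)
                    if (a != b) { dObs += o[a][b]; dExp += nc[a] * nc[b]; }

            return 1.0 - (dObs / n) / (dExp / (n * (n - 1)));
        }

        public static void main(String[] args) {
            // Three hypothetical TAs scoring five contributions: 0 = none, 1 = partial, 2 = full.
            int[][] ratings = {
                    {2, 1, 0, 2, 1},
                    {2, 1, 0, 1, 1},
                    {2, 2, 0, 1, 1},
            };
            System.out.printf("Krippendorff's alpha = %.3f%n", alpha(ratings));
        }
    }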
Next, in terms of whether the TAs would actually choose to use this or not - because I could come up with the shiniest tools in the world, but if the TAs hate them they are not of much value - I found that all of the participants preferred using the contribution summaries when grading, and 11 of the 13 strongly preferred them. Breaking things down at the feature level, I saw that the TAs found most of the features to be pretty helpful, both the simple information of just "here's a list of commits of what each person has done" and the more advanced information that came from my program analysis algorithm and shows TAs where in the project students have been involved.

Finally in terms of results, getting back to the students: I asked TAs to reflect on the feedback from their peers and let me know which feedback they thought was more helpful, more actionable, and they considered the feedback from assignments that had been graded with the automated contribution summaries to be more actionable than feedback that came from the manually graded assignments. Additionally, I found that TAs give more partial credit, as opposed to full credit or no credit, when they have the contribution summaries available, which suggests that they are better able to see nuance and identify partial contributions, as opposed to "you've done basically nothing" or "you've done a bunch."

So that's what I learned. As for the implications: despite a relatively small sample size - just over a dozen participants in one two-hour lab section where I ran the study - the lab study still showed value in the contribution summary algorithm for helping TAs identify what students have done and then give them grades and feedback accordingly. I'm doing a follow-on classroom study right now where I'm trying to see whether this feedback can actually help students do better, assignment on assignment, over the course of a semester - do students find the feedback more actionable, and do they improve more over the semester?

As for future work, there's obviously a lot more that goes into software engineering work than just Java code: Python code, JavaScript code, but of course also non-code contributions - the "everything else": the design, the project management, the discussions around the water cooler that help the team work successfully. So the open challenge remains: how do we account for everything else? I think a language-agnostic AST analysis can get us part of the way there, but there are still open questions about how to account for non-code contributions, which is what I'm planning to ponder this summer as I figure out the next steps we can take with this.
So to summarize what I did and what I learned from it: I designed an algorithm that presents high-level summaries of what students on teams - or really anyone on a team - have contributed to their project. I built it into a tool that works on Java code tracked through GitHub, and I ran a quantitative lab study demonstrating that TAs who use it are able to grade assignments more consistently and prefer the grading process, and I have tentative results suggesting the feedback is more actionable, more helpful to the students whose assignments are being graded. That's all I've got, but I would be delighted to take any questions at this point.

Great, thank you very much, Kai. First question coming in: have you thought about applying this to people doing code reviews in open source projects or in their jobs? It seems like exactly the same summaries would be useful for somebody who's about to dive into a large pull request.

Amusingly, my dad asked me the same question. He works at Salesforce and was pondering whether this could be a useful way to see the changes made on a pull request. I've considered it, but I haven't actually done any evaluation in that context. The one thing I think is somewhat dangerous here is that on larger projects, with more free-form types of contributions that folks can make beyond just the Java code, we want to make sure people aren't having numbers presented that make them look bad. Maybe they're still making great contributions - helping out the newer members of the team, doing the big architecting work - even if the amount of code they write is relatively small. So if I were going to use it in a different area, I would want to make sure we're encouraging people to use it responsibly.

Okay, and another question that's come in: do students use pair programming? Is there a way to take account of that? In London Python dojos with similar team sizes, you should try to have the least experienced programmer at the keyboard.

The answer is, it depends. Students are encouraged but not required to do pair programming. The way it works right now is that the TAs will see a list of all of the commits, and we tell students to document in the commit message if they pair programmed.
My anecdotal evidence is that they do a lousy job of this, so with limited documentation there's not a lot we can see. But I have some planned work to account for these contributions, at least where they can be discerned from commit messages - I haven't figured out exactly what the numbers would look like or how to weight things. I have some work in progress where I'm doing fuzzy matching on names in commit messages and using that to figure out whether someone else was involved in a pair programming effort, so we can account for that under their contributions too.
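As a rough illustration of that idea, here is a minimal sketch of crediting a commit to a second student based on its message. The roster, the names, and the very loose containment-style matching rule are all hypothetical; the work in progress described above may match or weight names quite differently.

    import java.util.List;
    import java.util.Locale;

    public class PairCredit {

        // Very loose "fuzzy" match: case-insensitive full-name or first-name containment.
        static boolean mentions(String commitMessage, String rosterName) {
            String msg = commitMessage.toLowerCase(Locale.ROOT);
            String full = rosterName.toLowerCase(Locale.ROOT);
            String first = full.split("\\s+")[0];
            return msg.contains(full) || msg.contains(first);
        }

        // Other roster members mentioned in the message, i.e. likely pair-programming partners.
        static List<String> coAuthors(String commitMessage, String committer, List<String> roster) {
            return roster.stream()
                    .filter(name -> !name.equals(committer))
                    .filter(name -> mentions(commitMessage, name))
                    .toList();
        }

        public static void main(String[] args) {
            List<String> roster = List.of("Ada Lovelace", "Grace Hopper", "Alan Turing");
            String msg = "Implement queue iterator (pair programmed with Grace)";
            // Prints [Grace Hopper], so the commit can also count toward her contributions.
            System.out.println(coAuthors(msg, "Ada Lovelace", roster));
        }
    }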