0:00:02.880,0:00:08.820 Hello, my name is Gina Bai, I am an assistant professor of the practice at Vanderbilt and 0:00:08.820,0:00:15.360 I am excited to be here to talk about how novice testers perceive and perform testing. 0:00:16.440,0:00:21.120 We will be focusing on unit testing, which is the most basic level of software testing. 0:00:21.660,0:00:29.700 So to begin with I would like you to make a guess - you can share that on Slack - so how much did 0:00:29.700,0:00:33.640 poor software quality cost the U.S. in 2022? [pause] 0:00:45.180,0:00:47.280 Make a brave guess. 0:00:52.200,0:01:01.020 $1.4 billion? It's actually at least $2.41 trillion. 0:01:01.740,0:01:06.240 That's about 10% of the U.S. GDP for that year. 0:01:07.560,0:01:12.240 I know this is horrible, right, and I'm sure this will raise the awareness of 0:01:12.240,0:01:18.120 testing as a critical engineering activity, as well as the awareness that our developers 0:01:18.120,0:01:22.740 and of course our testers need to be able to perform testing well. 0:01:22.740,0:01:27.600 However, Titus Winters, who's a principal software engineer at Google, 0:01:28.380,0:01:33.240 pointed out that most of their new grad hires unfortunately have 0:01:33.240,0:01:37.620 very limited experience with testing. So testing skills and knowledge are one 0:01:37.620,0:01:44.220 of the gaps between CS education programs and industry expectations of graduating students. 0:01:44.820,0:01:50.760 It is an open question for all of us how to enhance the testing skills of students, 0:01:50.760,0:01:54.060 new hires, or let's say novices. 0:01:54.060,0:01:59.580 So to do so the first step is to learn how novices perceive and perform testing. 0:02:00.780,0:02:08.220 Most novices see no difference between testing and debugging, or think that the purpose of 0:02:08.220,0:02:15.360 testing is to show the correctness of the program. 
That is because in most CS education programs - especially 0:02:15.360,0:02:18.720 the undergraduate programs - students are usually expected 0:02:18.720,0:02:24.600 to implement their programs given the description: make sure they compile, they run, they pass all 0:02:24.600,0:02:29.940 the test cases provided by the instructors. And in some courses students are also expected 0:02:29.940,0:02:36.240 to write their own tests - to test their own programs or programs implemented by their peers. 0:02:36.240,0:02:40.380 But in general if all test cases pass that usually means that's awesome, the 0:02:40.380,0:02:43.380 program is ready for submission. If not, it's time to debug. 0:02:43.380,0:02:49.380 So I would say it's not surprising at all to see that novices have a blurred 0:02:49.380,0:02:55.260 conceptual line between testing and debugging. But the question is can we say a program 0:02:55.260,0:03:00.660 is 100% correct or perfect if it passes all the test cases? 0:03:00.660,0:03:06.240 Could it be the case that the testing is just not effective enough to 0:03:06.240,0:03:09.660 reveal the failure, or could it be the case that some 0:03:09.660,0:03:15.000 of the tests themselves are just not correctly designed and implemented? 0:03:16.200,0:03:22.440 So what we are trying to do is to train the novices - to help them build the tester's mindset, 0:03:22.440,0:03:26.880 and to be able to figure out the testing scenarios, 0:03:26.880,0:03:30.900 especially the corner cases that may break the program. 0:03:30.900,0:03:34.980 So that's the level two thinking. And of course we want to help the 0:03:34.980,0:03:40.200 novices to eventually get to level three - to realize that testing can only show the presence 0:03:40.200,0:03:45.240 of failures but not their absence. We test the programs - we test the 0:03:46.080,0:03:51.780 software to reduce the risk of it causing bad consequences or even catastrophes. 
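[Editor's illustration, not part of the talk.] The point that passing tests cannot show the absence of failures can be sketched with a small example. Everything here is hypothetical: `is_leap_year` is an invented function with a deliberate bug, and the case lists are made up for illustration.

```python
def is_leap_year(year):
    """Hypothetical function under test. Deliberate bug: it ignores
    the century rule, so it wrongly treats 1900 as a leap year."""
    return year % 4 == 0


def failing_cases(cases):
    """Return (input, expected, actual) for every case the function gets wrong."""
    failures = []
    for year, expected in cases:
        actual = is_leap_year(year)
        if actual != expected:
            failures.append((year, expected, actual))
    return failures


# Happy-path cases: all of them pass even though the bug is present.
happy_path = [(2024, True), (2023, False), (2020, True)]

# Corner cases taken from the specification (the century rule).
corner_cases = [(1900, False), (2000, True)]

print(failing_cases(happy_path))    # []
print(failing_cases(corner_cases))  # [(1900, False, True)]
```

An all-green happy-path run says only that those inputs behave as expected. It took one corner case to reveal the failure.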
0:03:51.780,0:04:03.900 And let the tester's mindset actually guide us to better design and develop the software. 0:04:03.900,0:04:09.960 So that's how novices perceive testing, and let's see how they practice unit testing. 0:04:09.960,0:04:14.700 We conducted several studies exploring their testing behaviors and performance 0:04:14.700,0:04:19.920 and here I'll present several representative questions from the novices. 0:04:21.060,0:04:25.800 So the first one: Amy had a question that is typical when novices are moving 0:04:25.800,0:04:31.500 up to level two from level one. She was not sure how to interpret 0:04:31.500,0:04:35.280 and handle a test that failed. Is it okay to have a failing test? 0:04:35.280,0:04:39.780 Does it mean a bug in the source code? Or is it a mistake in the test code? 0:04:39.780,0:04:46.740 We observed several cases in which the participants - the novices - 0:04:47.580,0:04:51.780 even when they were informed that there was at least one bug in the code, 0:04:51.780,0:04:55.680 still trusted the code over the program specifications 0:04:55.680,0:04:59.820 and tried to cover that source code regardless of its correctness. 0:05:01.080,0:05:07.500 And Bob needed guidance on when to stop testing. Is it determined by the number of test cases, 0:05:07.500,0:05:11.580 should it be the case that everything needs to be tested, 0:05:11.580,0:05:14.040 and how to make sure everything is tested? 0:05:14.040,0:05:17.520 Can we stop testing after, for example, finding one bug? 0:05:17.520,0:05:25.320 This question can actually be partially answered by 0:05:25.320,0:05:30.360 consulting coverage tools like EclEmma, which tells the user whether the 0:05:30.360,0:05:34.080 test cases are actually covering the source code and how much of the code is exercised. 0:05:34.080,0:05:40.200 But we observed no adoption of tools like these in our studies. 
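[Editor's illustration, not part of the talk.] To make concrete what a coverage tool reports, here is a toy line-coverage sketch built on Python's standard `sys.settrace` hook. It is not how EclEmma works internally; `classify` and `executed_lines` are hypothetical names invented for this example.

```python
import sys


def classify(n):
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def executed_lines(func, *args):
    """Run func(*args) and return the set of executed line offsets,
    counted from the function's `def` line (offset 0)."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Record only 'line' events that belong to the function under test.
        if frame.f_code is code and event == "line":
            hit.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)  # always restore the earlier trace hook
    return hit


# A single happy-path input exercises only three of the five body lines.
only_positive = executed_lines(classify, 5)
# Adding the other two input classes exercises every line.
all_inputs = only_positive | executed_lines(classify, -1) | executed_lines(classify, 0)
print(sorted(only_positive), sorted(all_inputs))
```

Lines that never appear in the set are code no test has exercised. That gives a partial answer to "when can I stop?", although full line coverage still does not prove correctness.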
0:05:40.200,0:05:46.560 This also suggests a lack of exposure to testing tools for novices. 0:05:47.340,0:05:53.160 At the same time we also observed some extreme cases in which the novices wrote 0:05:53.160,0:05:59.220 dozens of tests for just one single method and all of them were testing the happy path. 0:06:01.680,0:06:05.940 Charlie had difficulty in reusing the code examples from online resources 0:06:05.940,0:06:10.380 and Daniel could not figure out how to actually implement a test to 0:06:10.380,0:06:14.160 indicate the existence of the bug. So in our study we also found several 0:06:14.160,0:06:20.940 novices who were able to identify and design the unhappy path test cases 0:06:20.940,0:06:26.460 but they ended up deleting all the test cases only because they could not figure out 0:06:26.460,0:06:29.160 the correct syntax, for example to assert 0:06:29.160,0:06:35.520 that an exception is thrown, so they just gave up. But in general novices found it challenging 0:06:35.520,0:06:40.680 to determine what to test and how to test. They have no consensus on what makes a unit test 0:06:40.680,0:06:46.800 good and hence novices find it challenging to determine when to stop testing 0:06:46.800,0:06:50.760 and they tend to only test the happy path. 0:06:50.760,0:06:56.100 Additionally novices often create test cases that mismatch the program specifications 0:06:56.100,0:07:04.380 and they face implementation barriers. This could be from the lack of 0:07:04.380,0:07:10.200 hands-on testing practice or it could be some misunderstanding of the program descriptions. 
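[Editor's illustration, not part of the talk.] The syntax barrier mentioned above, asserting that an exception is thrown, is small once seen. The sketch below uses Python's standard `unittest` module as a stand-in for whatever framework the study participants used; `average` and `AverageTest` are hypothetical names.

```python
import unittest


def average(values):
    """Hypothetical function under test: mean of a non-empty list."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


class AverageTest(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(average([2, 4, 6]), 4)

    def test_empty_input_raises(self):
        # Unhappy path: this test passes only if ValueError is raised
        # inside the `with` block.
        with self.assertRaises(ValueError):
            average([])


def run_suite():
    """Run AverageTest programmatically and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AverageTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    outcome = run_suite()
    print(outcome.testsRun, outcome.wasSuccessful())
```

With this pattern an unhappy-path test does not have to be deleted: the expected exception becomes the passing condition.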
0:07:10.740,0:07:15.660 In response to those challenges and with the consideration of cognitive load - 0:07:16.320,0:07:19.920 well, novices have to 0:07:19.920,0:07:26.160 learn using tags, new concepts, new libraries, new tools to be able to practice testing, 0:07:26.160,0:07:30.840 so we hope to keep the extra cognitive load as minimal as possible when we 0:07:30.840,0:07:34.920 introduce them to our support. So we introduce a lightweight 0:07:34.920,0:07:37.620 checklist intervention. Why checklists? 0:07:37.620,0:07:43.740 Checklists are so simple, right, and they're also super useful 0:07:43.740,0:07:49.860 in other software engineering research areas such as code review and software inspection, 0:07:49.860,0:07:54.300 and a big feature of the checklist is that it is static, 0:07:54.300,0:07:57.660 which reduces the learning curve for novice testers, 0:07:57.660,0:08:02.220 and it is lightweight enough to be transferred across classrooms, 0:08:02.220,0:08:06.840 training programs, and languages, including natural languages and programming languages. 0:08:06.840,0:08:10.020 And if we take a closer look at the checklist items, 0:08:10.020,0:08:14.760 we can see that it is separated into two levels of abstraction, 0:08:15.600,0:08:19.500 one for test cases and one for test suites. 0:08:19.500,0:08:23.820 Each abstraction level also has two sets of checklist items: 0:08:23.820,0:08:28.260 the items that they should do, representing the important required 0:08:28.260,0:08:33.060 elements, and items that they could do, representing the best testing practices. 0:08:34.500,0:08:39.060 So it contains tutorial information, it briefly introduces the use 0:08:39.060,0:08:42.360 of test class components. The checklist also provides 0:08:42.360,0:08:48.000 the testing strategies - for example equivalence class partitioning and boundary value testing. 
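[Editor's illustration, not part of the talk.] The two strategies just named can be sketched on a tiny example. The `grade` function, its 60-point threshold, and the chosen inputs are all assumptions made for illustration, not items from the actual checklist.

```python
def grade(score):
    """Hypothetical function under test: maps a 0-100 score to pass/fail."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    return "pass" if score >= 60 else "fail"


# Equivalence class partitioning: one representative input per class of
# inputs that the specification treats the same way.
PARTITIONS = {"fail": 30, "pass": 80}

# Boundary value testing: inputs at and next to the edges between classes.
BOUNDARIES = {0: "fail", 59: "fail", 60: "pass", 100: "pass"}

# Invalid class: values just outside the accepted range.
INVALID = [-1, 101]


def check_all():
    """Run every check and return the number of cases exercised."""
    for expected, score in PARTITIONS.items():
        assert grade(score) == expected
    for score, expected in BOUNDARIES.items():
        assert grade(score) == expected
    for score in INVALID:
        try:
            grade(score)
        except ValueError:
            continue  # rejected as the specification requires
        raise AssertionError(f"{score} should have been rejected")
    return len(PARTITIONS) + len(BOUNDARIES) + len(INVALID)


print(check_all())  # 8
```

Eight deliberately chosen inputs cover every class and every edge. Dozens of happy-path tests on the same method would add cases without adding information.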
0:08:48.000,0:08:54.540 We didn't explicitly name those strategies in the checklist because they're novices, 0:08:54.540,0:08:57.540 so instead we reminded them to think about those cases. 0:08:58.440,0:09:03.780 What's more, the checklist items are also designed to address the common mistakes and test smells 0:09:03.780,0:09:08.040 that are observed in prior studies as well as in the classrooms 0:09:08.040,0:09:13.200 and you can see bad naming is one of them, just like what Christian said, the naming matters. 0:09:13.200,0:09:18.360 And the checklist is definitely not a gold standard for practicing testing 0:09:18.360,0:09:24.840 but it helps the novices to write tests that are of good quality and reduce the number 0:09:24.840,0:09:33.000 of unneeded, redundant tests. And we found out the checklist works well 0:09:33.000,0:09:42.420 and it is at least as effective as a coverage tool like EclEmma for writing quality tests for novices, 0:09:42.420,0:09:46.020 which indicates that the tool support does not need to be 0:09:46.020,0:09:52.380 sophisticated, to be mature, to be effective. And we also found that the novices who have 0:09:52.380,0:09:57.900 lower prior knowledge in unit testing should benefit more from the checklist. 0:09:59.820,0:10:04.800 To summarize, most novice testers see no difference between testing and debugging, 0:10:04.800,0:10:07.740 they cannot build them, and they believe the goal 0:10:07.740,0:10:12.000 of testing is to show the correctness. We discussed the challenges that they 0:10:12.000,0:10:16.500 encountered when practicing testing and we also showed that tool support 0:10:16.500,0:10:21.960 does not need to be sophisticated to begin with - a simple checklist will sometimes do the magic. 0:10:24.480,0:10:26.580 And I'm open to questions. 
0:10:27.480,0:10:36.420 Yeah fantastic, thank you so much Gina, again another great talk, we have time for questions 0:10:36.420,0:10:40.980 though if folks do have any please feel free to be putting those in the Slack. 0:10:40.980,0:10:46.560 I'll kind of kick things off. I find this really interesting especially 0:10:46.560,0:10:50.520 since I've been working at George Mason and I've been teaching a software testing class 0:10:50.520,0:10:55.320 and reflecting on my own CS education I realized that I never had that type of education 0:10:56.160,0:10:59.040 in my - you know, explicitly and in depth, right, 0:10:59.040,0:11:01.860 with what it means to write tests, to think about tests. 0:11:01.860,0:11:06.360 Do you think that part of the problem that led to this research is that we're not 0:11:06.360,0:11:10.500 focusing on that enough at a deeper level and do you think that interventions like 0:11:10.500,0:11:13.740 this could also support how we teach testing for example? 0:11:15.840,0:11:20.100 Well, that's my work so I am probably biased, 0:11:20.100,0:11:25.080 but the reason why I started to work on testing education is that I personally 0:11:25.080,0:11:28.080 didn't receive any testing education, like, formal testing education, 0:11:28.080,0:11:33.360 when I was an undergrad. I didn't see testing until grad school, doing my PhD work. 0:11:33.360,0:11:41.040 So I feel it is important for us to at least try something that's not that hard to adopt for both 0:11:41.040,0:11:47.280 students and professors to apply, to help them to teach testing, 0:11:47.280,0:11:54.120 help them to learn testing. 
And this is just again because not many professors 0:11:54.120,0:11:58.980 had a formal education or background in testing so they probably 0:12:00.000,0:12:07.140 won't touch that deep level of testing, so again in most cases the practice is to 0:12:07.140,0:12:10.080 just have the students run their program against the 0:12:10.080,0:12:15.660 test cases provided by the instructors and sometimes the test cases provided 0:12:15.660,0:12:19.680 by the instructors are not even good enough. So absolutely, 0:12:21.780,0:12:25.620 I feel like the checklist may work as a good, 0:12:25.620,0:12:31.440 at least a minimal threshold of what your tests need to cover.