0:00:02.880,0:00:08.820
Hello, my name is Gina Bai. I am an assistant professor of the practice at Vanderbilt, and

0:00:08.820,0:00:15.360
I am excited to be here to talk about how novice testers perceive and perform testing.

0:00:16.440,0:00:21.120
We will be focusing on unit testing which is the most basic level of software testing.

0:00:21.660,0:00:29.700
So to begin with I would like you to make a guess - you can share that on Slack - so how much does

0:00:29.700,0:00:33.640
poor software quality cost the U.S. in 2022? [pause]

0:00:45.180,0:00:47.280
Make a great - brave guess.

0:00:52.200,0:01:01.020
$1.4 billion? It's actually at least $2.41 trillion.

0:01:01.740,0:01:06.240
That's about 10% of the GDP of that year.

0:01:07.560,0:01:12.240
I know this is horrible, right, and I'm sure this will raise the awareness of

0:01:12.240,0:01:18.120
testing as a critical engineering activity, as well as the awareness that our developers

0:01:18.120,0:01:22.740
and of course our testers need to be able to perform testing well.

0:01:22.740,0:01:27.600
However, Titus Winters, who's a principal software engineer at Google,

0:01:28.380,0:01:33.240
pointed out that most of their new grad hires unfortunately have

0:01:33.240,0:01:37.620
very limited experience with testing. So testing skills and knowledge are one

0:01:37.620,0:01:44.220
of the gaps between CS education programs and industry expectations of graduating students.

0:01:44.820,0:01:50.760
It is an open question for all of us how to establish - how to enhance the students',

0:01:50.760,0:01:54.060
the new hires', or let's say the novices' testing skills.

0:01:54.060,0:01:59.580
So to do so the first step is to learn how novices perceive and perform testing.

0:02:00.780,0:02:08.220
Most novices see no difference between testing and debugging, or think that the purpose of

0:02:08.220,0:02:15.360
testing is to show the correctness of the program. Because in most CS education programs - especially

0:02:15.360,0:02:18.720
the undergraduate programs - students are usually expected

0:02:18.720,0:02:24.600
to implement their programs given the description: make sure they compile, they run, they pass all

0:02:24.600,0:02:29.940
the test cases provided by the instructors. And in some courses students are also expected

0:02:29.940,0:02:36.240
to write their own tests - to test their own programs or programs implemented by their peers.

0:02:36.240,0:02:40.380
But in general, if all test cases pass, that usually means: awesome, the

0:02:40.380,0:02:43.380
program is ready for submission. If not it's time to debug.

0:02:43.380,0:02:49.380
So I would say it's not surprising at all to see that novices feel like - they have a blurred

0:02:49.380,0:02:55.260
conceptual line between testing and debugging. But the question is can we say a program

0:02:55.260,0:03:00.660
is 100% correct, or that it is perfect, if it passes all the test cases?

0:03:00.660,0:03:06.240
Could it be the case that the testing is just not effective enough to

0:03:06.240,0:03:09.660
capture - to reveal the failure, or could it be the case that some

0:03:09.660,0:03:15.000
of the tests themselves are just not correctly designed and implemented?

0:03:16.200,0:03:22.440
So what we are trying to do is to train the novices - to help them build the tester's mindset,

0:03:22.440,0:03:26.880
and to be able to identify - to figure out the testing scenarios,

0:03:26.880,0:03:30.900
especially the corner cases that may break the program.

0:03:30.900,0:03:34.980
So that's the level two thinking. And of course we want to help the

0:03:34.980,0:03:40.200
novices to eventually get to level three - to realize that testing can only show the presence

0:03:40.200,0:03:45.240
of the failures but not their absence. We test the programs - we test the

0:03:46.080,0:03:51.780
software to reduce their risk of causing bad consequences or even catastrophes.

0:03:51.780,0:04:03.900
And let testers' mindset actually guide us to better design and develop the software.

0:04:03.900,0:04:09.960
So that's how novices perceive testing, and let's see how they practice unit testing.

0:04:09.960,0:04:14.700
We conducted several studies exploring their testing behaviors and performance

0:04:14.700,0:04:19.920
and here I'll present several representative questions from the novices.

0:04:21.060,0:04:25.800
So the first one: Amy had a typical question when the novices are going

0:04:25.800,0:04:31.500
up to level two from level one. She was not sure how to interpret

0:04:31.500,0:04:35.280
and handle a test that failed. Is it okay to have a failing test?

0:04:35.280,0:04:39.780
Does it mean a bug in the source code? Or is it a mistake in the test code?

0:04:39.780,0:04:46.740
We observed several cases in which the participants - the novices - even though,

0:04:47.580,0:04:51.780
even when they were informed that there was at least one bug in the code

0:04:51.780,0:04:55.680
they still trusted the code over the program specifications

0:04:55.680,0:04:59.820
and they tried to cover that source code regardless of its correctness.

0:05:01.080,0:05:07.500
And Bob needed guidance on when to stop testing. Is it determined by the number of test cases,

0:05:07.500,0:05:11.580
could it be - should it be the case that everything needs to be tested,

0:05:11.580,0:05:14.040
and how to make sure everything is tested?

0:05:14.040,0:05:17.520
Can we stop testing after, for example, finding one bug?

0:05:17.520,0:05:25.320
This can actually be like - this question can actually be partially solved by

0:05:25.320,0:05:30.360
consulting coverage tools like EclEmma which tells you - the user whether the

0:05:30.360,0:05:34.080
test cases are actually covering the source code and how much of the code is exercised.

0:05:34.080,0:05:40.200
But we observed no adoption of tools like these in our studies.

0:05:40.200,0:05:46.560
This also suggests the lack of exposure to testing tools for novices.

0:05:47.340,0:05:53.160
At the same time we also observe some extreme cases in which the novices wrote

0:05:53.160,0:05:59.220
dozens of tests for just one single method and all of them were testing the happy path.

0:06:01.680,0:06:05.940
Charlie had difficulty in reusing the code examples from online resources

0:06:05.940,0:06:10.380
and Daniel could not figure out how to actually implement a test to

0:06:10.380,0:06:14.160
indicate the existence of the bug. So in our study we also found several

0:06:14.160,0:06:20.940
novices who were able to identify and design the unhappy-path test cases

0:06:20.940,0:06:26.460
but they ended up deleting all the test cases only because they could not figure out

0:06:26.460,0:06:29.160
the correct syntax, for example to assert

0:06:29.160,0:06:35.520
an exception is thrown so they just gave up. But in general novices found it challenging

0:06:35.520,0:06:40.680
to determine what to test and how to test. They have no consensus on what makes a unit test

0:06:40.680,0:06:46.800
good and hence novices find it challenging to determine when to stop testing

0:06:46.800,0:06:50.760
and they tend to only - to only test the happy path.
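
As an illustration of the syntax barrier mentioned above, this is roughly what asserting that an exception is thrown looks like in Python's `unittest`. The function `withdraw` is a hypothetical example, and the talk does not say which test framework the participants used.

```python
import unittest


def withdraw(balance: int, amount: int) -> int:
    """Hypothetical function under test: rejects withdrawals larger than the balance."""
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


class WithdrawTest(unittest.TestCase):
    def test_happy_path(self):
        self.assertEqual(withdraw(100, 30), 70)

    def test_overdraw_raises(self):
        # Unhappy path: the test passes only if the block raises ValueError.
        with self.assertRaises(ValueError):
            withdraw(100, 130)
```

Saved to a file, this can be run with `python -m unittest`. For Java projects, JUnit 5 offers the analogous `assertThrows` assertion.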

0:06:50.760,0:06:56.100
Additionally novices often create test cases that mismatch the program specifications

0:06:56.100,0:07:04.380
and they face implementation barriers. This could be from their - the lack of

0:07:04.380,0:07:10.200
hands-on testing practices or it could be some misunderstanding of the program descriptions.

0:07:10.740,0:07:15.660
But in response to those challenges and with the consideration of cognitive load,

0:07:16.320,0:07:19.920
well, for novices they have to

0:07:19.920,0:07:26.160
learn to use tags, new concepts, new libraries, new tools to be able to practice testing

0:07:26.160,0:07:30.840
so we hope to keep the extra cognitive load as minimal as possible when we

0:07:30.840,0:07:34.920
introduce them to our support. So we introduce a lightweight

0:07:34.920,0:07:37.620
checklist intervention. Why checklists?

0:07:37.620,0:07:43.740
Since checklists were able to - they're so simple, right, and they're also super useful

0:07:43.740,0:07:49.860
in other software engineering research areas such as code review and software inspection

0:07:49.860,0:07:54.300
and a big feature of the checklist is that it is static

0:07:54.300,0:07:57.660
which reduces the learning curve for novice testers

0:07:57.660,0:08:02.220
and it is lightweight enough to be transferred across classrooms,

0:08:02.220,0:08:06.840
training programs, and languages, including natural languages and programming languages.

0:08:06.840,0:08:10.020
And if we take a closer look at the checklist items,

0:08:10.020,0:08:14.760
we can see that it is separated into two levels of abstraction,

0:08:15.600,0:08:19.500
one for test cases and one for test suites.

0:08:19.500,0:08:23.820
Each abstraction level also has two sets of checklist items,

0:08:23.820,0:08:28.260
the items that they should do, representing the important required

0:08:28.260,0:08:33.060
elements and items that they could do representing the best testing practices.

0:08:34.500,0:08:39.060
So it contains tutorial information, it briefly introduces the use

0:08:39.060,0:08:42.360
of test class components. The checklist also provides

0:08:42.360,0:08:48.000
the testing strategies - for example equivalence class partitioning and boundary value testing.

0:08:48.000,0:08:54.540
We didn't explicitly name those strategies in the checklist because they're novices,

0:08:54.540,0:08:57.540
so instead we reminded them to think about those cases.
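
A rough sketch of what those two strategies can look like in practice, using Python's `unittest`. The function `letter_grade` and its grade bands are hypothetical and not part of the checklist. One test picks a representative from each equivalence class, one checks both sides of every boundary, and one checks values just outside the valid range.

```python
import unittest


def letter_grade(score: int) -> str:
    """Hypothetical function under test: 0-59 F, 60-69 D, 70-79 C, 80-89 B, 90-100 A."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


class LetterGradeTest(unittest.TestCase):
    def test_one_representative_per_equivalence_class(self):
        cases = [(30, "F"), (65, "D"), (75, "C"), (85, "B"), (95, "A")]
        for score, expected in cases:
            with self.subTest(score=score):
                self.assertEqual(letter_grade(score), expected)

    def test_values_on_each_side_of_every_boundary(self):
        cases = [(0, "F"), (59, "F"), (60, "D"), (69, "D"), (70, "C"),
                 (79, "C"), (80, "B"), (89, "B"), (90, "A"), (100, "A")]
        for score, expected in cases:
            with self.subTest(score=score):
                self.assertEqual(letter_grade(score), expected)

    def test_values_just_outside_the_valid_range(self):
        for score in (-1, 101):
            with self.subTest(score=score):
                with self.assertRaises(ValueError):
                    letter_grade(score)
```

Three short tests cover every class and boundary here, in contrast to dozens of tests that all exercise the same happy path.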

0:08:58.440,0:09:03.780
What's more the checklist items are also designed to address the common mistakes and test smells

0:09:03.780,0:09:08.040
that are observed in prior studies as well as in the classrooms

0:09:08.040,0:09:13.200
and you can see bad naming is one of them, just like what Christian said, the naming matters.

0:09:13.200,0:09:18.360
And the checklist is definitely not, like, a gold standard for practicing testing

0:09:18.360,0:09:24.840
but it helps the novices to write tests that are of good quality and reduce the number

0:09:24.840,0:09:33.000
of unneeded, redundant tests. And we found out the checklist works well

0:09:33.000,0:09:42.420
and it is at least as effective as a coverage tool like EclEmma for writing quality tests for novices

0:09:42.420,0:09:46.020
which indicates that the tool support does not need to be

0:09:46.020,0:09:52.380
sophisticated or mature to be effective. And we also found that the novices who have

0:09:52.380,0:09:57.900
lower prior knowledge in unit testing should benefit more from the checklist.

0:09:59.820,0:10:04.800
To summarize, most novice testers see no difference between testing and debugging,

0:10:04.800,0:10:07.740
they cannot tell them apart, and they believe the goal

0:10:07.740,0:10:12.000
of testing is to show the correctness. We discussed the challenges that they

0:10:12.000,0:10:16.500
encountered when practicing testing and we also showed that tool support

0:10:16.500,0:10:21.960
does not need to be sophisticated to begin with - a simple checklist will sometimes do the magic.

0:10:24.480,0:10:26.580
And I'm open to questions.

0:10:27.480,0:10:36.420
Yeah fantastic, thank you so much Gina, again another great talk, we have time for questions

0:10:36.420,0:10:40.980
though if folks do have any please feel free to be putting those in the Slack.

0:10:40.980,0:10:46.560
I'll kind of kick things off. I find this really interesting, especially

0:10:46.560,0:10:50.520
since I've been working at George Mason and I've been teaching a software testing class

0:10:50.520,0:10:55.320
and reflecting on my own CS education I realized that I never had that type of education

0:10:56.160,0:10:59.040
in my - you know, explicitly and in depth, right,

0:10:59.040,0:11:01.860
with what it means to write tests, to think about tests.

0:11:01.860,0:11:06.360
Do you think that part of the problem that led to this research is that we're not

0:11:06.360,0:11:10.500
focusing on that enough at a deeper level and do you think that interventions like

0:11:10.500,0:11:13.740
this could also support how we teach testing for example?

0:11:15.840,0:11:20.100
Well, because that's my work so I am probably biased,

0:11:20.100,0:11:25.080
but the reason why I started to work on testing education is that I personally

0:11:25.080,0:11:28.080
didn't receive any testing education, like, formal testing education,

0:11:28.080,0:11:33.360
when I was an undergrad; I didn't see testing until grad school, doing my PhD work.

0:11:33.360,0:11:41.040
So I feel it is important for us to at least try something that's not that hard to adopt for both

0:11:41.040,0:11:47.280
students and professors to apply, to help them to teach testing,

0:11:47.280,0:11:54.120
help them to learn testing. And this is just again because not many professors

0:11:54.120,0:11:58.980
had a formal education or background in testing so they probably

0:12:00.000,0:12:07.140
won't touch that deep level of testing, so again, in most cases the practice is to

0:12:07.140,0:12:10.080
just have the students run it - run their program against the

0:12:10.080,0:12:15.660
test cases provided by the instructors and sometimes the test cases performed - provided

0:12:15.660,0:12:19.680
by the instructors are not even good enough. So absolutely,

0:12:21.780,0:12:25.620
I feel like the testing - the checklist may work as a good,

0:12:25.620,0:12:31.440
at least a minimal threshold of what your tests need to cover.