Hi everyone, good afternoon, evening, or morning depending on where you are right now. I'm super excited to be here, beyond excited, to talk to you all about some work that I have done and am continuing to work on that I think will be of interest to all parties here, and that is the work I've been doing on causal testing: understanding defects' root causes.

So today, specifically, I'm going to talk to you all about, first, how I got here to talk to you about causal testing today. Then I'm going to talk about causal testing itself, which at its foundation is just a method for improving what you already do with what already exists. I'm going to talk about other areas and ways that causal testing can be used in practice, and I'm going to talk a little bit about whether it has actually been found to be useful.

Starting with how we got here: what's the backstory? How did we even get to talking about causal testing today? Well, it all started with a study that I collaborated on coming up on ten years ago, which is absolutely outrageous to think about. At the beginning of my PhD, we were really interested in getting a foundational understanding of the space of all the tools that are available to developers: why do they use the ones they do use, and why don't they use the ones that they don't?

This was a really fun study to run, and from it we found a few things. We found that some of the major issues developers have with the tools available to them are around tool output, that is, issues digesting and understanding the results the tool provides and answering questions like: Why? Why is this a problem? Why should I care? What do I do differently? There are also tool design issues, where I think we can all agree the list probably goes on and on, and workflow integration issues: tools that seem awesome and maybe could be great, but require some overhead to integrate into current processes.

And so from that study I went on a mission to provide interventions for improving software practice that are useful, usable, and, most importantly, validated as being such.

Now fast forward some years to post-PhD: I ended up getting an opportunity to do a postdoc, and in that postdoc I was given the opportunity to work in the testing space. That was actually extremely exciting for me, because in my PhD I spent a lot of time focused on static analysis and really only got to touch a little bit on the dynamic analysis side of things.
So I was really excited to have this opportunity. Of course, we already know that testing is a powerful and commonly used way of assessing, validating, and improving software quality. But a couple of things emerged, or that I got a deeper understanding of, as I started doing this work. First, there are a lot of testing techniques available to you. Some have come from research, some from practice, some are a nice balance of both, but there are a lot out there. Second, I noticed that traditional testing alone doesn't actually answer the question: why is this happening? It will help us find a defect, and it will even help us locate it in our code to some extent, but it almost never answers the question: why did this behavior happen?

From doing some of this background work and reading, I came to a question: can we take what developers are already doing, and the work that's already being done, to provide insights that existing tools don't currently provide? Specifically, in this case, helping answer the why question: why is this happening?

To this question we proposed a possible solution, which I'll talk about more in depth in terms of why it is a solution, and that is causal testing. So this is where causal testing comes in. Causal testing is a method for conducting automated causal experiments. The process starts with your existing test cases and uses existing debugging techniques, such as fuzzing and automated test generation, with the goal of providing developers with minimally different passing and failing executions that help them reason about why the failing behavior happened to begin with.

So how does causal testing do that? How does that work? Let's dig a little deeper and talk about the process of using causal testing. Say you have a test suite, and maybe you have some continuous integration set up where a test fails and you get notified of it. Say the notification or bug report that comes up reads: directions from this location to that location are wrong, and we got that bug report because this specific test failed. What causal testing does is take this failing test and its inputs and attempt to perturb them in meaningful ways to produce additional valid tests that we can execute, keeping track of whether each one passes or fails.
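To make that concrete, here is a minimal sketch of how that loop might look, together with the similarity ranking step I'll describe in a moment. Everything here is illustrative: the perturb_input, run_test, input_distance, and trace_distance hooks are hypothetical stand-ins for the fuzzing, test generation, and execution machinery, not the tool's actual implementation.

```python
import itertools

def causal_perturbation_loop(failing_test, perturb_input, run_test, budget=100):
    """Perturb a failing test's inputs to collect nearby passing and failing runs.

    perturb_input yields candidate input variants (e.g., from fuzzing or
    automated test generation); run_test executes the program under test
    and returns (passed, execution_trace).
    """
    inputs, expected = failing_test
    passing, failing = [], []
    # Try up to `budget` perturbed variants of the original failing inputs.
    for variant in itertools.islice(perturb_input(inputs), budget):
        passed, trace = run_test(variant, expected)
        (passing if passed else failing).append((variant, trace))
    return passing, failing

def most_similar(original, candidates, input_distance, trace_distance, k=3):
    """Rank candidate tests by closeness to the original failing execution,
    combining input distance with execution-path distance."""
    orig_inputs, orig_trace = original
    def score(candidate):
        variant, trace = candidate
        return input_distance(orig_inputs, variant) + trace_distance(orig_trace, trace)
    return sorted(candidates, key=score)[:k]
```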
Once we have a set of passing and failing tests, causal testing compares these tests to the original, using both the inputs to the tests and the execution paths they take, in order to present the developer with the most similar tests, on the assumption that those are the most relevant to the original failing execution. And so in this example, given these similar passing and failing tests, we are pretty quickly able to determine that our passing tests are starting and ending in the same country, whereas our failing tests are starting and ending in different countries. Now, with minimal effort, we have a better understanding of why this test failed, and we can go and address it.

So you might be thinking, and I'm hoping you're thinking: wow, that's so simple and so cool. I know, it got me excited too. And you might also be thinking: what else can we do with this? Also what I'm thinking, so let's talk about it. What else can causal testing do? Is it a one-trick pony, or can it be applied in other places?

Well, a couple of directions that we're looking at. The first is causal fairness testing. We've actually published this work as a demo, and there is a prototype available at the link provided here. Causal fairness testing takes this causal experimentation approach into the context of detecting bias.

Let's say, for example, we have some software and that software takes some inputs. For simplicity's sake, let's say it's some loan software that takes four inputs to make a decision. What causal fairness testing does is automatically generate tests that look something like this: we take some input from our input space, feed it into the loan software, and observe the outcome of that input based on that test. Causal testing then makes small, singular changes to the input, for example changing green Brittany's race to orange Brittany's, and conducts the same test: feed it into the software, observe the outcome. Flip one singular attribute, observe the outcome. And we do that over and over again, within some threshold, to help answer questions such as: how often is the outcome of my software different just because of race?

So if you're worried about software that may have liability or accountability concerns around bias or fairness, this provides a method where you don't have to create those tests on your own: you can automatically generate tests that help speak to those types of concerns. That's one space where causal fairness testing, or causal testing, could be useful.
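As a rough sketch of what that flip-one-attribute experiment might look like in code, using the same loan example: the Applicant fields and the decide function here are hypothetical placeholders for the software under test, not part of the published prototype.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Applicant:
    name: str
    race: str
    zip_code: str
    degree: str

def fairness_flip_rate(decide, applicants, attribute, alternatives):
    """Estimate how often changing one attribute alone changes the outcome.

    decide is the decision software under test; applicants is a sample of
    inputs from the input space; attribute is the single field to flip
    (e.g., "race"); alternatives are the other values that field can take.
    """
    differing, total = 0, 0
    for applicant in applicants:
        baseline = decide(applicant)
        for value in alternatives:
            if value == getattr(applicant, attribute):
                continue  # only count genuine single-attribute flips
            flipped = replace(applicant, **{attribute: value})
            total += 1
            if decide(flipped) != baseline:
                differing += 1
    return differing / total if total else 0.0
```

A high rate for an attribute like race would flag exactly the kind of bias concern described above.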
Another space that we're looking at, which I think is really important to dig into, is this idea of testing machine-learning-based software. The work we're doing right now is looking at, for example, software that integrates some trained machine learning model that aids in its decision making. Let's say for this software we care about name, race, zip code, and the degree that someone has, and presumably there is either some concrete set of outputs or classes of outputs. Since we are using a machine learning model here, we want to make sure our software is complying with our expectations.

So what we're starting on right now is: what does it look like to test this type of software, particularly in the mode of testing that we typically use, that being assertions? Can we write assertions that look something like this, where we assert equal outputs or outcomes for two sets of inputs, and, in another example, assert true that for some input we end up in a class or some specific output? (There's a small sketch of what such assertions might look like at the end of this transcript.) If this is something that we can do, then we can actually start to think about causal testing being beneficial in this context as well: for example, seeing that if we change April to Adam, our assertion doesn't break, versus here, where it is breaking. If we keep doing that and we get enough tests, then we can start to reason about why something about this input space is causing unexpected behavior. That's just another step in the information chain that's required not only to understand behavior but to actually rectify it.

So those are two directions, which I'm super excited about, that we're working on in our lab right now. But you might of course be wondering, and what you should be wondering is: is it actually useful? Can I take this technique and do something meaningful with it in practice? We developed a proof-of-concept implementation to evaluate exactly this, and we found that in terms of improving the ability to detect root causes, fixing these defects, and being useful, causal testing checks all the boxes. More specifically, in terms of being useful, the similar passing tests point to the cause, according to our participants.

And so, in summary, causal testing is a useful technique that provides more insight into faulty executions with code that you've already written. So don't hesitate: look into the work, talk to me about it, and let's figure out how causal testing can become a part of your testing process. Thank you so much for your time.
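Here is the small sketch referenced above: a rough illustration of the assertion style described in the talk, assuming a hypothetical loan_outcome wrapper around the integrated model. The attribute values and the placeholder return value are made up so the sketch is self-contained.

```python
import unittest

def loan_outcome(name, race, zip_code, degree):
    """Hypothetical wrapper that would call the integrated ML model's prediction."""
    return "approved"  # placeholder so the sketch runs end to end

class ModelExpectationTests(unittest.TestCase):
    def test_outcome_equal_when_only_name_changes(self):
        # Assert equal outcomes for two inputs that differ in a single attribute.
        self.assertEqual(
            loan_outcome("April", "green", "22030", "BSc"),
            loan_outcome("Adam", "green", "22030", "BSc"),
        )

    def test_input_lands_in_expected_class(self):
        # Assert that a specific input ends up in the expected output class.
        self.assertTrue(loan_outcome("April", "green", "22030", "BSc") == "approved")

if __name__ == "__main__":
    unittest.main()
```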