0:00:00.600,0:00:03.900 Thank you so much for the intro and thank you for having me here. 0:00:04.920,0:00:11.100 So yeah folks, I'm super excited to be here, my name is Ariana and I just on Friday defended 0:00:11.100,0:00:17.160 my PhD at the University of California San Diego. And so this work was possible in large part 0:00:17.160,0:00:21.060 because of a collaboration with the IT team at UCSD, 0:00:21.060,0:00:24.180 who I've been working with over the last year and a half as an embedded security 0:00:24.180,0:00:28.320 researcher within their operations team, so a big thank you to that entire team 0:00:28.320,0:00:34.140 and all the truly amazing work they do. And so generally my work has been broadly on 0:00:34.140,0:00:37.440 understanding and improving security processes through large-scale measurement, 0:00:37.440,0:00:41.400 and today I'm going to talk about the theory and practice of vulnerability remediation 0:00:41.400,0:00:46.080 and a very specific type of developer from a security lens: system administrators. 0:00:47.220,0:00:52.740 And so many organizations, especially newer ones, have moved their organizational 0:00:52.740,0:00:57.180 infrastructure into the cloud. AWS is large, GCP is large, 0:00:57.180,0:01:00.120 and a lot of organizations have started to take advantage of that 0:01:00.120,0:01:04.500 and move their physical hardware components into the cloud so they no longer need to 0:01:04.500,0:01:08.760 maintain the physical piece of hardware but things are abstracted for them. 0:01:08.760,0:01:13.920 However, not all organizations have done or can do this. 0:01:13.920,0:01:15.360 In fact, there are many 0:01:15.360,0:01:19.920 organizations that have legacy machines, or in other terms bare metal, the more 0:01:19.920,0:01:26.460 canonical term, that still exist and are critical pieces of infrastructure. 
0:01:26.460,0:01:32.220 And UCSD is one such organization, where they definitely utilize cloud services 0:01:32.220,0:01:37.860 but there's just a ton of legacy systems that still exist physically on premise. 0:01:37.860,0:01:44.400 And so the theory, in an ideal world, is that every piece of infrastructure in an organization 0:01:44.400,0:01:49.500 that is on premise is up-to-date security-wise. So you have system administrators who are the ones 0:01:49.500,0:01:55.020 generally maintaining these pieces of bare metal who make sure that every piece of software and 0:01:55.020,0:01:57.480 hardware is up to date and there are no issues. 0:01:57.480,0:02:03.420 But the reality is that these disparate physical systems can affect the safety posture of an org 0:02:03.420,0:02:05.820 and they can have a large number of vulnerabilities 0:02:05.820,0:02:11.940 that are very difficult to triage and maintain and that an attacker can ultimately utilize to get 0:02:11.940,0:02:17.940 into the system and thus the organization itself. And so this process of getting rid of 0:02:17.940,0:02:20.580 vulnerabilities is called patching or vulnerability remediation 0:02:20.580,0:02:27.300 and I'll use those terms sort of interchangeably. And patching isn't a new problem. I know there are 0:02:27.300,0:02:29.220 folks in this crowd who are probably nodding their heads, 0:02:29.220,0:02:32.700 like, yeah, it is a pain, but it persists 0:02:32.700,0:02:36.240 and there are advances that have made patching an easier process 0:02:36.240,0:02:39.720 especially for organizations or parts of organizations that have been able to 0:02:39.720,0:02:43.560 transition to cloud services, like automation, abstraction, 0:02:43.560,0:02:49.500 and the thing about a lot of these advancements is that they optimize for the machine not the human. 
0:02:49.500,0:02:54.900 And so when you're in an organization that still has legacy systems on premise 0:02:54.900,0:03:00.120 and still needs to maintain them the question that I set 0:03:00.120,0:03:03.900 out to answer was, what if we tune the process for the human in the loop? 0:03:03.900,0:03:07.260 What if we took the process and the technologies that are being employed 0:03:07.260,0:03:12.420 and examined holistically how to make this process easier for the people doing the job? 0:03:12.420,0:03:17.460 In other terms, how can we make patching a more effective process? 0:03:17.460,0:03:21.120 And so we asked this question in our organization at UCSD, 0:03:21.120,0:03:23.880 because like I said I've been working as an embedded security researcher 0:03:23.880,0:03:25.440 and this was an issue that was 0:03:25.440,0:03:27.720 continually coming up - that, oh, we, you know, 0:03:27.720,0:03:35.820 are having difficulty getting people to patch. And so in order to answer this question and 0:03:35.820,0:03:38.580 examine how we can optimize for the human in the loop 0:03:39.360,0:03:41.940 we first had to examine what was being done before. 0:03:41.940,0:03:47.460 And so I sat down with the team that was in charge of sending out these notifications 0:03:47.460,0:03:51.480 and this is an example of a notification that was sent out 0:03:51.480,0:03:58.740 to folks within the IT team at our organization. 
It was essentially a weekly report that was meant 0:03:58.740,0:04:01.260 to give these admins information, you know, 0:04:01.260,0:04:06.900 it's like - and just to read off bits of this - it says, "The systems below have active critical 0:04:06.900,0:04:11.820 or high severities, please patch within 24 hours," and then at the end of the email it listed 0:04:11.820,0:04:14.520 who's the technical contact, the host name, IP address, 0:04:15.660,0:04:20.100 and then also listed a link for how they could get more information from Qualys 0:04:20.100,0:04:24.120 which is the third party tool that our organization utilizes for vulnerability 0:04:24.120,0:04:30.420 scanning and information gathering. And looking at this there were 0:04:30.420,0:04:34.620 a couple things that stood out, especially having done a related work 0:04:34.620,0:04:37.380 search in the literature. First, 0:04:37.380,0:04:43.380 it required users to go and log in to Qualys, so it not only required them to do this additional 0:04:43.380,0:04:46.440 step but it required them to have a login to Qualys. 0:04:46.440,0:04:48.600 And if any of you have worked in a large organization, 0:04:48.600,0:04:53.760 you know that it is not always the easiest to get logins into third-party tools. 0:04:54.300,0:05:00.240 The second thing that really stood out is that the email listed the raw number 0:05:00.240,0:05:02.280 of vulnerabilities, so in this instance there was 0:05:02.280,0:05:05.040 one severity five which is critical, and eight severity four, 0:05:05.040,0:05:08.520 but it didn't list the type - it didn't give any other information. 0:05:08.520,0:05:13.620 It really relied on the system administrator having access and having 0:05:13.620,0:05:19.320 time in that moment to go log into Qualys to look up one of the sev 4s or sev 5s. 0:05:19.320,0:05:25.260 And with any third party tool there are obviously issues, like downtime, so this didn't help. 
0:05:25.860,0:05:29.520 And so what I'm trying to get at is that this old notification was not ideal. 0:05:29.520,0:05:32.940 It was a weekly notification which is great in theory, 0:05:32.940,0:05:36.900 but it did not list the vulnerabilities or additional details, it required 0:05:36.900,0:05:39.810 these system administrators - who were already juggling many jobs - 0:05:39.810,0:05:42.420 to perform extra steps to get the necessary information, 0:05:42.420,0:05:47.940 and it added friction to actually executing the patch. 0:05:47.940,0:05:50.640 And so again, working with the 0:05:50.640,0:05:54.900 security team and taking best practices from security literature and looking at what has 0:05:54.900,0:06:00.480 been done with vulnerability notification, I worked with the team to craft a new 0:06:00.480,0:06:04.260 notification in a new pipeline. And so this is the new 0:06:04.260,0:06:07.080 notification that gets sent out. And the things that I want to draw 0:06:07.080,0:06:09.000 your attention to are that, one, 0:06:09.000,0:06:14.160 each email focuses on a very specific type of vulnerability, 0:06:14.160,0:06:18.960 so instead of sending a laundry list of "here are the nine on your system" or whatever, 0:06:19.920,0:06:22.620 this focuses just on Microsoft Windows security updates. 0:06:23.460,0:06:29.340 There are instructions on how to patch the system just in case this was a new 0:06:29.340,0:06:30.900 vulnerability that they weren't aware of, 0:06:31.920,0:06:34.980 and at the end of this email, which is cut out in the screenshot, 0:06:34.980,0:06:38.100 there was a CSV that was pulled from the third party tool 0:06:38.100,0:06:42.240 that had a plethora of additional metadata, so it had the host name, the IP, 0:06:42.240,0:06:46.320 but it also had things like the full vulnerability name, 0:06:47.220,0:06:51.960 the CVE, other pieces of information that system administrators find really helpful. 
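The per-family notification with an attached CSV described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline used in the talk: the field names, the sample findings, the placeholder CVE identifiers, and the `build_notifications` helper are all assumptions made for the example, and no real scanner export format is implied.

```python
import csv
import io
from collections import defaultdict

# Hypothetical scan findings. Field names and values are illustrative only.
FINDINGS = [
    {"contact": "admin-a@example.edu", "hostname": "web01", "ip": "10.0.0.5",
     "family": "Microsoft Windows security updates",
     "vulnerability": "Example missing cumulative update",
     "cve": "CVE-0000-0001", "severity": 5},
    {"contact": "admin-a@example.edu", "hostname": "web02", "ip": "10.0.0.6",
     "family": "Microsoft Windows security updates",
     "vulnerability": "Example missing cumulative update",
     "cve": "CVE-0000-0001", "severity": 4},
    {"contact": "admin-b@example.edu", "hostname": "lab07", "ip": "10.0.1.9",
     "family": "Zoom",
     "vulnerability": "Example remote code execution",
     "cve": "CVE-0000-0002", "severity": 4},
]

CSV_FIELDS = ["hostname", "ip", "vulnerability", "cve", "severity"]


def build_notifications(findings):
    """Return one notification per (contact, vulnerability family) pair."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[(finding["contact"], finding["family"])].append(finding)

    notifications = []
    for (contact, family), rows in sorted(grouped.items()):
        # Write the per-host details into an in-memory CSV attachment.
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=CSV_FIELDS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
        notifications.append({
            "to": contact,
            "subject": f"Action needed: {family} ({len(rows)} finding(s))",
            "csv": buffer.getvalue(),
        })
    return notifications


if __name__ == "__main__":
    for note in build_notifications(FINDINGS):
        print(note["to"], "|", note["subject"])
        print(note["csv"])
```

Grouping on the contact and family together keeps each message focused on one kind of fix, while the CSV carries the host-level detail so the recipient does not have to log into the scanner to find it.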
0:06:51.960,0:06:57.060 And so for this first step, to try and address how do we make 0:06:57.060,0:06:59.880 patching a more efficient process, we examined the old notification, 0:06:59.880,0:07:03.360 proposed changes that reduce effort and time from the system administrators, 0:07:03.360,0:07:08.460 and crafted new notifications that have actionable items focused on one vulnerability 0:07:08.460,0:07:12.000 and listed all machines and vulnerability types in the attached CSV. 0:07:12.780,0:07:14.820 But like I mentioned at the beginning of this talk, 0:07:14.820,0:07:19.620 I do a lot of large-scale quantitative data analysis, 0:07:19.620,0:07:22.320 and so we didn't actually know whether these 0:07:22.320,0:07:25.740 changes were effective until we went and analyzed the subsequent data. 0:07:25.740,0:07:29.340 And so I created an in-house pipeline that can be automatically run 0:07:29.340,0:07:34.320 that takes all the pieces of information from the system administrator side 0:07:34.320,0:07:39.120 and essentially produces a series of analyses that we can break down in different ways. 0:07:40.260,0:07:43.440 And in aggregate we saw that because of these changes, 0:07:43.440,0:07:49.500 the patching rate increased from 3% to 78% which is a huge difference. 0:07:49.500,0:07:53.940 This is already a success, but the natural next question was, 0:07:53.940,0:08:00.480 "Why was the patch rate only at 78%?" It seemed like we were doing everything right, 0:08:00.480,0:08:03.840 we had looked at the related work, we were doing best practices, 0:08:03.840,0:08:09.240 and it was still not at a hundred percent. And so the beauty of data is that there are 0:08:09.240,0:08:13.500 different ways to look at and slice it. And so first, 0:08:13.500,0:08:18.060 I looked at how different contacts were patching their machines. 0:08:18.060,0:08:22.500 And we found that some contacts are just much better at patching. 
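To make the aggregate and per-contact slicing concrete, here is a small sketch of how a patch rate could be computed from notification outcomes. The record layout, the sample data, and the `patch_rate` helper are hypothetical; the talk does not describe how its in-house pipeline is implemented.

```python
from collections import defaultdict

# Hypothetical outcome records: one row per notified (host, vulnerability) pair.
RECORDS = [
    {"contact": "team-a", "family": "Zoom", "patched": True},
    {"contact": "team-a", "family": "Red Hat", "patched": False},
    {"contact": "team-b", "family": "Zoom", "patched": True},
    {"contact": "team-b", "family": "Browser", "patched": True},
]


def patch_rate(records, key=None):
    """Fraction of records patched, overall or grouped by the given field."""
    totals = defaultdict(int)
    patched = defaultdict(int)
    for record in records:
        group = record[key] if key else "all"
        totals[group] += 1
        patched[group] += int(record["patched"])
    return {group: patched[group] / totals[group] for group in totals}


if __name__ == "__main__":
    print(patch_rate(RECORDS))             # aggregate rate
    print(patch_rate(RECORDS, "contact"))  # sliced by contact
```

Because the grouping field is a parameter, the same function can slice on any other column in the records without changing the calculation.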
0:08:23.160,0:08:27.180 When we then looked at the vulnerability families, we found that certain vulnerability families get 0:08:27.180,0:08:31.140 patched more. Things like Zoom, browsers, 0:08:31.140,0:08:36.000 standalone applications were getting patched faster and at much higher rates 0:08:36.000,0:08:39.180 than things like operating system distros like Red Hat. 0:08:39.180,0:08:43.200 And the hypothesis there, which you know intuitively makes some sense, 0:08:43.200,0:08:47.820 is that standalone applications that have easier patching processes were easier to prioritize 0:08:47.820,0:08:50.820 because they don't require downtime for the system administrator. 0:08:50.820,0:08:52.680 Because again, system administrators 0:08:52.680,0:08:57.060 are juggling many jobs and many needs, including the needs of people who are 0:08:57.060,0:09:00.060 using those machines. And then finally we 0:09:00.060,0:09:02.520 also found that some vulnerability families just take more time to patch, 0:09:02.520,0:09:06.660 and so this is kind of following up from the last analysis, 0:09:06.660,0:09:10.560 which is that there were some vulnerability families, 0:09:10.560,0:09:15.120 like operating system distros, and, like, Microsoft Windows updates, 0:09:15.120,0:09:19.440 that just took more time, and again, the hypothesis is that 0:09:19.440,0:09:24.600 there is some overhead that is required there that was slowing the process down. 0:09:25.380,0:09:28.080 But at this step, you know, we took a step back, okay, 0:09:28.080,0:09:32.160 the quantitative data is telling us a lot, but we also conducted semi-structured 0:09:32.160,0:09:36.420 interviews with the system administrators because we knew them, they knew us, 0:09:36.420,0:09:39.420 to add the qualitative view to the quantitative data. 0:09:39.420,0:09:44.220 And we learned a lot in these interviews. 
And some of the high-level takeaways were that, 0:09:44.220,0:09:47.100 first off, the monotony of the old 0:09:47.100,0:09:52.800 email notification made it really easy to ignore. And the reason that we were seeing a much higher 0:09:52.800,0:09:56.220 patch rate with this new notification was because it wasn't the same thing every week. 0:09:56.220,0:09:59.640 We also found that many teams have exceptions, 0:09:59.640,0:10:03.900 and this was actually super interesting for us because it showed that 0:10:03.900,0:10:07.020 there was a discrepancy between the vulnerability remediation 0:10:07.020,0:10:10.020 notification pipeline and this exception pipeline. 0:10:10.020,0:10:14.880 There are some teams that have exceptions for various servers, various vulnerabilities, 0:10:14.880,0:10:18.540 and they thought that that was getting incorporated in the vulnerability pipeline. 0:10:18.540,0:10:22.200 And now that we know that there's a discrepancy, we are working on adding that in. 0:10:22.200,0:10:26.820 We also found that notifications fall outside of the assessment patch cycles, 0:10:26.820,0:10:29.580 you know, if we sent an email on the second Tuesday, 0:10:29.580,0:10:34.650 they hadn't gotten to patching the system yet 0:10:34.650,0:10:37.080 because they were patching on the second week of that month. 0:10:37.080,0:10:42.180 And so this added a lot of additional insight into why the patch rate was only at 78%. 0:10:42.180,0:10:46.920 And overall we found that there was very positive sentiment towards the new notification, 0:10:46.920,0:10:50.040 but there was room for improvement and better integrations. 
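One way the exception pipeline could feed into the notification pipeline is to drop findings that are covered by an approved exception before any email is built. The sketch below only illustrates that idea: the `(hostname, family)` exception key, the sample data, and the `apply_exceptions` helper are assumptions, and the talk does not say how the integration is actually being done.

```python
# Hypothetical approved exceptions, keyed by (hostname, vulnerability family).
EXCEPTIONS = {("db01", "Red Hat")}

# Hypothetical findings waiting to be turned into notifications.
PENDING = [
    {"hostname": "db01", "family": "Red Hat", "contact": "team-a"},
    {"hostname": "db01", "family": "Zoom", "contact": "team-a"},
    {"hostname": "web03", "family": "Red Hat", "contact": "team-b"},
]


def apply_exceptions(findings, exceptions):
    """Keep only the findings that are not covered by an approved exception."""
    return [
        finding for finding in findings
        if (finding["hostname"], finding["family"]) not in exceptions
    ]


if __name__ == "__main__":
    for finding in apply_exceptions(PENDING, EXCEPTIONS):
        print(finding["contact"], finding["hostname"], finding["family"])
```

Filtering at this stage would also keep excepted hosts out of the patch rate denominator, so a team is not counted as unpatched for something it was explicitly allowed to defer.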
0:10:50.040,0:10:56.640 And so while the theory is that if you do everything right then folks will just follow, 0:10:56.640,0:11:01.920 the practice is that there are these very real blockers that you need to take into account, 0:11:01.920,0:11:04.860 especially blockers that are unique to your organization. 0:11:05.760,0:11:08.700 And so in summary I looked at how we could increase 0:11:08.700,0:11:10.920 the efficacy of patching within our organization. 0:11:10.920,0:11:15.000 We applied some very basic principles to reduce friction for system administrators 0:11:15.000,0:11:18.120 and in aggregate increased the patch rate from 3% to 78% 0:11:19.380,0:11:23.400 but additionally we found that by interviewing the system administrators, 0:11:23.400,0:11:26.520 many of them had a positive sentiment towards this notification 0:11:26.520,0:11:29.820 and that there were discrepancies in different systems that we can work on 0:11:29.820,0:11:33.660 to make it even more accurate and more productive moving forward. 0:11:33.660,0:11:38.340 And with that I'm happy to take questions and I'm also happy to take questions offline at these 0:11:38.340,0:11:41.460 various pieces of online communication. Thank you so much. 0:11:43.920,0:11:51.300 Fantastic, thank you so much for a great and engaging presentation kicking off this last hour, 0:11:52.380,0:11:58.080 so again audience please make sure you're putting any questions that you have into the chat, 0:11:58.080,0:12:02.700 we have a few minutes so I am gonna kick off with a clarification question that 0:12:02.700,0:12:06.480 probably would have a pretty easy answer. 
So, like, the vulnerability 0:12:06.480,0:12:10.800 families that you mentioned, I think that's a really interesting concept obviously, 0:12:10.800,0:12:14.760 helps us think about that space, is that a direct mapping to the 0:12:14.760,0:12:18.240 kind of technology that's being built or is that kind of, like, with security 0:12:18.240,0:12:21.540 vulnerabilities where there's like ways to think about the types of security 0:12:21.540,0:12:24.480 vulnerabilities that you have regardless of the platform 0:12:24.480,0:12:27.960 or the context or domain? Yeah, really good question, 0:12:27.960,0:12:32.100 so when I say vulnerability families, it's actually kind of a mix of both. 0:12:32.100,0:12:37.440 So it is very specific security vulnerabilities but for the 0:12:37.440,0:12:43.680 given applications that were on the servers. And so you know like, Zoom for example has 0:12:44.880,0:12:49.200 various, like, RCE vulnerabilities but if a server that a system 0:12:49.200,0:12:52.380 administrator was managing didn't have Zoom we didn't notify them on that, 0:12:52.380,0:12:54.540 we only notified them on the application 0:12:54.540,0:13:01.800 and then also the type of vulnerability, and so I guess to clarify a little bit further, 0:13:01.800,0:13:08.700 the emails focused on applications and then the CSV - the thing that was helpful for sys admins, 0:13:08.700,0:13:13.680 is that we then listed in the CSV the different types of security vulnerabilities 0:13:13.680,0:13:16.500 because different teams have different threat models, 0:13:16.500,0:13:17.520 you know, some teams are like, 0:13:17.520,0:13:22.260 "We're going to prioritize X over Y," and so it's useful for them to know how many 0:13:22.260,0:13:24.600 of X versus Y there were. Absolutely.