One of the most damaging IT incidents of 2024 went under the microscope – with a focus on key lessons learned – during the recent ISACA Conference Europe in Dublin, Ireland.
The 24 October conference panel, “Crowdstrike Outage Aftermath – The Path Forward,” moderated by past ISACA Board Chair Rob Clyde and including panelists Ameet Jugnauth, CGEIT, CRISC, Fabrizio Papi, CISA, CDPSE, and Onatkut Varis, CISA, CISM, CRISC, reflected on the July incident in which an errant software update from cybersecurity vendor CrowdStrike resulted in widespread outages to airlines, banks, and other industries, with estimated damages approaching US$10 billion.
“It shows you the seriousness of this kind of problem,” Clyde said. “It takes your customer down. It’s one of those mistakes that is very, very serious.”
Around 8.5 million devices running Microsoft Windows were reportedly impacted by the outage. Clyde said the incident might be a good jumping-off point for organizations to consider whether they are too reliant on single vendors.
“As cyber professionals, we’re trying to consolidate vendors,” Clyde said. “Maybe this effort to overly consolidate vendors is part of our problem. For instance, consider using a couple of endpoint security vendors rather than just one. That way should one vendor’s software cause crashes, the systems running the other vendor’s software will continue to run fine.”
Jugnauth said the incident underscored the level of due diligence that practitioners need to undertake when deciding what software to leverage and for what purposes.
“There’s all the things we know we know, and there’s all the things we know we don’t know,” Jugnauth said. “There’s a whole world of unknown unknowns. In some ways that’s what makes it exciting because we’re constantly trying to reduce the unknown unknowns, but it’s a big challenge.”
One of the focal points of the discussion was the prevalence of kernel mode access for software applications, a key factor in the CrowdStrike incident. Clyde said it is important to ask vendors what level their code runs at and how much of it runs at kernel mode.
“Generally, there are ways to only do a little bit of the code that is running at the lowest level, and the rest can be running at the application level, which is not going to affect systems quite as badly should it have a bug,” Clyde said.
In addition, Clyde suggested asking questions about the testing processes and tools that vendors use, such as Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), and Interactive Application Security Testing (IAST).
Papi said that gathering deeper risk intelligence on vendors in advance of relying upon them is another important element of due diligence in avoiding major incidents, a sentiment echoed by Jugnauth, who emphasized the importance of asking the right questions when conducting third-party risk assessments.
“Be more specific, be more pointed, and it has to be specific to the risk appetite your firm has,” Jugnauth said.
As an example, the panel agreed that enterprises should ask whether the vendor uses a phased or tiered roll-out system to ensure that not all customers receive an update at the same time, as occurred with the CrowdStrike outage. If problems occur, the update process should be halted quickly and then rolled back. In addition, an enterprise may want to implement processes to test updates involving kernel-mode software before deploying them in production.
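To make the phased roll-out idea concrete, the following is a minimal Python sketch, not a description of any vendor's actual update pipeline. The names `phased_rollout`, `RolloutResult`, the ring labels, and the `max_failure_rate` threshold are all hypothetical, and the `deploy`, `healthy`, and `rollback` callbacks stand in for whatever tooling an organization really uses. The sketch pushes an update to one ring of hosts at a time, checks health after each ring, and halts and rolls back if too many hosts fail.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class RolloutResult:
    """Outcome of a phased roll-out (hypothetical structure for illustration)."""
    completed_rings: List[str] = field(default_factory=list)
    halted_at: Optional[str] = None
    rolled_back: List[str] = field(default_factory=list)


def phased_rollout(
    rings: Sequence[Tuple[str, Sequence[str]]],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    max_failure_rate: float = 0.01,  # assumed threshold, not an industry standard
) -> RolloutResult:
    """Deploy ring by ring; halt and roll back if a ring exceeds the failure rate."""
    result = RolloutResult()
    updated: List[str] = []  # every host that has received the update so far

    for ring_name, hosts in rings:
        failures = 0
        for host in hosts:
            deploy(host)
            updated.append(host)
            if not healthy(host):
                failures += 1

        if hosts and failures / len(hosts) > max_failure_rate:
            # Halt: later rings never receive the update.
            result.halted_at = ring_name
            # Roll back every updated host, most recent first.
            for host in reversed(updated):
                rollback(host)
                result.rolled_back.append(host)
            return result

        result.completed_rings.append(ring_name)

    return result


if __name__ == "__main__":
    # Illustrative data only: a small canary ring, then wider rings.
    demo_rings = [
        ("canary", ["c1", "c2"]),
        ("early", ["e1", "e2", "e3", "e4"]),
        ("broad", ["b1", "b2", "b3", "b4", "b5", "b6"]),
    ]
    failing_hosts = {"e2", "e3"}  # pretend these crash after the update

    outcome = phased_rollout(
        demo_rings,
        deploy=lambda host: None,
        healthy=lambda host: host not in failing_hosts,
        rollback=lambda host: None,
    )
    print(outcome)
```

In this sketch a failure in the second ring stops the roll-out before the largest ring is touched, which is the property the panel highlighted. Whether to roll back earlier rings that looked healthy is a design choice; the sketch rolls back everything on the assumption that a faulty update should not remain anywhere.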
In addition to examining the July outage, the panel discussed how organizations can become more resilient from an incident response perspective. Varis said that companies should do more work around validating recovery times in advance.
“You should identify your critical functions to be able to understand what should be recovered first and the efforts that you need to prioritize,” Varis said.
Clyde noted that the immediate aftermath of large-scale incidents can be chaotic, creating the possibility of compounding problems with additional mistakes. He said malicious actors can be opportunistic in these situations.
“If you just Googled for the fix (after the CrowdStrike incident), you had a decent chance of actually downloading malware onto your system,” Clyde said.
In response to a question about what causes companies and vendors to sometimes not follow their processes, Papi said, “Competing business priorities, there is no time – it’s human nature. Sometimes it’s convenience. Sometimes it’s a perfect storm.”