The world recently woke to a massive global cyber outage paralyzing major enterprises and critical services. The culprit turned out to be not a sophisticated hacker but a routine, insufficiently tested software update that the cybersecurity firm CrowdStrike had pushed out to protect Microsoft Windows users. The faulty update caused 8.5 million computer systems to crash in what has been dubbed the largest outage in history. The patch spread rapidly around the globe because of the growing dependence on cloud-based software. And while the problem was introduced automatically across the cloud, the remedy often required a manual, computer-by-computer fix.
The impact of the CrowdStrike outage was spectacularly immediate and is proving to be costly. The incident affected governments and businesses around the world, including airlines, banks, retail stores, hotels, and hospitals. Airline operations were affected for over a week, causing travelers to experience significant delays, with some finding themselves stranded for several days. The financial impacts soared, with direct losses to Fortune 500 companies estimated at $5.4 billion, and total losses potentially much higher. A predictable legal dance has begun to determine the liability of CrowdStrike and Microsoft, whose Windows users CrowdStrike was nominally supposed to protect from intruders.
The risk of incidents or accidents that broadly disrupt services or compromise the integrity and confidentiality of information is rising, and the CrowdStrike outage should be viewed not as a one-off mistake but as a harbinger of things to come.
The reason a routine software patch can cause such a problem is that we have our digital eggs in fewer, if larger, baskets now. This creates “concentration” risks. The concentration is intensified because a mere three companies—Microsoft, Amazon, and Google—control about two-thirds of the world market for cloud services. The market is estimated to approach $700 billion in 2024 and is expanding rapidly as these companies position themselves to control the AI market.
The innovative technology market has ushered in a stunningly rapid but unchecked transition from physical infrastructure to mostly virtual assets. This transition has generated numerous benefits but left critical services acutely vulnerable. Only a few years ago, cyber incidents tended to be localized to the servers maintained by companies, often at their site. An outage or hack was painful for that enterprise, but the impact (or “blast radius,” as insurers and IT professionals call it) was limited. The CrowdStrike incident, however, illustrates how a small technical problem can blow up, and it diminishes our collective confidence that we will be able to stay clear of a future virtual blast.
The CrowdStrike incident was, after all, both predictable and predicted. In a January 2024 report capping an 18-month study led by the Carnegie Endowment for International Peace, we warned that a cascading event could be triggered by technical and human errors or natural disasters, not just a malicious attack. We do not claim that our warning was prescient; on the contrary, it was painfully obvious.
Going forward, it is crucial to acknowledge that the world is now facing a new and largely hidden systemic risk. Resilience, that is, the ability to prepare for and recover from the effects of shocks and stresses, should be a focus. A good place to start would be implementing the recommendations of Carnegie’s January report, where we offered a comprehensive Cloud Resilience Framework to assess and manage risk. As part of this framework, we recommended that technology providers be more transparent about the contingency plans they have in place, which would help build trust. Technology providers should also work with customers so that customers understand their own level of resilience, given how heavily they rely on those providers even as systems and technologies become more complex and interconnected. As the CrowdStrike event illustrated, a small mistake can cascade and nearly topple the system.
Leading software providers should engage—ideally voluntarily—with external stakeholders (including customers, insurers, and governments) to conduct rigorous, scenario-based exercises. At this point, the system is so complex—with so many hidden dependencies that can cripple it—that stress-testing it is the only way to reassure us all. We encouraged technology providers to be proactive in a show of good faith to increasingly nervous policymakers; however, if the providers do not step up to increase transparency and provide reassurance, then governments should consider mandates to bring them to the table.
Post-mortem reviews of the CrowdStrike incident will identify a surfeit of lessons learned. Fingers will be pointed. Promises will be made. This step should not be skipped, but it is a question of when, not if, the next cascading event will occur. Patching holes and whacking moles won’t be enough. This is urgent. AI and cloud services are already interwoven and difficult to disentangle. But one thing is clear: as society and commerce become increasingly reliant on an AI-enmeshed cloud, the resilience of that cloud will be crucial.