CrowdStrike Outage - The wake-up call we needed? Or just a warning of potentially worse to come?
- Paul McRae
- Aug 12, 2024
- 6 min read
Updated: Aug 22, 2024
Will we take the opportunity to ensure that there is no repeat?
Yet again, it was an incident that happened at one of the worst possible times. It was one of Britain's busiest travel days of the year, with airports crammed full of people trying to set off on holiday after schools had broken up for the summer. Around the world, a typical Friday was under way, with businesses, hospitals, councils, governments, broadcasters and more all attempting to operate as normal as the weekend approached.
It was around 8am UK time when news outlets started to break the story, via push notifications and other means, that companies around the world were reporting an “IT outage” affecting their operations. Because of time-zone differences, the first reports of problems came from Australia, but they spread quickly as the morning progressed. Sky News went off air completely, unable to broadcast; when it returned, presenters reported that several systems, including teleprompters, were unavailable. Airports were experiencing issues with flight information systems. Airlines were reporting problems with their systems. Banks were starting to report issues, and hospitals and GP surgeries were also having difficulties. The reports kept coming.
Usually with outage incidents the cause is identified and resolved relatively quickly, often within a few hours, and the reports start to decrease as the hours tick by. Not with this outage: the reports were still increasing as we moved up to and beyond lunchtime in the UK.
At this point, very few knew what the problem was exactly, or what was causing it. “IT Outage” is a very vague description that is bandied around whenever there is disruption to systems of any kind. How many times have you heard someone say that there’s a “problem with the network” without knowing exactly what is causing the issue? Network support teams just love to hear that.
As the morning progressed, however, things started to become slightly clearer: the Blue Screen of Death (BSOD) was the symptom being reported on affected machines. Along with this, the name CrowdStrike started to appear more and more in the reporting.

Who are CrowdStrike?
Not a household name for many, unlike McAfee or Microsoft's own security products, CrowdStrike are nevertheless a huge player in the IT security space, with a near 24% share of the endpoint protection market. Co-founded in 2011 by George Kurtz (CEO), Dmitri Alperovitch (former CTO) and Gregg Marston (CFO), they are based in Austin, Texas and provide endpoint security, threat intelligence and cyberattack response services. Their annual revenue is around $3.06bn and they employ nearly 8,000 people. It was difficult not to feel sympathy for George Kurtz, their CEO and a former McAfee employee. It's fair to say he has had better days than the 19th of July, appearing in an almost continuous loop of interviews on television and other media throughout the day. Hopefully, he won't have worse days than this one.
What happened and how did this issue have such a big impact on so many systems for so many companies?
CrowdStrike own, operate and sell security products designed to keep our systems safe. Their products are installed on millions of devices around the world, and not just on Windows operating systems. However, it was only the Windows version of their endpoint protection agent, the “Falcon sensor”, that encountered the problem on this occasion. Like many providers, CrowdStrike offer a service that, if you choose it, allows them to completely manage and deploy updates to their security software on your systems without you having to think about it. If a new update is available, they push it to your systems with no action required by you. The product continually scans the systems it is installed on and, if it identifies a threat, it can automatically work to respond to and resolve it. This can work very well, is very standard practice, and many companies sign up to this service.
However, for software to manage and respond in this way, it requires full access to the kernel of the system. The kernel is the computer program at the core of an operating system and generally has complete control over everything within the system; it performs tasks such as running processes and managing hardware devices in protected kernel space. With that level of control comes risk: if something isn't right with an update, for example a flaw in its code, it can seriously affect the operating system and the device, crashing the system and resulting in a problem like the Blue Screen of Death (BSOD), the very issue which affected millions of systems on this occasion.
CrowdStrike have since released their PIR (Post Incident Review) and RCA (Root Cause Analysis) reports explaining what happened, and it transpires that, as suspected and initially reported by CrowdStrike, an error in a content release was responsible for the problem: a faulty Rapid Response Content update caused the Windows Falcon sensor to crash, taking the operating system down with it.
Consider this
There are clear and obvious reasons why problems can arise when we set up our systems and applications to be managed centrally by a provider, allowing vendors and suppliers to administer them fully through a cloud management solution rather than having them managed in-house or on-premises by the company itself.
A fully managed solution, on top of the licences required to use the products, costs money, sometimes a lot of money. For example, if a product costs £50 per licence per endpoint per year and you have 10,000 endpoints, that is £500,000 per year, every year, for one security software product, in addition to the many other costs you will incur to run your business from your hard-fought-for and carefully curated annual budget. Because of this, companies sometimes only buy licences for the endpoints they define as critical, the ones they cannot be without. Sounds sensible, doesn't it? It does, until something like the CrowdStrike issue happens and your critical, “cannot live without” devices become the devices you have to live without, because of a flaw in the latest update that was automatically pushed to you.
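To make that arithmetic concrete, here is a minimal Python sketch using the illustrative £50-per-endpoint figure above; the prices and endpoint counts are hypothetical, not real vendor pricing.

```python
# Illustrative licence-cost arithmetic only; the £50-per-endpoint figure and
# the endpoint counts are hypothetical, not real vendor pricing.

def annual_licence_cost(cost_per_endpoint_gbp: float, endpoints: int) -> float:
    """Total yearly cost of one vendor-managed product across all licensed endpoints."""
    return cost_per_endpoint_gbp * endpoints

for endpoints in (1_000, 10_000, 50_000):
    cost = annual_licence_cost(50.0, endpoints)
    print(f"{endpoints:>6,} endpoints at £50/year each -> £{cost:,.0f} per year")
```

At 10,000 endpoints the sketch lands on the £0.5m-a-year figure above, and that is for a single product; multiply by the number of managed products in your estate to see why licence scoping decisions get made.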
So what can we do, as individuals and businesses, to protect ourselves from a scenario like this occurring in the future and impacting us?
The first thing you can do is carry out a review of the products you use which are managed and updated by vendors (a simple register for capturing the answers is sketched after this list). Make sure you know and understand exactly:-
How the product is managed.
How many devices the software is installed on.
What control the vendor has over the product and your systems (for example, does the product have kernel-level access?).
How often updates are pushed to you (do you know the release schedule for planned updates?).
Whether you are notified when unscheduled releases are pushed to you (emergency updates to address critical vulnerabilities).
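One lightweight way to keep the answers to those questions in one place is a simple product register. The sketch below is hypothetical: the field names, vendor name and sample entry are mine for illustration, not a real CrowdStrike or customer configuration.

```python
# A hypothetical register of vendor-managed products, capturing the review
# questions above. Field names and the sample entry are illustrative only.
from dataclasses import dataclass

@dataclass
class ManagedProduct:
    name: str                            # product / agent name
    vendor: str                          # who manages and pushes updates
    installed_endpoints: int             # how many devices run it
    kernel_level_access: bool            # does it run with kernel-level privileges?
    planned_release_schedule: str        # e.g. "monthly, second Tuesday"
    notified_of_emergency_updates: bool  # are unscheduled pushes flagged to us?

register = [
    ManagedProduct(
        name="Endpoint protection agent",
        vendor="ExampleSecCo",          # placeholder vendor name
        installed_endpoints=10_000,
        kernel_level_access=True,
        planned_release_schedule="monthly",
        notified_of_emergency_updates=False,
    ),
]

# Flag the riskiest combination: kernel access plus unannounced emergency pushes.
for p in register:
    if p.kernel_level_access and not p.notified_of_emergency_updates:
        print(f"Review priority: {p.name} ({p.vendor}) on {p.installed_endpoints:,} endpoints")
```

A spreadsheet with the same columns does the job just as well; the point is that the kernel-access and notification answers are recorded somewhere you can query quickly when the next incident breaks.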
The second thing you can do is review your Vendor Risk Management (VRM) process. Make sure you know and understand exactly:-
The Product Release agreement that you have signed up to.
Whether you are part of (or can be part of) an initial “ring 1” test group for early-stage testing on a small number of non-critical machines.
The feedback mechanism for reporting the success or failure of initial releases.
The level of testing that the vendor completes before releasing their updates into production (and pushing them to your systems and applications).
Whether releases are staged (released in phases rather than in a big-bang push to everyone at once; a simple staged-rollout sketch follows this list).
What standards the vendor complies with, particularly in relation to the operating systems that their product runs on.
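To show what a “ring 1” test group and staged releases mean in practice, here is a minimal Python sketch of a phased rollout. The ring names, sizes and the simulated failure are made up for illustration and do not describe CrowdStrike's actual release process.

```python
# A minimal staged-rollout sketch: push an update ring by ring and stop the
# rollout as soon as a ring reports failures. Ring names and sizes are
# illustrative, not any vendor's real deployment policy.

RINGS = [
    ("ring-1 (small set of non-critical test machines)", 25),
    ("ring-2 (wider pilot)", 500),
    ("ring-3 (general population)", 9_475),
]

def staged_rollout(deploy, rings=RINGS) -> bool:
    """Deploy to each ring in turn; halt immediately if a ring fails its health checks."""
    for ring_name, endpoint_count in rings:
        ok = deploy(ring_name, endpoint_count)
        print(f"{ring_name}: {'healthy' if ok else 'FAILED'} ({endpoint_count:,} endpoints)")
        if not ok:
            print("Rollout halted; later rings never received the update.")
            return False
    print("Rollout completed across all rings.")
    return True

# Example: simulate a fault caught in ring 1, so the remaining 9,975 endpoints stay untouched.
staged_rollout(lambda ring_name, endpoint_count: not ring_name.startswith("ring-1"))
```

The design point is simply that a fault caught on 25 sacrificial machines never reaches the thousands behind them, which is exactly the protection a big-bang push gives up.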
The third thing you can do is review your Business Continuity / Disaster Recovery (BCDR) processes. Make sure you know and understand exactly:-
What could happen if an update goes wrong?
The devices the software is installed on, and whether they are critical, “cannot function without” devices.
What impact a failure of these devices would have operationally, financially and reputationally (a rough worked estimate follows this list).
How you can continue to operate if this were to happen: do you have tried and trusted manual processes you can switch to, or a DR environment you can move to?
How you can recover quickly and effectively after the event.
And, most importantly, is your leadership team aware that you operate this way (as they will need to assess and agree the risk of operating in this manner)?
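As a rough way of putting numbers against the operational and financial impact questions above, here is a hypothetical downtime-impact estimate. The device groups, hourly costs and recovery times are invented for illustration; the real exercise belongs in your BCDR documentation and should be signed off by your leadership team.

```python
# A hypothetical downtime-impact estimate for critical devices. Every figure
# and device group here is an illustrative placeholder, not a benchmark.

CRITICAL_DEVICE_GROUPS = {
    # group name: (estimated hourly cost of being down in £, estimated hours to recover)
    "check-in and booking systems": (20_000, 8),
    "payment terminals": (35_000, 6),
    "clinical appointment systems": (15_000, 12),
}

total_exposure = 0
for group, (hourly_cost_gbp, recovery_hours) in CRITICAL_DEVICE_GROUPS.items():
    impact = hourly_cost_gbp * recovery_hours
    total_exposure += impact
    print(f"{group}: ~£{impact:,} across {recovery_hours}h of downtime")

print(f"Estimated total exposure: ~£{total_exposure:,} (before reputational damage)")
```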

Wrap Up
As more and more companies move away from on-premises management of systems and applications to central, cloud-managed solutions to reduce cost and overhead, it is critical that we understand the pros and cons that come with this.
There are very few trusted, large-scale, enterprise-level providers of these services, and because of this many businesses use the same ones to manage their infrastructure, meaning the “all eggs in one basket” analogy rings true for businesses and service providers around the world. And if that basket develops a hole, a lot of eggs are going to be broken.
However, taking the right steps to understand how you are set up and what could go wrong can prepare you for this, and give you a fighting chance either to avoid any impact at all or to manage through an outage successfully and recover quickly.