Front Page

What went wrong with the Crowdstrike outage, and how could it have been avoided?

Well, the first thing that Crowdstrike could have done is behave like professionals. We have known how to go from the creation of a file to having it installed on a customer's machine without causing problems since the 1990s, and basically they did not do it. As further information continues to surface, things look progressively worse, both in how low the quality control standards were and in the company getting caught in a number of blatant lies.

So what happened?

According to Crowdstrike, someone picked up an untested template, used it to build a patch, ran a badly designed validator against it that passed it when it should not have, and then shipped it directly to 8.5 million mostly mission-critical machines without any additional testing, on Friday, July 19, 2024 at 04:09 UTC.

Bad as it looks, the truth from the independent audit looks even worse. What actually happened is that a number of bad decisions at Crowdstrike resulted in untested code being shipped directly to customers with minimal tests run as a matter of policy and bad design.

How Crowdstrike works

It works by shipping a number of different components: the core driver; the template file, which collects multiple variables together from the core driver; and the signature file, which supplies regular expressions that are run against those variables by calling the regular expression function in the core driver.
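
As a rough illustration of that split (the names, file formats, and fields below are invented, since the real internal layout is not public), the relationship between the three components looks something like this:

    import re

    # Hypothetical sketch: the "driver" exposes the regex engine, the template
    # file says which event fields to collect, and the signature file supplies
    # the patterns to run against them. No names here come from Crowdstrike.

    def driver_regex_match(pattern: str, value: str) -> bool:
        """Stand-in for the regular expression function inside the core driver."""
        return re.search(pattern, value) is not None

    template = {  # template file: which fields to collect from each event
        "fields": ["process_name", "command_line"],
    }

    signature = {  # signature file: patterns to test those fields against
        "process_name": r"badtool\.exe",
        "command_line": r"--disable-defender",
    }

    def evaluate(event: dict) -> bool:
        """Collect the template's fields and test each one against its pattern."""
        return any(
            driver_regex_match(signature[field], event.get(field, ""))
            for field in template["fields"]
            if field in signature
        )

    print(evaluate({"process_name": "badtool.exe", "command_line": ""}))  # True

The important point is that all three parts have to agree with each other, which is exactly what integration testing exists to check.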

What should happen is that with every functional change, everything runs through all the tests, every time, against every platform they deploy that change to, trying to prevent bad code from being shipped. Public statements prior to the incident by the company in general, and the CEO George Kurtz specifically, led both shareholders and customers to believe that this is what they did. This was then given further credibility by his previous job as chief technology officer at McAfee during their major incident.

What actually happened was much worse. As deliberate policy, according to Kurtz, when the template files or signature files are updated they run against some validation tests and are then shipped straight to all customers at once as a live update. Although they provide the option to run one or two versions behind the newest, every technical support department was shocked to learn after the outage that this setting only applies to the core driver, so if anything is wrong in either of the other components, you get it live patched immediately, and you cannot disable that.

As a result of this policy, the template and signature files never go through any integration testing with the core driver until they are deployed on the customers' machines. This by itself makes the claim of good practice a lie. Unfortunately it gets worse. The signature file gets tested against a mock of the template file, so in reality the template file does not get any testing at all. This also mostly invalidates the value of the tests run against the signature file, as the two never get any integration tests to make sure the parts work together.
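
An integration test for this does not need to be elaborate. Even something like the following sketch (the paths, file format, and field layout are all invented) would exercise the real template and signature files together instead of a mock:

    import json
    import unittest

    # Hypothetical integration test: load the real channel files and check that
    # they agree with each other. A real test would go further and run them
    # through the same parsing code the driver uses.

    def load_channel_file(path: str) -> dict:
        with open(path, "r") as f:
            return json.load(f)

    class ChannelUpdateIntegrationTest(unittest.TestCase):
        def test_signature_only_references_template_fields(self):
            template = load_channel_file("channel/template-291.json")
            signature = load_channel_file("channel/signature-291.json")
            missing = set(signature) - set(template["fields"])
            self.assertEqual(missing, set(), "signature references unknown fields")

    if __name__ == "__main__":
        unittest.main()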

What should then happen is that the tested template and signature files get bundled together and shipped as the channel update patch, but at some point in the release one of those files got filled with zeroes, which, due to the lack of testing, was never detected.
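
Even a trivial sanity check at the packaging step would have caught a zero-filled file. Something along these lines (a sketch, not Crowdstrike's actual tooling) is all it takes:

    import hashlib

    # Sketch of a packaging-time sanity check on a channel file. An empty or
    # zero-filled file, or one that no longer matches the copy that passed
    # testing, should stop the release immediately.

    def check_channel_file(path: str, expected_sha256: str) -> None:
        with open(path, "rb") as f:
            data = f.read()
        if len(data) == 0:
            raise RuntimeError(f"{path} is empty")
        if data.count(0) == len(data):
            raise RuntimeError(f"{path} is entirely zero bytes")
        digest = hashlib.sha256(data).hexdigest()
        if digest != expected_sha256:
            raise RuntimeError(f"{path} does not match the tested build ({digest})")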

This then interacted with some badly designed update software and a badly written driver to crash the kernel. On reboot, the fact that Crowdstrike had set the boot start flag interacted with a known Microsoft boot loop bug to put these machines into a rapid reboot cycle, which was hard to recover from on locked-down machines, which most of them were.

They started shipping a corrected patch 90 minutes later, simply by reverting the broken one, although the obvious question is why this near-instant fix took 90 minutes to discover and implement.

Could this have been prevented?

Quite simply, do some testing! Seriously though, every change should go through a continuous integration and continuous delivery pipeline, preferably finishing with a canary release process.

Here is what should have happened

The new template is created in a separate directory from the tested ones, and does not get moved in with the rest until it passes the validator program. This prevents the creation and shipping of patches which won't pass. This new template should be in a text based format, so that it works with version control software. If a GUI based app is used to modify the file, it should edit the file, not generate it, as generators can radically change the text for minor changes in configuration, thus subverting version control. If a binary format is required for deployment, it should be generated from the text based source.
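
The promotion step itself can be a few lines of tooling. A sketch (the directory names and the validator command are invented):

    import shutil
    import subprocess
    from pathlib import Path

    # Sketch of the quarantine-until-validated step: nothing reaches the
    # approved directory, and therefore the release pipeline, until the
    # validator passes it.

    STAGING = Path("templates/staging")
    APPROVED = Path("templates/approved")

    def promote(template_name: str) -> None:
        candidate = STAGING / template_name
        result = subprocess.run(["./validator", str(candidate)])
        if result.returncode != 0:
            raise RuntimeError(f"{template_name} failed validation, not promoting")
        APPROVED.mkdir(parents=True, exist_ok=True)
        shutil.move(str(candidate), str(APPROVED / template_name))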

Once the person making the change is happy with the update, this change should go through some pre-commit tests, and once the tools are also happy with it, it should be committed to version control, as should every change which affects how the program is built and run. The source file should then be cryptographically signed, starting a validatable audit trail which ensures that what was committed and what was installed have something to do with each other.
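
The signing step might look something like this sketch, using the third-party cryptography package with an Ed25519 key. Real key handling would use an HSM or a signing service rather than a key generated on the spot, and the file path is invented:

    from pathlib import Path

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Sketch: sign a committed channel source file, writing the signature
    # alongside it.

    def sign_file(path: str, key: Ed25519PrivateKey) -> None:
        data = Path(path).read_bytes()
        Path(path + ".sig").write_bytes(key.sign(data))

    key = Ed25519PrivateKey.generate()  # illustration only, never ad hoc in production
    sign_file("templates/approved/template-291.txt", key)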

At every stage after this, the automated tooling should check the signature to make sure that the file has not been corrupted, then run its tests, and if anything fails, stop the roll-out of the patch. You never ship buggy or untested code. A case can be made that you don't have time for testing, but if it is that urgent, you really don't have time to deal with having shipped buggy or broken code either. If the process creates new files, these should also be signed.
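
The matching check that every later stage runs before doing any work is equally small. A sketch, continuing the example above:

    from pathlib import Path

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

    # Sketch: verify the signature written by the previous stage, and stop the
    # roll-out if the file has been corrupted or tampered with.

    def verify_or_abort(path: str, public_key: Ed25519PublicKey) -> bytes:
        data = Path(path).read_bytes()
        signature = Path(path + ".sig").read_bytes()
        try:
            public_key.verify(signature, data)
        except InvalidSignature:
            raise RuntimeError(f"{path} failed signature verification, stopping roll-out")
        return data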

Here is the most important point: Once the change has been committed to version control, everything should be automated, with the intent that no broken build should ever escape the company. At no point should it be possible to deploy without it going through the full testing pathway. This is where Crowdstrike made its biggest mistake. If it had gone through the same testing as a driver update, it could never have been shipped.

Continuous Integration

From here, your Continuous Integration process takes over, creating the deployment script, signing it, and deploying it to your test machines. This includes smoke tests, which tell you the machine actually started properly, and functional unit and integration regression tests against the public API of the system, which tell you that it still does what it did last time. By skipping this step, Crowdstrike missed another chance to prevent the release of the patch, as it would have crashed the test machines.
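
A smoke test can be as simple as checking that each test machine came back up and that the agent answers on whatever health interface it exposes. A sketch, with an invented port, endpoint, and machine names:

    import urllib.request

    # Sketch of a post-deployment smoke test: if any test machine fails to
    # answer its (hypothetical) health endpoint, the pipeline stops here.

    def smoke_test(host: str) -> bool:
        try:
            with urllib.request.urlopen(f"http://{host}:9000/health", timeout=10) as resp:
                return resp.status == 200
        except OSError:
            return False

    if not all(smoke_test(h) for h in ["win11-test-01", "win2019-test-01"]):
        raise SystemExit("smoke tests failed, stopping the pipeline")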

Every platform you deploy to needs at least one test machine, else you are just hoping the change won't cause problems there. Failure to have a Debian or Red Hat Enterprise Linux test machine was one of the reasons for the earlier Linux outage in April. While that bug was in the Linux kernel, they would still have spotted it before most of their customers did.

Continuous Delivery

Once you have passed continuous integration testing, you know that the software still functions like it did before. At this point you move automatically to the Continuous Delivery pipeline. This is where you run every other type of test to try and spot things like performance degradations, and any other aspect of the software you care about. If it fails, again you stop the deployment. It would have been caught here as well.
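
A performance gate in that pipeline can be a straightforward comparison against the previous release. A sketch (the metric names and the 10% threshold are arbitrary examples):

    # Sketch of a performance regression gate: compare measured metrics against
    # the last release and stop the deployment if anything got noticeably worse.

    def check_performance(baseline: dict, current: dict, tolerance: float = 0.10) -> None:
        for metric, old_value in baseline.items():
            new_value = current[metric]
            if new_value > old_value * (1 + tolerance):
                raise RuntimeError(
                    f"{metric} regressed from {old_value} to {new_value}, stopping deployment"
                )

    check_performance(
        {"boot_time_ms": 4200, "scan_cpu_percent": 3.0},
        {"boot_time_ms": 4300, "scan_cpu_percent": 2.8},
    )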

Canary Releasing

Now that it has passed continuous delivery, the system automatically uploads the patch to your delivery mechanism. Again it is worth pointing out that it should not be possible to get here without going through the full testing process. Allowing this to be done after minimal testing was another major failure at Crowdstrike.

Canary Releasing is the process of deploying to increasingly large groups, each only after the previous group has deployed successfully. If you have got to this point, you have already deployed to testing, so the next group should be the machines at your own company. If anything goes wrong, you have your own technical support people on site to fix it, and again you stop the deployment. This process of using your own product is called dogfooding.

When a customer installs the software, it is possible to ask whether the machine can be easily fixed, or whether it is hard to get to or otherwise locked down. This enables you to add another canary group, by selecting the machines where a fix is easy and releasing to them next. If something goes wrong here, it is bad, because the problem should have been caught earlier, meaning you need to add more tests, but at least it did not fail on a flight information board 20 feet up in the air.

Finally, after all the other groups have passed, you can install on the awkward machines. By this time it should be safe, because you have already deployed to lots of other machines without issues. By not doing canary releasing, Crowdstrike made another huge mistake. If they had, most of the machines causing the ongoing bad press would never have reached this point, and it would not have made the news as more than a minor issue. Instead, it took down servers, billing systems, multiple hospital related systems, and most notably, Delta Airlines. It is a public relations disaster for Crowdstrike.
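
Put together, the whole canary scheme is just an ordered list of groups with a health check between each step. A sketch (the group names, the callbacks, and the 1% threshold are all invented):

    # Sketch of a canary roll-out loop: deploy to each group in order, and only
    # move on if the failure rate for that group stays below a threshold.

    GROUPS = ["internal_dogfood", "easy_to_fix_customers", "everyone_else"]

    def canary_rollout(deploy, failure_rate, threshold: float = 0.01) -> None:
        for group in GROUPS:
            deploy(group)
            if failure_rate(group) > threshold:
                raise RuntimeError(f"roll-out halted after failures in {group}")

Here deploy() pushes the patch to one group and failure_rate() asks the monitoring system how those machines are doing; both are placeholders for whatever tooling the vendor actually has.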

Mitigation Measures

Even when the patch started deploying, there were things which could have helped. If the downloader had checked file signatures, it could have detected the corruption Crowdstrike claimed happened after testing. If the driver had checked that its inputs made sense, it could have isolated the broken files. If the updater had phoned home after a successful reboot, the lack of responses could have stopped the deployment. Crowdstrike did none of this, so the broken file crashed the kernel without ever alerting them. If your job is to actively monitor the kernel for future threats, not even reporting whether an update succeeded is unforgivable.
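
The phone-home check in particular is tiny. A sketch of the server-side logic (the numbers and names are invented):

    # Sketch of a "did anyone report back?" check: if machines that downloaded
    # the update stop reporting successful reboots, halt the deployment.

    def should_halt(downloads: int, success_reports: int, min_ratio: float = 0.9) -> bool:
        if downloads < 100:  # too few data points to judge yet
            return False
        return (success_reports / downloads) < min_ratio

    if should_halt(downloads=10_000, success_reports=120):
        print("halting deployment: most machines never reported a successful reboot")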

They could also have checked after the reboot whether it had crashed, and downgraded the boot start flag they provide to Windows. It is fine not to want Windows to boot without your driver while it is working, but when it fails, recovering the machine is usually much more important. At that point the machine reboots without the driver, and the downloader can phone home and stop the deployment. It can also disable the last update, so the machine can be restarted in a working condition. And of course it can look out for a patch to fix the problem.
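
The endpoint-side recovery logic could be equally simple, something along these lines (the crash counter and the two callbacks are invented placeholders for whatever mechanism the agent actually uses):

    # Sketch of endpoint-side recovery: after a few crashes in a row during
    # early boot, disable the last channel update and stop insisting on being
    # boot-critical, so the machine can come back up and phone home.

    MAX_BOOT_CRASHES = 3

    def on_early_boot(crash_count: int, disable_last_update, clear_boot_start_flag) -> str:
        if crash_count >= MAX_BOOT_CRASHES:
            disable_last_update()      # roll back to the previous channel files
            clear_boot_start_flag()    # let Windows boot without the driver
            return "recovered"
        return "normal"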

Microsoft don't look good here either, as this boot loop bug was triggered in 2016, 2018, 2022, 2023, and now in 2024. After any of those instances they could have fixed it, but they did not. The fix is not even hard. They already provide the WHQL testing service, which you need to get your drivers signed. It could easily test whether the driver can be disabled, and treat the boot start flag as it currently does for working drivers.

If the driver can be disabled, Windows could treat the boot start flag as a request rather than a requirement, and disable the driver on reboot like it does with other failing drivers. Again, this would let the system reboot and recover.

As a result of neither company doing anything to prevent this, Delta Airlines looks set to sue for gross negligence to try to get back some of their losses, and I doubt they will be the only ones.

Ongoing Public Relations mistakes

Crowdstrike continue to make bad choices, making the bad press even worse. The CEO, who was also at McAfee during a similar boot loop issue, failed to put any safeguards in place. Later he denied that the file full of zeroes had anything to do with the crash, while saying in the next sentence that deleting that same file fixes the problem. The first makes him look like a fool, the second like a liar. Then they thought a $10 Uber Eats gift voucher would be good enough compensation for the damage, but forgot to tell Uber Eats how many they were sending, so the vouchers were flagged as fraud and cancelled.

Microsoft similarly got it wrong, choosing to blast the EU for not letting them roll out PatchGuard in Windows Vista in a way that extended their monopoly to cover antivirus software as well. The EU told them to go away and think again, and rather than making Windows Defender use the same APIs as everyone else, they chose to drop it. That was Microsoft's choice, not the EU's. They were even pushing back against regulators who wanted to make sure such companies were up to the job just two days before the incident.

People gloating from other platforms, or about other programming languages, don't come out of this well either. No general purpose programming language is immune to bugs, and any kernel which allows third party drivers can be crashed by a bug in one of them.

Damages and Lawsuits

As a result of the outage, the initial estimate is that it caused over $10 billion in losses, hitting pretty much every single Fortune 500 company, almost all of which have large legal departments and would like some of that money back.

One of the first lawsuits is a class action suit from Crowdstrike shareholders, claiming that the previously mentioned statements by the company gave a false impression of its technical competence, while the failure to address the risks inherent in how they actually worked, as opposed to how they claimed to work, resulted in a massively overinflated share price, with a high likelihood that a completely predictable problem would cause that share price to collapse. This is a lawsuit which is likely to succeed.

Even worse, if it turns out any of the board disposed of a significant number of shares between the statements and the crash, they could be investigated for running a pump and dump scheme or insider trading, both of which carry serious penalties.

Delta Airlines has already hired a law firm to sue both Crowdstrike and Microsoft for their failures to stop this preventable outage, and to try to get back the money they lost. The starting estimate is over $500 million, and it is expected to rise as the true costs of the outage emerge. Given the gross negligence involved at both companies, it is also highly likely to succeed.

I have no doubt that these are just the first lawsuits of many.