CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

tjukken · Jul 24, 2024

"full Root Cause Analysis". Well, one of the problems was that their software is running as root.

cyberfunk · Jul 24, 2024

How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

Emon · Jul 24, 2024

tldr "we'll prevent this by using bog standard industry practices that are literally taught in schools."

Then they give out $10 Uber Eats gift cards as a "thank you" and then even those don't work because they themselves cancelled them after issuance. Can't even roll out a fucking gift card.

Just astonishing.

Edit: a little anecdote from me, I recently interviewed there for an infrastructure/platform engineering position. Development pipeline was literally something my team would own (for that specific area/team not whole company). I went through the full round and got no offer, but it was a huge red flag to me that the hiring manager, when asked why he liked working at CrowdStrike, wouldn't shut up about the "stock price." He seemed not very interested in much beyond the fact that it was "fast growing." I suspect many managers were there for the IPO and don't give a fuck as long as they get their payout. To be fair it was an overall good interview process with no BS and people I know do like working there. But still...

markgo · Jul 24, 2024

So they admit they deployed worldwide all at once. That has been against best practices for large scale deployments for more than two decades. Sue them into oblivion.

FinallyAnAccount · Jul 24, 2024

So are they saying that they relied on a content validator instead of pushing to an actual system? The fact that this happened to every Windows system it touched is damning.

cyberfunk · Jul 24, 2024

tjukken said:
"full Root Cause Analysis". Well, one of the problems was that their software is running as root.

As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.

What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.

drlava · Jul 24, 2024

<quote>to allow its software to "gather telemetry on possible novel threat techniques."</quote>
deployed a broadening data collection update at midnight to all devices.. this deserves a deeper dive as well.

jhodge · Jul 24, 2024

Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems:

"On March 05, 2024, a stress test of the IPC Template Type was executed in our staging environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use."

...but not for the update that caused this problem:

"On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."

Which at least strongly implies that content updates are not routinely tested in staging, but rather solely through the validator tool. I'm sure there are reasons, probably involving speed of deployment since they're previously mentioned deploying these content updates multiple times per day.

I hope that one of the lessons learned from this is that there is no substitute for testing in a pre-production environment that represents prod with some degree of accuracy.

szbalint · Jul 24, 2024

The company is specifically including "additional validation checks to the Content Validator"

They need to go beyond this and either switch the kernel modules to a memory safe language and implementation and/or switch to better standardized APIs so that no validation check issue can result in a crash loop like this ever again.

Edit: on more recent linux kernels Crowdstrike uses ebpf-sandboxed code, something similar would be needed on Windows

cyberfunk · Jul 24, 2024

jhodge said:
Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems:

"On March 05, 2024, a stress test of the IPC Template Type was executed in our staging environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use."

...but not for the update that caused this problem:

"On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."

Which at least strongly implies that content updates are not routinely tested in staging, but rather solely through the validator tool. I'm sure there are reasons, probably involving speed of deployment since they're previously mentioned deploying these content updates multiple times per day.

I hope that one of the lessons learned from this is that there is no substitute for testing in a pre-production environment that represents prod with some degree of accuracy.

Yea, you can't just rely on unit tests and validators and sub-module checks, etc. They are good practice but are insufficient to ship. Seems like someone didn't learn that very important lesson in their coding bootcamp.

The embarrassing thing ? I'm a fucking product manager who doesn't write a line of code and I know this. It's so basic that the "dumb product guys" who don't understand all the details of engineering devops get it.

thrillgore · Jul 24, 2024

I don't think they tested at all.

cyberfunk · Jul 24, 2024

Emon said:
I went through the full round and got no offer, but it was a huge red flag to me that the hiring manager, when asked why he liked working at CrowdStrike, wouldn't shut up about the "stock price." He seemed not very interested in much beyond the fact that it was "fast growing." I suspect many managers were there for the IPO and don't give a fuck as long as they get their payout.

Send him an email, and ask him how he likes the stock price now :judge:

starglider · Jul 24, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was a result of this ultra-improbable cascade of a crash dump in the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine's being compromised, and the attackers just getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.

OtherSystemGuy · Jul 24, 2024

Let me get this straight. Ignoring the boneheaded deployment issue, a company that sells cyber security software has a sloppy verification process? That seems to belie the very essence of software testing.

Why would I ever trust them?

zombi3g · Jul 24, 2024

Who do they blame for shipping the update to all their clients simultaneously? Staggered releases are a thing for a reason.

cyberfunk · Jul 24, 2024

starglider said:
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was a result of this ultra-improbable cascade of a crash dump in the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine's being compromised, and the attackers just getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.

Yea, i've been part of things that have gone through the root cause analysis and come out.. "ok.. that's a one in a million chance of all those bad things lining up like this, and it shouldnt have happened but it's understandable how this didn't get caught" .

This was absolutely not that. It was clearly one of those very preventable things.

Internet_Explorer · Jul 24, 2024

The biggest change will probably be "a staggered deployment strategy for Rapid Response Content" going forward. In a staggered deployment system, updates are initially released to a small group of PCs, and then availability is slowly expanded once it becomes clear that the update isn't causing major problems.

How is it possible, in the year 2024, that this wasn't already in place? Pure incompetence.

l8gravely · Jul 24, 2024

Obviously they don't do fuzz-testing on their validator, feeding it all kinds of wonky input to see that it catches things. This is where you have a prize of a beer party once a month for the team that has the fewest bugs, the developers or the SQA people who break things. If SQA can break it, then they get the beer party!

rbirling · Jul 24, 2024

Ok so is the company liable for any of the downstream damages? (I feel like I know the answer to this.)

grimmm · Jul 24, 2024

jhodge said:
Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems:

"On March 05, 2024, a stress test of the IPC Template Type was executed in our staging environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use."

...but not for the update that caused this problem:

"On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."

Swiss cheese theory of holes aligning, a bad template instance shouldn't be able to hose the core sensor, but a second bug might make that possible (especially if broken content is typically weeded out at the validation phase).

https://en.m.wikipedia.org/wiki/Swiss_cheese_model

l8gravely · Jul 24, 2024

cyberfunk said:
Yea, i've been part of things that have gone through the root cause analysis and come out.. "ok.. that's a one in a million chance of all those bad things lining up like this, and it shouldnt have happened but it's understandable how this didn't get caught" .

This was absolutely not that. It was clearly one of those very preventable things.

It's the swiss cheese model... all the holes lined up and the problem got through. This is big in aviation where one little mistake if it lines up with other little problems can litterally kill you.

grimmm · Jul 24, 2024

starglider said:
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was a result of this ultra-improbable cascade of a crash dump in the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine's being compromised, and the attackers just getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.

I believe no hard evidence for the crash dump theory was ever provided and as such it was later recanted.

msawzall · Jul 24, 2024

Internet_Explorer said:
How is it possible, in the year 2024, that this wasn't already in place? Pure incompetence.

Money.

Happy Medium · Jul 24, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

Because the above would cost money and require some degree of forethought and caution, and that's a crime for a CEO! Don't you know that a CEO's highest duty is to make as much money as possible? This duty explains and totally excuses literally putting lives at risk, abusing their employees, breaking laws, having no morals, and assisting others with crimes. Don't you know that a CEO doesn't want to torture small puppies and take food from the mouths of infants, their fiduciary duty requires them to do so. That's why they pay themselves so much, and make everyone but themselves pay the price for their failures, because of how much being a leader is an imposition on them! /s

Robscura · Jul 24, 2024

Who Validates the Validator?

Rolling out software updates is hard, especially across machines that are all configured differently. While we should applaud their use of a 'Validation' step, whatever validity that step possesses must be tested and verified by another process.

They're really just moving the problem to another level of abstraction. This approach is really "Turtles all the way down."

lopgok · Jul 24, 2024

This indicates a cascading failure.
They should have tested their update on a real computer.
They should have done a staggered deployment.
They should have used a memory safe language.
They should have a parser that doesn't do null pointer dereferences on bad input.
I don;t trust crowdstrike. They have demonstrate epic cluelessness when it comes to basic software and security design and implantation. I suspect many customers will move to a better solution.

halars · Jul 24, 2024

szbalint said:
They need to go beyond this and either switch the kernel modules to a memory safe language and implementation and/or switch to better standardized APIs so that no validation check issue can result in a crash loop like this ever again.

Edit: on more recent linux kernels Crowdstrike uses ebpf-sandboxed code, something similar would be needed on Windows

A memory safe language would not fix the issue with trusting/executing external files.

no_free_lunches_for_ai · Jul 24, 2024

FinallyAnAccount said:
So are they saying that they relied on a content validator instead of pushing to an actual system? The fact that this happened to every Windows system it touched is damning.

I think that's code for "Unit tests".

SplatMan_DK · Jul 24, 2024

Some analysts online have shown debugging data from crash dumps and minimal reverse engineering. By their account it's a null reference to a pointer in a system driver. That's something unit testing should have easily caught ... if used.

So here is what we know.

Trivial error in the software, running as a system driver.
Insufficient testing.
Insufficient control over large scale rollouts.
Not previously sharing release notes with customers.
Not previously allowing customers to control timing of rollouts.
Not previously allowing customers to use automated staged rollouts.

As someone working with governance in Enterprise IT, I am astonished they got this big without their customers challenging these things.

It's truly a WTF moment for the industry.

cyberfunk · Jul 24, 2024

Sheep Disorder said:
As been addressed far too many times, it's laziness and incompetence not a necessary evil. Some that should also be attributed to Microsoft especially OS design.

I would like to know how you propose to run malware detection and analysis from an unprivileged position that's not allowed to do things like arbitrary memory and process and file inspections ?

alansh42 · Jul 24, 2024

Another worrying issue is that if their agent will accept a malformed channel file, how hard would it be to get it to load a phony channel file? We already know its validation is broken.

real mikeb_60 · Jul 24, 2024

cyberfunk said:
Yea, i've been part of things that have gone through the root cause analysis and come out.. "ok.. that's a one in a million chance of all those bad things lining up like this, and it shouldnt have happened but it's understandable how this didn't get caught" .

This was absolutely not that. It was clearly one of those very preventable things.

Many of the aircraft investigations I've seen have a strong flavor of "how well the holes in the swiss cheese slices lined up" to them. Somehow, I don't expect that in this case, but it's possible.

Fingers crossed, I haven't noticed many (any?) major problems with MS updates recently. Including the sometimes several-times-daily "security intelligence" pushes.

EDIT: ninja'd of course.

samanime · Jul 24, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

Exactly. This mistake should bankrupt them, because they clearly are MAJORLY deficit in their testing and controls of something so vital.

There should have been MULTIPLE layers of testing that could have all caught this issue. It's the year 2024. We've had these testing best practices locked down for ages.

This wasn't some crazy, super-hard-to-catch corner case that affects only 0.00001% of users with a very unusual and specific configuration and circumstances. This affected everyone (or nearly so).

It should never have happened.

Allowing subscribing to release notes and adding a few more verifications is not a proper fix.

GlockenspielHero · Jul 24, 2024

cyberfunk said:
Send him an email, and ask him how he likes the stock price now

Right now it's up almost 7% for the year.

Sure, it had been up a lot more than that a month ago, but the market doesn't seem to think this is an existential crisis.

/Still can't believe they just rolled the patch out without testing on a real box.

fuzzyfuzzyfungus · Jul 24, 2024

The aspect that seems particularly alarming(and which they are not talking about) is that the kernel driver component was apparently willing to accept a malformed update on the basis of nothing but a header and then keel over and die.

Certainly having an actual testing process would be nice; but (especially when the whole point of your software is that there might be adversarial activity on the system) it seems like a deep and fundamental problem that such a high-privilege/high-criticality component is so brittle against malformed input.

Even the most impeccable testing can only assure you that your inputs won't cause your driver to misbehave; they can't assure you that you will always remain in control of which inputs your driver ends up chewing on.

sword_9mm · Jul 24, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

Kind of wonder why they don't test internally.

Then if it screws up it's just crowdstrike pcs down and they can deal with it.

barich · Jul 24, 2024

alansh42 said:
Another worrying issue is that if their agent will accept a malformed channel file, how hard would it be to get it to load a phony channel file? We already know its validation is broken.

I was wondering about this, too. Seems like potentially low-hanging fruit for malware authors.

Of course there's plenty of malware around that doesn't do as much damage as CrowdStrike themselves did.

A.Felix · Jul 24, 2024

Unit testing and validators are tools that help you when you're refactoring code or adding new stuff. They're there to prevent you from breaking things when you go in there making changes, and enforce requirements that the code is supposed to test against. They're most definitely not intended to be replacement for actual testing. This thing had a 100% breakage rate. Do they seriously push out stuff without putting it on even a single actual machine? Because that's very concerning. Hell, even an integration test should've caught this. They say there was a bug in the validator, and sure, I believe that, but they never ran the file against the real parser or whatever was going to consume it? Because it sounds like that's what they did or they would've noticed the crash. How is this not a part of the testing process?

CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

Ars Praefectus

Ars Scholae Palatinae

Account Banned

Ars Praefectus

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Tribunus Angusticlavius

Ars Centurion

Ars Scholae Palatinae

Ars Praefectus

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Centurion

Ars Scholae Palatinae

Smack-Fu Master, in training

Wise, Aged Ars Veteran

Ars Scholae Palatinae

Wise, Aged Ars Veteran

Ars Tribunus Angusticlavius

Ars Tribunus Militum

Smack-Fu Master, in training

Seniorius Lurkius

Wise, Aged Ars Veteran

Wise, Aged Ars Veteran

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Tribunus Militum

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Legatus Legionis

Ars Legatus Legionis

Ars Tribunus Militum