CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

cyberfunk

Ars Scholae Palatinae
1,400
How on earth did they not:

1) Have a staggered rollout for such mission-critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content validator that's not actually a real end-to-end test?

It sounds like they test their content updates with a parser. Fine, that's great, but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff. This isn't just some random Node.js module where you aren't necessarily culpable for the downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either of these things alone would've saved millions or billions of customer dollars.
 
Last edited:
Upvote
440 (443 / -3)
tl;dr: "We'll prevent this by using bog-standard industry practices that are literally taught in schools."

Then they give out $10 Uber Eats gift cards as a "thank you," and even those don't work, because they themselves canceled the cards after issuance. They can't even roll out a fucking gift card.

Just astonishing.


Edit: a little anecdote from me. I recently interviewed there for an infrastructure/platform engineering position; the development pipeline was literally something my team would own (for that specific area/team, not the whole company). I went through the full round and got no offer, but it was a huge red flag to me that the hiring manager, when asked why he liked working at CrowdStrike, wouldn't shut up about the "stock price." He seemed not very interested in much beyond the fact that it was "fast growing." I suspect many managers were there for the IPO and don't give a fuck as long as they get their payout. To be fair, it was an overall good interview process with no BS, and people I know do like working there. But still...
 
Last edited:
Upvote
311 (313 / -2)

cyberfunk

Ars Scholae Palatinae
1,400
"full Root Cause Analysis". Well, one of the problems was that their software is running as root.
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do its job. It's an unfortunate but necessary evil.

What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.
 
Upvote
133 (141 / -8)

jhodge

Ars Tribunus Angusticlavius
8,663
Subscriptor++
Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems:

"On March 05, 2024, a stress test of the IPC Template Type was executed in our staging environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use."

...but not for the update that caused this problem:

"On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."

Which at least strongly implies that content updates are not routinely tested in staging, but rather solely through the validator tool. I'm sure there are reasons, probably involving speed of deployment, since they've previously mentioned deploying these content updates multiple times per day.

I hope that one of the lessons learned from this is that there is no substitute for testing in a pre-production environment that represents prod with some degree of accuracy.
 
Upvote
137 (137 / 0)

szbalint

Ars Centurion
305
Subscriptor++
The company is specifically including "additional validation checks to the Content Validator"
They need to go beyond this and either switch the kernel modules to a memory-safe language and implementation, and/or switch to better-standardized APIs, so that no validation-check issue can ever again result in a crash loop like this.

Edit: on more recent Linux kernels, CrowdStrike uses eBPF-sandboxed code; something similar would be needed on Windows
 
Upvote
11 (32 / -21)

cyberfunk

Ars Scholae Palatinae
1,400
Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems:

"On March 05, 2024, a stress test of the IPC Template Type was executed in our staging environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use."

...but not for the update that caused this problem:

"On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."

Which at least strongly implies that content updates are not routinely tested in staging, but rather solely through the validator tool. I'm sure there are reasons, probably involving speed of deployment, since they've previously mentioned deploying these content updates multiple times per day.

I hope that one of the lessons learned from this is that there is no substitute for testing in a pre-production environment that represents prod with some degree of accuracy.
Yea, you can't just rely on unit tests, validators, sub-module checks, etc. They're good practice, but they're insufficient to ship on. Seems like someone didn't learn that very important lesson in their coding bootcamp.

The embarrassing thing? I'm a fucking product manager who doesn't write a line of code, and I know this. It's so basic that even the "dumb product guys" who don't understand all the details of engineering DevOps get it.
 
Upvote
157 (157 / 0)

cyberfunk

Ars Scholae Palatinae
1,400
I went through the full round and got no offer, but it was a huge red flag to me that the hiring manager, when asked why he liked working at CrowdStrike, wouldn't shut up about the "stock price." He seemed not very interested in much beyond the fact that it was "fast growing." I suspect many managers were there for the IPO and don't give a fuck as long as they get their payout.
Send him an email, and ask him how he likes the stock price now :judge:
:sneaky:
 
Upvote
193 (195 / -2)

starglider

Ars Scholae Palatinae
1,141
Subscriptor++
How on earth did they not:

1) Have a staggered rollout for such mission-critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content validator that's not actually a real end-to-end test?

It sounds like they test their content updates with a parser. Fine, that's great, but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff. This isn't just some random Node.js module where you aren't necessarily culpable for the downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either of these things alone would've saved millions or billions of customer dollars.
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was the result of an ultra-improbable cascade: a crash dump from the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine being compromised, the attackers getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
 
Upvote
235 (236 / -1)

cyberfunk

Ars Scholae Palatinae
1,400
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was the result of an ultra-improbable cascade: a crash dump from the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine being compromised, the attackers getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
Yea, I've been part of things that have gone through root cause analysis and come out with "OK, that's a one-in-a-million chance of all those bad things lining up like this; it shouldn't have happened, but it's understandable how it didn't get caught."

This was absolutely not that. It was clearly one of those very preventable things.
 
Upvote
114 (114 / 0)

Internet_Explorer

Ars Centurion
318
Subscriptor++
The biggest change will probably be "a staggered deployment strategy for Rapid Response Content" going forward. In a staggered deployment system, updates are initially released to a small group of PCs, and then availability is slowly expanded once it becomes clear that the update isn't causing major problems.
How is it possible, in the year 2024, that this wasn't already in place? Pure incompetence.
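For what it's worth, the staggered-rollout logic everyone is asking for fits in a screenful of code. This is only a toy sketch: the ring sizes, the `is_healthy` telemetry callback, and the failure threshold are all invented for illustration.

```python
import random

# Hypothetical ring sizes: fractions of the fleet per stage.
# A real deployment system gates each expansion on health telemetry.
RINGS = [0.001, 0.01, 0.10, 1.00]

def staged_rollout(fleet, is_healthy, max_failure_rate=0.01):
    """Deploy to successively larger rings, halting if failures spike.

    `is_healthy(host)` stands in for crash/heartbeat telemetry.
    """
    remaining = list(fleet)
    random.shuffle(remaining)
    deployed = []
    for ring in RINGS:
        target = int(len(fleet) * ring)
        batch = remaining[:target - len(deployed)]
        remaining = remaining[target - len(deployed):]
        deployed.extend(batch)
        failures = sum(1 for host in deployed if not is_healthy(host))
        if deployed and failures / len(deployed) > max_failure_rate:
            return "halted", deployed  # blast radius stops at this ring
    return "complete", deployed
```

With a 100%-crash update like this one, the rollout would halt at the first (canary) ring instead of reaching the whole fleet.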
 
Upvote
100 (100 / 0)

l8gravely

Ars Scholae Palatinae
729
Subscriptor++
Obviously they don't do fuzz testing on their validator, feeding it all kinds of wonky input to see that it catches things. This is where you offer a prize, a beer party once a month, to whichever team wins: the developers if the fewest bugs get through, or the SQA people if they break things. If SQA can break it, they get the beer party!
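A minimal version of that fuzzing idea, with a toy stand-in for the validator (the `CHNL` header format is invented; the point is that random input must be rejected cleanly, never crash the validator):

```python
import random

def validate_channel_file(data: bytes) -> bool:
    """Toy stand-in for a content validator: demands a magic header
    and a length field that matches the payload."""
    if len(data) < 6 or data[:4] != b"CHNL":
        return False
    declared = int.from_bytes(data[4:6], "big")
    return declared == len(data) - 6

def fuzz(validator, rounds=10_000, seed=42):
    """Throw random byte strings at the validator; any uncaught
    exception is a bug (a validator must reject, never crash)."""
    rng = random.Random(seed)
    crashes = 0
    for _ in range(rounds):
        blob = bytes(rng.randrange(256) for _ in range(rng.randrange(0, 64)))
        try:
            validator(blob)
        except Exception:
            crashes += 1
    return crashes
```

A campaign like this only proves the validator doesn't crash on garbage, of course; catching a file that *passes* validation but kills the consumer needs the real parser in the loop too.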
 
Upvote
63 (63 / 0)

grimmm

Wise, Aged Ars Veteran
143
Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems:

"On March 05, 2024, a stress test of the IPC Template Type was executed in our staging environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use."

...but not for the update that caused this problem:

"On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."
Swiss cheese theory of holes aligning: a bad template instance shouldn't be able to hose the core sensor, but a second bug might make that possible (especially if broken content is typically weeded out at the validation phase).

https://en.m.wikipedia.org/wiki/Swiss_cheese_model
 
Upvote
-5 (11 / -16)

l8gravely

Ars Scholae Palatinae
729
Subscriptor++
Yea, I've been part of things that have gone through root cause analysis and come out with "OK, that's a one-in-a-million chance of all those bad things lining up like this; it shouldn't have happened, but it's understandable how it didn't get caught."

This was absolutely not that. It was clearly one of those very preventable things.
It's the Swiss cheese model... all the holes lined up and the problem got through. This is big in aviation, where one little mistake, if it lines up with other little problems, can literally kill you.
 
Upvote
23 (29 / -6)

grimmm

Wise, Aged Ars Veteran
143
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was the result of an ultra-improbable cascade: a crash dump from the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine being compromised, the attackers getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.
I believe no hard evidence for the crash-dump theory was ever provided, and as such it was later recanted.
 
Upvote
19 (21 / -2)

Happy Medium

Ars Tribunus Militum
2,147
Subscriptor++
How on earth did they not:

1) Have a staggered rollout for such mission-critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content validator that's not actually a real end-to-end test?

It sounds like they test their content updates with a parser. Fine, that's great, but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff. This isn't just some random Node.js module where you aren't necessarily culpable for the downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either of these things alone would've saved millions or billions of customer dollars.
Because the above would cost money and require some degree of forethought and caution, and that's a crime for a CEO! Don't you know that a CEO's highest duty is to make as much money as possible? That duty explains, and totally excuses, literally putting lives at risk, abusing their employees, breaking laws, having no morals, and assisting others with crimes. Don't you know that a CEO doesn't *want* to torture small puppies and take food from the mouths of infants; their fiduciary duty simply requires them to do so. That's why they pay themselves so much and make everyone but themselves pay the price for their failures, because of what an imposition being a leader is on them! /s
 
Upvote
16 (37 / -21)

Robscura

Smack-Fu Master, in training
58
Subscriptor++
Who Validates the Validator?

Rolling out software updates is hard, especially across machines that are all configured differently. While we should applaud their use of a validation step, whatever validity that step possesses must itself be tested and verified by another process.

They're really just moving the problem to another level of abstraction. It's turtles all the way down.
 
Last edited:
Upvote
36 (37 / -1)

lopgok

Seniorius Lurkius
21
This indicates a cascading failure.
They should have tested their update on a real computer.
They should have done a staggered deployment.
They should have used a memory-safe language.
They should have a parser that doesn't dereference null pointers on bad input.
I don't trust CrowdStrike. They have demonstrated epic cluelessness when it comes to basic software and security design and implementation. I suspect many customers will move to a better solution.
 
Upvote
36 (43 / -7)

halars

Wise, Aged Ars Veteran
166
They need to go beyond this and either switch the kernel modules to a memory-safe language and implementation, and/or switch to better-standardized APIs, so that no validation-check issue can ever again result in a crash loop like this.

Edit: on more recent Linux kernels, CrowdStrike uses eBPF-sandboxed code; something similar would be needed on Windows
A memory-safe language would not fix the issue of trusting and executing external files.
 
Upvote
68 (69 / -1)

SplatMan_DK

Ars Tribunus Angusticlavius
8,234
Subscriptor++
Some analysts online have shown debugging data from crash dumps and minimal reverse engineering. By their account, it's a null pointer dereference in a system driver. That's something unit testing should have easily caught... if it had been used.

So here is what we know.

  • Trivial error in the software, running as a system driver.
  • Insufficient testing.
  • Insufficient control over large scale rollouts.
  • Not previously sharing release notes with customers.
  • Not previously allowing customers to control timing of rollouts.
  • Not previously allowing customers to use automated staged rollouts.

As someone working with governance in Enterprise IT, I am astonished they got this big without their customers challenging these things.

It's truly a WTF moment for the industry.
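The "trivial error" pattern the dump analyses point to is the classic one: a header field trusted without checking it against the actual payload. A hypothetical sketch of the missing guard (the entry size and format here are invented for illustration, not CrowdStrike's actual layout):

```python
def read_template_entries(blob: bytes, count: int) -> list[bytes]:
    """Parse `count` fixed-size entries from a content payload.

    The key defensive step: a header claiming more entries than the
    payload holds must be rejected up front, not discovered mid-parse
    as an out-of-bounds read in a kernel driver.
    """
    ENTRY_SIZE = 8  # invented entry size
    if count < 0 or count * ENTRY_SIZE > len(blob):
        raise ValueError("header entry count exceeds payload size")
    return [blob[i * ENTRY_SIZE:(i + 1) * ENTRY_SIZE] for i in range(count)]
```

In a memory-unsafe driver, skipping that bounds check is exactly how a bad count becomes a wild pointer read and a bugcheck.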
 
Upvote
114 (116 / -2)

cyberfunk

Ars Scholae Palatinae
1,400
As has been addressed far too many times, it's laziness and incompetence, not a necessary evil. Some of that should also be attributed to Microsoft, especially its OS design.
I would like to know how you propose to run malware detection and analysis from an unprivileged position that isn't allowed to do things like arbitrary memory, process, and file inspection.
 
Upvote
16 (26 / -10)

real mikeb_60

Ars Tribunus Angusticlavius
13,002
Subscriptor
Yea, I've been part of things that have gone through root cause analysis and come out with "OK, that's a one-in-a-million chance of all those bad things lining up like this; it shouldn't have happened, but it's understandable how it didn't get caught."

This was absolutely not that. It was clearly one of those very preventable things.
Many of the aircraft investigations I've seen have a strong flavor of "how well the holes in the Swiss cheese slices lined up" to them. Somehow I don't expect that in this case, but it's possible.

Fingers crossed, I haven't noticed many (any?) major problems with MS updates recently, including the sometimes several-times-daily "security intelligence" pushes.

EDIT: ninja'd, of course.
 
Upvote
17 (17 / 0)

samanime

Ars Tribunus Militum
1,878
Subscriptor++
How on earth did they not:

1) Have a staggered rollout for such mission-critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content validator that's not actually a real end-to-end test?

It sounds like they test their content updates with a parser. Fine, that's great, but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff. This isn't just some random Node.js module where you aren't necessarily culpable for the downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either of these things alone would've saved millions or billions of customer dollars.
Exactly. This mistake should bankrupt them, because they are clearly MAJORLY deficient in their testing and controls of something so vital.

There should have been MULTIPLE layers of testing, any one of which could have caught this issue. It's the year 2024. We've had these testing best practices locked down for ages.

This wasn't some crazy, super-hard-to-catch corner case that affects only 0.00001% of users with a very unusual and specific configuration and circumstances. This affected everyone (or nearly so).

It should never have happened.

Allowing subscribing to release notes and adding a few more verifications is not a proper fix.
 
Upvote
48 (53 / -5)

GlockenspielHero

Ars Scholae Palatinae
687
Subscriptor
Send him an email, and ask him how he likes the stock price now :judge:
:sneaky:

Right now it's up almost 7% for the year.

Sure, it had been up a lot more than that a month ago, but the market doesn't seem to think this is an existential crisis.

/Still can't believe they just rolled the patch out without testing on a real box.
 
Upvote
43 (45 / -2)
The aspect that seems particularly alarming (and which they are not talking about) is that the kernel driver component was apparently willing to accept a malformed update on the basis of nothing but a header, and then keel over and die.

Certainly having an actual testing process would be nice; but (especially when the whole point of your software is that there might be adversarial activity on the system) it seems like a deep and fundamental problem that such a high-privilege/high-criticality component is so brittle against malformed input.

Even the most impeccable testing can only assure you that your inputs won't cause your driver to misbehave; it can't assure you that you will always remain in control of which inputs your driver ends up chewing on.
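One defensive pattern that addresses exactly this: treat every content update as untrusted, and fall back to the last-known-good configuration when parsing fails, rather than letting the failure propagate. A minimal sketch (names and the `parse` callback are hypothetical):

```python
def apply_update(new_blob, current_config, parse):
    """Fail-safe content update: if the new blob doesn't parse cleanly,
    keep running on the last-known-good configuration instead of
    crashing the component that hosts the parser."""
    try:
        return parse(new_blob)
    except Exception:
        # A real agent would also alert and quarantine the bad file here.
        return current_config
```

The trade-off is that a security product running on stale content is quietly less protective, which is presumably why vendors prefer to fail loudly; but "fail loudly" should never mean "boot-loop the host."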
 
Upvote
77 (77 / 0)

sword_9mm

Ars Legatus Legionis
25,727
Subscriptor
How on earth did they not:

1) Have a staggered rollout for such mission-critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content validator that's not actually a real end-to-end test?

It sounds like they test their content updates with a parser. Fine, that's great, but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff. This isn't just some random Node.js module where you aren't necessarily culpable for the downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either of these things alone would've saved millions or billions of customer dollars.

Kind of wonder why they don't test internally first.

Then if it screws up, it's just CrowdStrike's own PCs down, and they can deal with it.
 
Upvote
23 (23 / 0)

barich

Ars Legatus Legionis
10,742
Subscriptor++
Another worrying issue is that if their agent will accept a malformed channel file, how hard would it be to get it to load a phony channel file? We already know its validation is broken.

I was wondering about this, too. Seems like potentially low-hanging fruit for malware authors.

Of course there's plenty of malware around that doesn't do as much damage as CrowdStrike themselves did.
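Authenticating channel files before the driver touches them would close that door. A sketch of the idea using an HMAC as a stand-in (illustration only; a real design would use asymmetric signatures so endpoints hold no signing secret, and the function names here are invented):

```python
import hashlib
import hmac

SIGNING_KEY = b"example-shared-secret"  # placeholder; not a real key scheme

def sign(content: bytes) -> bytes:
    """Produce a MAC over the file contents (stand-in for a signature)."""
    return hmac.new(SIGNING_KEY, content, hashlib.sha256).digest()

def load_channel_file(content: bytes, signature: bytes) -> bytes:
    """Refuse any channel file whose signature doesn't verify, so a
    phony or tampered file never reaches the parser at all."""
    if not hmac.compare_digest(sign(content), signature):
        raise PermissionError("channel file signature mismatch")
    return content
```

Signature checks authenticate the *source* of a file, though, not its *well-formedness*: a signed-but-buggy update (which is what actually shipped) sails straight through, so this complements robust parsing rather than replacing it.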
 
Upvote
36 (36 / 0)

A.Felix

Ars Tribunus Militum
2,655
Subscriptor
Unit tests and validators are tools that help you when you're refactoring code or adding new features. They're there to prevent you from breaking things when you go in and make changes, and to enforce the requirements the code is supposed to be tested against. They're most definitely not intended to be a replacement for actual testing.

This thing had a 100% breakage rate. Do they seriously push out stuff without putting it on even a single actual machine? Because that's very concerning. Hell, even an integration test should've caught this. They say there was a bug in the validator, and sure, I believe that, but did they never run the file against the real parser (or whatever was going to consume it)? It sounds like that's what happened, or they would've noticed the crash. How is this not part of the testing process?
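The gap described here is easy to reproduce in miniature: a shallow validator approves a file that the real consumer chokes on, which is exactly why the shipped artifact has to go through the actual parser in CI. Toy example (both file formats are invented):

```python
def validator(blob: bytes) -> bool:
    # Shallow standalone check: looks at the header only.
    return blob[:4] == b"CHNL"

def real_parser(blob: bytes) -> str:
    # The actual consumer is stricter: it decodes the body too.
    _header, _, body = blob.partition(b":")
    return body.decode("ascii")  # raises on a malformed body

def release_check(blob: bytes) -> str:
    """Run the artifact through the real consumer, not just the validator."""
    assert validator(blob), "validator rejected the file"
    return real_parser(blob)  # a bad file fails here, in CI, not in the field
```

Here `b"CHNL:\xff\xff"` passes the validator but blows up the parser, which is the miniature version of a Template Instance "passing validation despite containing problematic content data."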
 
Upvote
65 (65 / 0)