CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

ranthog · Jul 25, 2024

Ganz said:
It's not always impossible. It's often infeasible (read: not as profitable). But anyway ...

I'm definitely not in a position to roll my eyes and say that it's obvious that tests should have caught this - I have no idea how their system works. You appear to know for sure, which is great - they should fix their dumb tests. I'm down.

All I personally can roll my eyes at is their "everything, everywhere, all at once" deployment model for these Template updates, and fixing that would certainly have caught this problem before it got to paying customers.

This is not an either/or thing, so your assertion that they should fix their obviously stupid tests does not nullify my assertion that their deployment model is bonkers crazy-go-nuts. Because it's bonkers crazy-go-nuts.

I suspect their work involves a lot of things were testing is limited due to the extremely high number of combinations possible.

I am assuming that Crowd Strike is not blatantly lying to us. (This may be a bad assumption on my part.) It really does seem like they are purely relying on this coded check on the update. (These types of analytical checks are critical to testing, but you shouldn't just rely on them.) They aren't admitting to anything else like their development team skipping testing procedures.

So given how broad the problems were, it would be really hard for the final package to have been run without triggering the reboot cycle. That is the type of basic functionality test that being skipped is a sign of some very fundamentally broken engineering practices going on.

Why I don't think deployment is going to fix this is because you need to fix the deeper problem where move fast and break things is acceptable, even when it means breaking millions of customers systems.

I don't think I did a very good job of explaining that, though. I think I was still in shock and horror that they relied on that one test set to test the update.

Deployment needs to be be dialed back down from "test on all the live systems at once" setting. I do think phasing in the deployment likely is going to be a bit harder and less effective with the faster timelines they probably do need for more rapidly deployed updates.

markgo · Jul 25, 2024

ranthog said:
This is also why you usually at least do a test deployment to run the program and do some basic sanity checks, even when your automated tests came back green.

Or stagger your deployment, like responsible developers have been doing since the late 90s.

torp · Jul 25, 2024

But but... I thought it's the EU's fault?

macosandlinux · Jul 25, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

When you know who sails the ship over there, you'll know. Proven track record to wreck havoc.

macosandlinux · Jul 25, 2024

markgo said:
So they admit they deployed worldwide all at once. That has been against best practices for large scale deployments for more than two decades. Sue them into oblivion.

Let the CEO bleed $, too!

macosandlinux · Jul 25, 2024

OtherSystemGuy said:
Let me get this straight. Ignoring the boneheaded deployment issue, a company that sells cyber security software has a sloppy verification process? That seems to belie the very essence of software testing.

Why would I ever trust them?

After this public FU they deserve and should loose all their customers - even if it takes time for them to move to a competitor.

Rosyna · Jul 25, 2024

CrowdStrike has released their report on why the 291*.*32.sys file may contain NULL bytes.

TL;DR it’s an artifact of Windows/NTFS that occurs after the computer crashed. This explains why so many affected people had such different copies of that definition file.

Chuckstar · Jul 25, 2024

ranthog said:
With that, making sure a mirror grinding rig works correctly is much easier.

All it took was a paint chip.

henryhbk · Jul 25, 2024

philipjohnstephens said:
Ummm, shouldn't they be testing the KERNEL DRIVER to ensure that it correctly rejects malformed content? It's simply not good enough to have a "Content Validator" that is intended to spare the kernel driver from being exposed to malformed content that would cause it to crash. Maybe this validator is based on the same code as used in their driver, but if so, I hope they are planning on FIXING the driver and not just the external validator.

Better question is how has this never happened before? I mean yes networking is very reliable and most download code does a good job of recovering bad downloads, but just the shear volume means someone must have gotten a corrupted download in the past. Did their machine bsod and they simply ignored it?

tjukken · Jul 25, 2024

Iguana7250 said:
Good thing I disabled Windows updates since Windows 7, never had an incident.

This has nothing to do with Windows updates, though..

ender78 · Jul 25, 2024

rbirling said:
Ok so is the company liable for any of the downstream damages? (I feel like I know the answer to this.)

As much as it sucks, it's not "reasonable" to expect damages greater than the costs of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence, the biggest liability is that all their customer will leave which will terminate its existance.

torp · Jul 25, 2024

ender78 said:
As much as it sucks, it's not "reasonable" to expect damages greater than the costs of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence, the biggest liability is that all their customer will leave which will terminate its existance.

Maybe in the US: off the top of HN, about France

moongoddess · Jul 25, 2024

Chuckstar said:
I look at this like not having a test of Hubble’s mirror independent of the grinding rig. You’d have to have a pretty stupid error to need such a test.

It’s the classic “if we do it right, the test will be unnecessary”. But that’s true of most tests, so why not just do the easy test to make sure?

The Hubble example you gave is actually a better example than you realize. The Hubble mirror actually WAS tested independently of the grinding rig. But the interferometer used to perform the testing was miscalibrated, and as a result it did not catch the spherical abberation of the mirror.

The results of your test are only as good as your tools! And in CrowdStrike’s case, it is obvious that whatever tools they used to test were not up to the job.

Xyler · Jul 25, 2024

RickRoyLeonPrisZhoraRacha said:
Microsoft is passing blame at EU for its policies...from 2009! Touche mon ami!

Don't really blame Microsoft here, they tried implementing an API that would harden the Kernel against this very issue, and the EU blocked MS from implementing it because they were afraid it would give Microsoft an advantage in the Security Software department.

tjukken · Jul 25, 2024

Xyler said:
Don't really blame Microsoft here, they tried implementing an API that would harden the Kernel against this very issue, and the EU blocked MS from implementing it because they were afraid it would give Microsoft an advantage in the Security Software department.

If Microsoft had implemented an API that all players would have to use, including Microsoft, then the EU would have been satisfied. Because that would have levelled the playing field and no one would have an advantage over another.

alansh42 · Jul 25, 2024

Rosyna said:
CrowdStrike has released their report on why the 291*.*32.sys file may contain NULL bytes.

TL;DR it’s an artifact of Windows/NTFS that occurs after the computer crashed. This explains why so many affected people had such different copies of that definition file.

This just explains why the file on disk may be all zeroes or be partly written, causing the file to be different on different systems. I think they're saying everybody got the same corrupt update, but it was further corrupted due to the crash.

Even apart from corruption issues, tampering with the channel files seems an obvious attack by malware. It should be validating the content and that all are present. Removing the bad file fixes the issue, but it means malware can disable detection by removing good files.

Spudster · Jul 25, 2024

cyberfunk said:
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.

What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.

Agreed. Due to the Windows architecture, it's required for this kind of software.

What's especially indefensible is their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.

Frankly, I'm surprised this behavior is even allowed for a Ring 0 driver.

steelcobra · Jul 25, 2024

Spudster said:
Agreed. Due to the Windows architecture, it's required for this kind of software.

What's especially indefensible is their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.

Frankly, I'm surprised this behavior is even allowed for a Ring 0 driver.

Like with most of these cases, it's because they did something that does an end run around how the API is supposed to normally work.

Errum · Jul 25, 2024

Quis custodiet ipsos custodes?

shoe · Jul 25, 2024

msawzall said:
Money.

What's Money exactly? It's time. It's resources. It's a measurement of risk.

I'm not saying you're wrong but that's like saying money is why it takes me an hour to get to work instead of 5 minutes or 2 days. On one level it's true but on another what you don't know is - am I only taking an hour because that's the most efficient way to get to work? Or am I taking advantage of work by getting there slower then I could? Or taking advantage of home by getting there slower then I could?

My point is that money is a metric. Implying that any time that someone does less or spends less then they otherwise could the reason is nefarious is unfair. It could also be incompetence. It could just as easily be that the way it was done really was the most efficient way possible.

Now in this case it really does seem incompetence. It also seems like Crowdstrike was using it's privileged position on it's client OS's as a lever to improve it's product - and so taking risks that arguably benefited Crowdstrike as much or more then the client. But just saying "money" doesn't tell the whole story.

DeeplyUnconcerned · Jul 25, 2024

alansh42 said:
This just explains why the file on disk may be all zeroes or be partly written, causing the file to be different on different systems. I think they're saying everybody got the same corrupt update, but it was further corrupted due to the crash.

Even apart from corruption issues, tampering with the channel files seems an obvious attack by malware. It should be validating the content and that all are present. Removing the bad file fixes the issue, but it means malware can disable detection by removing good files.

They’re not actually saying anywhere that the update as deployed was corrupt. The phrase they’re using is “problematic content data”, that “passed validation” due to a bug in the validator. The “problematic” part could be corruption, but it could equally just be that it was deployed as intended but that particular configuration causes the parser to crash. As others have noted, the report singles out the bug in the validator but glosses over what is presumably a major bug in the parser. Putting these pieces together, my read is that the file that was deployed “should have worked” from the perspective of whoever set it up, or at worst contained what should’ve been a benign misconfiguration, and the real failures in the intended process were in 1) the parser failing to handle it and 2) the stress test not properly testing all possible values. (You’ll note that the report claims the stress test process “match[es] against any possible value of the associated data fields to identify adverse system interactions”, which obviously didn’t actually happen here as if it’d matched against every possible data value, it would’ve caught this issue. Or, possibly, the stress test caught it but the validator want properly updated to detect the issue; in this case, though, they’re deploying content to a template with a known kernel crash bug, so there’s really no excuse for not being exceptionally careful with every aspect of that template until the crash bug is fixed.)

All of which would of course have been detected with even trivial pre-deployment testing, and mitigated with deployment in stages. (As outlined in the report, this class of update was never intended to be limited by the n-1 mechanism; that’s for code updates, not data updates.) Given that this process is supporting multiple updates every day, I understand the desire to operate in a way that minimises the delay between submitting and deploying widely, but they’ve clearly failed to build the infrastructure needed to support doing so. (For my money, main focuses should be that the driver needs to be way more robust to any possible misconfiguration, and the stress test needs at minimum to test every possible combination permitted by the validator, using the actual validator code to gate the inputs.)

Ganz · Jul 25, 2024

ender78 said:
As much as it sucks, it's not "reasonable" to expect damages greater than the costs of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence, the biggest liability is that all their customer will leave which will terminate its existance.

This is what software companies want you to believe. But when gross negligence enters the chat, I believe disclaimers of fitness for purpose are rendered ineffective as a defense.

shoe · Jul 25, 2024

Errum said:
Quis custodiet ipsos custodes?

How do you say OS in Latin? But I suppose once you're at ring 0 you are the OS. Still, why not ring -1 ?

gosand · Jul 25, 2024

OtherSystemGuy said:
Let me get this straight. Ignoring the boneheaded deployment issue, a company that sells cyber security software has a sloppy verification process? That seems to belie the very essence of software testing.

Why would I ever trust them?

Just to be clear - this seems to be a validation issue, not a verification issue.

Verification is ensuring you are building to specficiation. Validation is ensuring it is fit for use. Think of verification as internal testing, validation is external testing.

Source: me. I spent 20 years in software testing, and testing ALWAYS get the blame for not finding things. In this case, how things are deployed matter as well, and I'm pretty sure this is the news-making facepalm here.

Haravikk · Jul 25, 2024

"Testing bugs" as in "we didn't test it and just pushed a change to production without seeing if it would brick systems first"?

I mean it's not like only some systems were affected here, it seemed like pretty much literally every machine the update was installed to got bricked by it.

There's a reason we're supposed test things first, and do staggered rollouts etc.

real mikeb_60 · Jul 25, 2024

shoe said:
What's Money exactly? It's time. It's resources. It's a measurement of risk.

I'm not saying you're wrong but that's like saying money is why it takes me an hour to get to work instead of 5 minutes or 2 days. On one level it's true but on another what you don't know is - am I only taking an hour because that's the most efficient way to get to work? Or am I taking advantage of work by getting there slower then I could? Or taking advantage of home by getting there slower then I could?

My point is that money is a metric. Implying that any time that someone does less or spends less then they otherwise could the reason is nefarious is unfair. It could also be incompetence. It could just as easily be that the way it was done really was the most efficient way possible.

Now in this case it really does seem incompetence. It also seems like Crowdstrike was using it's privileged position on it's client OS's as a lever to improve it's product - and so taking risks that arguably benefited Crowdstrike as much or more then the client. But just saying "money" doesn't tell the whole story.

It's been said for years: the game is not about the money per se; the money is just a way of keeping score.

ZTransform · Jul 25, 2024

Missed opportunity to change that Red Falcon to a 'Blue Falcon'. Because CrowdStrike surely Blue Falconed a lot of us. We are still finding pieces of our less critical infrastructure that need to be fixed...

If you've served in the military, you know what a Blue Falcon is.

Beckler · Jul 25, 2024

"The biggest change will probably be "a staggered deployment strategy..."

It makes you wonder what other obvious things they will still be overlooking then?

Nilt · Jul 25, 2024

ender78 said:
As much as it sucks, it's not "reasonable" to expect damages greater than the costs of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence, the biggest liability is that all their customer will leave which will terminate its existance.

Yes, because we all know Contracts Are Law.

As has been said multiple times throughout this story's coverage, this is not simple negligence. This is wildly over the edge of gross negligence. Nobody in their right minds thinks it's acceptable to not test an update on at least ONE computer before deploying it to everyone. That's the kind of liability you can't even disclaim in the US, let alone places with better protections for consumers. Regardless of their status as paper people, corporations subscribing to a service count as consumers.

malenisea · Jul 25, 2024

cyberfunk said:
How on earth did they not:
1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

This.

I work in the cybersecurity authentication sector: after committing my production ready code, it still takes a MONTH before reaching the general public. During that month, it will first reside in a live test environment where our QA will hammer on it, along with customers interesting in upcoming features.

Actually that's for server side code on the cloud. Our client side products/apps can each have a different release cadences, but will nevertheless expect to reside in TestFlight or their equivalent for a while.

Despite our best efforts coding and testing, we still expect to find bugs, hence the one month test and monitoring interval. Even then public rollout is staggered in groups. Hey Crowdstrike: bugs are not the failure, it was a process failure.

lost_packet · Jul 25, 2024

Spudster said:
What's especially indefensible is their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.

100% this. Dave is great at put this into terms a non-dev op can understand.

How this was not tested on live systems just blows my mind. Does that mean the end users are the test systems? Hopefully they get sued into oblivion.

launcap · Jul 26, 2024

cyberfunk said:
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.

You missed two words: "on Windows".

Apple had the same directive from the EU and, rather than taking the path of least work, developed a mechanism so that AV clients don't have to run at ring 0.

Microsoft could have done so but chose not to.

launcap · Jul 26, 2024

psko said:
I've just heard that Microsoft wanted to provide such APIs for AV companies and apparently EU blocked the idea as it could harm competition

Nope. The EU required Microsoft (and Apple) to give AV vendors access to the same API as they use. Apple wrote a system so that AV didn't have to be ring0, Microsoft didn't.

cyberfunk · Jul 26, 2024

launcap said:
You missed two words: "on Windows".

Apple had the same directive from the EU and, rather than taking the path of least work, developed a mechanism so that AV clients don't have to run at ring 0.

Microsoft could have done so but chose not to.

Maybe technically correct but, pragmatically useless. You do realize that windows is in the business of supporting lots of legacy compatibility things right ?

The likelihood of suddenly changing the kernel model at this point is just not really reasonable without an absolutely massive amount of testing and inevitable issues.

Apple can afford to make compatibility breaks, Windows really can’t because of their market share and customer base.

steelcobra · Jul 26, 2024

launcap said:
Nope. The EU required Microsoft (and Apple) to give AV vendors access to the same API as they use. Apple wrote a system so that AV didn't have to be ring0, Microsoft didn't.

Apple also isn't as big of a target for malware creators, Microsoft likely saw critical value in being able to run Defender tasks in Ring 0 as a defense against that.

Great_Scott · Jul 26, 2024

Ganz said:
They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.

You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.

The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what Crowdstrike were missing with their so-called "Rapid Response" updates, and what they pledge to do in the future.

When I claimed that "they weren't doing any testing" in my previous post, I was implying that relying entirely on a automated testing routine isn't effective on its own.

From what I recall of all the articles I've read on the update/outage, Crowdstrike only ever directly referenced their one utility when going over the test procedures. Hopefully I'm wrong and they're being vague for security reasons, and I'd bet there's more of a focus on test procedures now.

As you mentioned, at the very least they should have been doing incremental updates. Even something as simple as "updating our own corporate resources before rolling out to customers" would help.

stux · Jul 28, 2024

So, no testing. No canary deployment. Got it.

If they were using staged deployments the first stage would’ve been their own test machines…

JasonRoy · Aug 21, 2024

they could have used ghostqa.com for automated testing. They even offer first month free.

CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

Ars Legatus Legionis

Ars Praefectus

Ars Praefectus

Ars Praefectus

Ars Praefectus

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Tribunus Militum

Ars Praefectus

Ars Tribunus Militum

Ars Praefectus

Ars Praetorian

Ars Scholae Palatinae

Ars Praefectus

Ars Praefectus

Seniorius Lurkius

Ars Tribunus Angusticlavius

Ars Tribunus Militum

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Praefectus

Ars Tribunus Angusticlavius

Wise, Aged Ars Veteran

Ars Praetorian

Ars Legatus Legionis

Seniorius Lurkius

Ars Praefectus

Ars Tribunus Militum

Ars Tribunus Militum

Ars Scholae Palatinae

Ars Tribunus Angusticlavius

Ars Tribunus Militum

Ars Scholae Palatinae

Smack-Fu Master, in training