CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

ranthog

Ars Legatus Legionis
15,240
It's not always impossible. It's often infeasible (read: not as profitable). But anyway ...

I'm definitely not in a position to roll my eyes and say that it's obvious that tests should have caught this - I have no idea how their system works. You appear to know for sure, which is great - they should fix their dumb tests. I'm down.

All I personally can roll my eyes at is their "everything, everywhere, all at once" deployment model for these Template updates, and fixing that would certainly have caught this problem before it got to paying customers.

This is not an either/or thing, so your assertion that they should fix their obviously stupid tests does not nullify my assertion that their deployment model is bonkers crazy-go-nuts. Because it's bonkers crazy-go-nuts.
I suspect their work involves a lot of things where testing is limited due to the extremely high number of possible combinations.

I am assuming that CrowdStrike is not blatantly lying to us. (This may be a bad assumption on my part.) It really does seem like they were relying purely on this coded check of the update. (These types of analytical checks are critical to testing, but you shouldn't rely on them alone.) They aren't admitting to anything else, like their development team skipping testing procedures.

So given how broad the problems were, it would have been really hard for the final package to be run at all without triggering the reboot cycle. Skipping that kind of basic functionality test is a sign of some fundamentally broken engineering practices.

The reason I don't think fixing deployment alone will solve this is that you need to fix the deeper problem: a culture where "move fast and break things" is acceptable, even when it means breaking millions of customers' systems.

I don't think I did a very good job of explaining that, though. I think I was still in shock and horror that they relied on that one test set to test the update.

Deployment needs to be dialed back from the "test on all the live systems at once" setting. I do think phasing in deployments will likely be a bit harder and less effective given the faster timelines they probably need for rapidly deployed updates.
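The kind of phased deployment being argued for here is simple to sketch. The ring fractions and failure threshold below are invented for illustration, not anything from CrowdStrike's process, and `deploy_and_check` stands in for whatever health check a real fleet would run:

```python
import random

def staged_rollout(hosts, deploy_and_check, rings=(0.001, 0.01, 0.1, 1.0),
                   max_failure_rate=0.01):
    """Deploy to progressively larger fractions of the fleet, halting
    the moment a ring's observed failure rate exceeds the threshold."""
    hosts = list(hosts)
    random.shuffle(hosts)  # each ring is a random sample of the fleet
    deployed = 0
    for fraction in rings:
        target = int(len(hosts) * fraction)
        ring = hosts[deployed:target]
        failures = sum(0 if deploy_and_check(h) else 1 for h in ring)
        deployed = target
        if ring and failures / len(ring) > max_failure_rate:
            return f"halted at the {fraction:.1%} ring"
    return "fully deployed"
```

The point is that a bad update which bricks everything it touches stops after the first tiny ring, instead of after the whole fleet.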
 
Upvote
5 (5 / 0)
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine, that's great, but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff. This isn't just some random Node.js module where you aren't necessarily culpable for the downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Fixing either one of these things would've saved customers millions or billions of dollars.
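Point 2 above can be sketched as a pre-release gate: run the real consumer against the exact release artifact in an isolated process (standing in for a live test VM), and block the release on any crash or hang. Everything here is a hypothetical illustration, not CrowdStrike's actual pipeline:

```python
import os
import subprocess
import sys
import tempfile

def e2e_gate(artifact: bytes, consumer_script: str, timeout_s: int = 60) -> bool:
    """Feed the exact artifact to the real consumer in a child process.
    A crash kills the test process instead of a customer machine, and
    blocks the release (returns False)."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(artifact)
        path = f.name
    try:
        result = subprocess.run([sys.executable, consumer_script, path],
                                timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # a hang is also a release-blocking failure
    finally:
        os.unlink(path)
```

The child process boundary is the whole trick: it mirrors how the kernel driver takes down a machine, but contained where it can't hurt anyone.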
When you know who sails the ship over there, you'll know. Proven track record of wreaking havoc.
 
Upvote
2 (2 / 0)
Let me get this straight. Ignoring the boneheaded deployment issue, a company that sells cyber security software has a sloppy verification process? That seems to belie the very essence of software testing.

Why would I ever trust them?
After this public FU they deserve to lose, and should lose, all their customers - even if it takes time for them to move to a competitor.
 
Upvote
2 (2 / 0)

henryhbk

Ars Tribunus Militum
1,952
Subscriptor++
Ummm, shouldn't they be testing the KERNEL DRIVER to ensure that it correctly rejects malformed content? It's simply not good enough to have a "Content Validator" that is intended to spare the kernel driver from being exposed to malformed content that would cause it to crash. Maybe this validator is based on the same code as used in their driver, but if so, I hope they are planning on FIXING the driver and not just the external validator.
Better question: how has this never happened before? I mean, yes, networking is very reliable and most download code does a good job of recovering from bad downloads, but the sheer volume means someone must have gotten a corrupted download in the past. Did their machine BSOD and they simply ignored it?
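The driver-side rejection being asked for is cheap to do. This sketch uses an entirely made-up channel-file layout (magic number, version, field count) just to show the shape of defensive parsing: every count and length is checked before it's used, so a truncated or all-zero file becomes an error return instead of a kernel crash:

```python
import struct

MAGIC = 0xAA55       # hypothetical file-format magic
MAX_FIELDS = 21      # hypothetical upper bound on fields per entry

def parse_channel_file(blob: bytes):
    """Defensively parse a hypothetical channel-file layout, rejecting
    anything malformed rather than trusting the producer."""
    if len(blob) < 8:
        raise ValueError("truncated header")
    magic, version, field_count = struct.unpack_from("<HHI", blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if field_count == 0 or field_count > MAX_FIELDS:
        raise ValueError(f"implausible field count {field_count}")
    if len(blob) < 8 + field_count * 4:
        raise ValueError("body shorter than declared field count")
    return list(struct.unpack_from(f"<{field_count}I", blob, 8))
```

Note that the all-zeros file everyone found on disk would fail the very first magic check here, never reaching the code that indexes into the fields.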
 
Upvote
-1 (2 / -3)

ender78

Ars Tribunus Militum
1,881
Subscriptor
Ok so is the company liable for any of the downstream damages? (I feel like I know the answer to this.)

As much as it sucks, it's not "reasonable" to expect damages greater than the cost of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence; the bigger liability is that all their customers will leave, which would end the company's existence.
 
Upvote
-5 (0 / -5)

torp

Ars Praefectus
3,369
Subscriptor
As much as it sucks, it's not "reasonable" to expect damages greater than the cost of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence; the bigger liability is that all their customers will leave, which would end the company's existence.

Maybe in the US: off the top of HN, about France
 
Upvote
5 (5 / 0)

moongoddess

Ars Praetorian
537
Subscriptor++
I look at this like not having a test of Hubble’s mirror independent of the grinding rig. You’d have to have a pretty stupid error to need such a test. ;)

It’s the classic “if we do it right, the test will be unnecessary”. But that’s true of most tests, so why not just do the easy test to make sure?

The Hubble example you gave is actually a better one than you realize. The Hubble mirror actually WAS tested independently of the grinding rig. But the interferometer used to perform the testing was miscalibrated, and as a result it did not catch the spherical aberration of the mirror.

The results of your test are only as good as your tools! And in CrowdStrike’s case, it is obvious that whatever tools they used to test were not up to the job.
 
Upvote
8 (8 / 0)

Xyler

Ars Scholae Palatinae
1,357
Microsoft is passing the blame to the EU for its policies... from 2009! Touché, mon ami!
Don't really blame Microsoft here, they tried implementing an API that would harden the Kernel against this very issue, and the EU blocked MS from implementing it because they were afraid it would give Microsoft an advantage in the Security Software department.
 
Upvote
-5 (6 / -11)

tjukken

Ars Praefectus
4,004
Subscriptor
Don't really blame Microsoft here, they tried implementing an API that would harden the Kernel against this very issue, and the EU blocked MS from implementing it because they were afraid it would give Microsoft an advantage in the Security Software department.
If Microsoft had implemented an API that all players would have to use, including Microsoft, then the EU would have been satisfied. Because that would have levelled the playing field and no one would have an advantage over another.
 
Upvote
9 (9 / 0)

alansh42

Ars Praefectus
3,597
Subscriptor++
CrowdStrike has released their report on why the 291*.32.sys file may contain NULL bytes.

TL;DR: it's an artifact of Windows/NTFS that occurs after the computer crashes. This explains why so many affected people had such different copies of that definition file.
This just explains why the file on disk may be all zeroes or be partly written, causing the file to be different on different systems. I think they're saying everybody got the same corrupt update, but it was further corrupted due to the crash.

Even apart from corruption issues, tampering with the channel files seems an obvious attack vector for malware. The sensor should be validating the files' content and that all of them are present. Removing the bad file fixes the issue, but it means malware could disable detection by removing good files.
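Presence-and-integrity checking of the kind being suggested is a small amount of code. The manifest format and filename below are hypothetical; in practice the manifest itself would need to be signed so malware can't just rewrite it:

```python
import hashlib

def verify_channel_files(manifest, files):
    """Check that every channel file named in a trusted manifest is
    present and matches its expected SHA-256, so both deletion and
    tampering are detected rather than silently tolerated.

    manifest: {filename: expected sha256 hex digest}
    files:    {filename: file contents as bytes}
    """
    problems = []
    for name, expected in manifest.items():
        blob = files.get(name)
        if blob is None:
            problems.append(f"missing: {name}")
        elif hashlib.sha256(blob).hexdigest() != expected:
            problems.append(f"modified: {name}")
    return problems
```

With something like this in place, "just delete the bad file" would trip the same alarm as malware deleting a good one, which is exactly the point being made above.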
 
Upvote
10 (10 / 0)

Spudster

Seniorius Lurkius
16
Subscriptor
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.

What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.
Agreed. Due to the Windows architecture, it's required for this kind of software.

What's especially indefensible is their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.

Frankly, I'm surprised this behavior is even allowed for a Ring 0 driver.
 
Upvote
10 (10 / 0)

steelcobra

Ars Tribunus Angusticlavius
9,774
Agreed. Due to the Windows architecture, it's required for this kind of software.

What's especially indefensible is their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.

Frankly, I'm surprised this behavior is even allowed for a Ring 0 driver.
Like with most of these cases, it's because they did something that does an end run around how the API is supposed to normally work.
 
Upvote
0 (0 / 0)

shoe

Ars Scholae Palatinae
1,021
Subscriptor
What's Money exactly? It's time. It's resources. It's a measurement of risk.

I'm not saying you're wrong, but that's like saying money is why it takes me an hour to get to work instead of 5 minutes or 2 days. On one level it's true, but on another, what you don't know is: am I only taking an hour because that's the most efficient way to get to work? Or am I taking advantage of work by getting there slower than I could? Or taking advantage of home by getting home slower than I could?

My point is that money is a metric. Implying that any time someone does less or spends less than they otherwise could, the reason is nefarious, is unfair. It could also be incompetence. It could just as easily be that the way it was done really was the most efficient way possible.

Now in this case it really does seem like incompetence. It also seems like Crowdstrike was using its privileged position on its clients' OSes as a lever to improve its product, and so taking risks that arguably benefited Crowdstrike as much as or more than the client. But just saying "money" doesn't tell the whole story.
 
Upvote
4 (4 / 0)

DeeplyUnconcerned

Ars Scholae Palatinae
1,017
Subscriptor++
This just explains why the file on disk may be all zeroes or be partly written, causing the file to be different on different systems. I think they're saying everybody got the same corrupt update, but it was further corrupted due to the crash.

Even apart from corruption issues, tampering with the channel files seems an obvious attack by malware. It should be validating the content and that all are present. Removing the bad file fixes the issue, but it means malware can disable detection by removing good files.
They’re not actually saying anywhere that the update as deployed was corrupt. The phrase they’re using is “problematic content data” that “passed validation” due to a bug in the validator. The “problematic” part could be corruption, but it could equally be that the file was deployed as intended and that particular configuration causes the parser to crash. As others have noted, the report singles out the bug in the validator but glosses over what is presumably a major bug in the parser.

Putting these pieces together, my read is that the file that was deployed “should have worked” from the perspective of whoever set it up, or at worst contained what should’ve been a benign misconfiguration, and the real failures in the intended process were in 1) the parser failing to handle it and 2) the stress test not properly testing all possible values.

(You’ll note that the report claims the stress test process “match[es] against any possible value of the associated data fields to identify adverse system interactions”, which obviously didn’t actually happen here; if it had matched against every possible data value, it would’ve caught this issue. Or, possibly, the stress test caught it but the validator wasn’t properly updated to detect the issue. In that case, though, they were deploying content to a template with a known kernel crash bug, so there’s really no excuse for not being exceptionally careful with every aspect of that template until the crash bug was fixed.)

All of which would of course have been detected with even trivial pre-deployment testing, and mitigated with deployment in stages. (As outlined in the report, this class of update was never intended to be limited by the n-1 mechanism; that’s for code updates, not data updates.) Given that this process is supporting multiple updates every day, I understand the desire to operate in a way that minimises the delay between submitting and deploying widely, but they’ve clearly failed to build the infrastructure needed to support doing so. (For my money, main focuses should be that the driver needs to be way more robust to any possible misconfiguration, and the stress test needs at minimum to test every possible combination permitted by the validator, using the actual validator code to gate the inputs.)
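The "test every combination the validator permits, using the actual validator to gate the inputs" idea above can be sketched like this. For real field spaces too large to enumerate, fuzzing or property-based testing would stand in for the brute-force product, but the structure is the same:

```python
from itertools import product

def exhaustive_field_test(field_domains, validator, parser):
    """Run the parser on every field combination the validator would
    accept for deployment; any exception here is a bug that would
    otherwise ship to the fleet."""
    failures = []
    for combo in product(*field_domains):
        if not validator(combo):
            continue  # the validator gates what can be deployed
        try:
            parser(combo)
        except Exception as exc:
            failures.append((combo, exc))
    return failures
```

Using the real validator as the gate is the key design choice: it guarantees the stress test covers exactly the input space that can actually reach production, no more and no less.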
 
Upvote
5 (5 / 0)

Ganz

Ars Scholae Palatinae
757
As much as it sucks, it's not "reasonable" to expect damages greater than the cost of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence; the bigger liability is that all their customers will leave, which would end the company's existence.
This is what software companies want you to believe. But when gross negligence enters the chat, I believe disclaimers of fitness for purpose are rendered ineffective as a defense.
 
Upvote
6 (6 / 0)

gosand

Ars Tribunus Militum
1,654
Let me get this straight. Ignoring the boneheaded deployment issue, a company that sells cyber security software has a sloppy verification process? That seems to belie the very essence of software testing.

Why would I ever trust them?
Just to be clear - this seems to be a validation issue, not a verification issue.

Verification is ensuring you are building to specification. Validation is ensuring it is fit for use. Think of verification as internal testing and validation as external testing.

Source: me. I spent 20 years in software testing, and testing ALWAYS gets the blame for not finding things. In this case, how things are deployed matters as well, and I'm pretty sure that's the news-making facepalm here.
 
Upvote
0 (1 / -1)
"Testing bugs" as in "we didn't test it and just pushed a change to production without seeing if it would brick systems first"?

I mean, it's not like only some systems were affected here; it seemed like pretty much literally every machine the update was installed on got bricked by it.

There's a reason we're supposed to test things first, and do staggered rollouts, etc.
 
Last edited:
Upvote
4 (4 / 0)

real mikeb_60

Ars Tribunus Angusticlavius
13,002
Subscriptor
What's Money exactly? It's time. It's resources. It's a measurement of risk.

I'm not saying you're wrong, but that's like saying money is why it takes me an hour to get to work instead of 5 minutes or 2 days. On one level it's true, but on another, what you don't know is: am I only taking an hour because that's the most efficient way to get to work? Or am I taking advantage of work by getting there slower than I could? Or taking advantage of home by getting home slower than I could?

My point is that money is a metric. Implying that any time someone does less or spends less than they otherwise could, the reason is nefarious, is unfair. It could also be incompetence. It could just as easily be that the way it was done really was the most efficient way possible.

Now in this case it really does seem like incompetence. It also seems like Crowdstrike was using its privileged position on its clients' OSes as a lever to improve its product, and so taking risks that arguably benefited Crowdstrike as much as or more than the client. But just saying "money" doesn't tell the whole story.
It's been said for years: the game is not about the money per se; the money is just a way of keeping score.
 
Upvote
5 (5 / 0)

Nilt

Ars Legatus Legionis
21,810
Subscriptor++
As much as it sucks, it's not "reasonable" to expect damages greater than the cost of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence; the bigger liability is that all their customers will leave, which would end the company's existence.
Yes, because we all know Contracts Are Law. :rolleyes:

As has been said multiple times throughout this story's coverage, this is not simple negligence. This is wildly over the edge of gross negligence. Nobody in their right minds thinks it's acceptable to not test an update on at least ONE computer before deploying it to everyone. That's the kind of liability you can't even disclaim in the US, let alone places with better protections for consumers. Regardless of their status as paper people, corporations subscribing to a service count as consumers.
 
Upvote
6 (6 / 0)

malenisea

Seniorius Lurkius
22
Subscriptor
How on earth did they not:
1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?
This.

I work in the cybersecurity authentication sector: after I commit my production-ready code, it still takes a MONTH before it reaches the general public. During that month, it first resides in a live test environment where our QA will hammer on it, along with customers interested in upcoming features.

Actually, that's for server-side code in the cloud. Our client-side products/apps can each have a different release cadence, but they're nevertheless expected to sit in TestFlight or its equivalent for a while.

Despite our best coding and testing efforts, we still expect to find bugs, hence the one-month test and monitoring interval. Even then, public rollout is staggered in groups. Hey CrowdStrike: bugs are not the failure; this was a process failure.
 
Upvote
6 (7 / -1)
What's especially indefensible is their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.
100% this. Dave is great at putting this into terms a non-dev can understand.

How this was not tested on live systems just blows my mind. Does that mean the end users are the test systems? Hopefully they get sued into oblivion.
 
Upvote
1 (2 / -1)

launcap

Ars Tribunus Militum
1,778
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.

You missed two words: "on Windows".

Apple had the same directive from the EU and, rather than taking the path of least work, developed a mechanism so that AV clients don't have to run at ring 0.

Microsoft could have done so but chose not to.
 
Upvote
-6 (1 / -7)

launcap

Ars Tribunus Militum
1,778
I've just heard that Microsoft wanted to provide such APIs for AV companies and apparently EU blocked the idea as it could harm competition

Nope. The EU required Microsoft (and Apple) to give AV vendors access to the same API as they use. Apple wrote a system so that AV didn't have to be ring0, Microsoft didn't.
 
Upvote
4 (4 / 0)

cyberfunk

Ars Scholae Palatinae
1,400
You missed two words: "on Windows".

Apple had the same directive from the EU and, rather than taking the path of least work, developed a mechanism so that AV clients don't have to run at ring 0.

Microsoft could have done so but chose not to.
Maybe technically correct, but pragmatically useless. You do realize that Windows is in the business of supporting lots of legacy compatibility, right?

Suddenly changing the kernel model at this point just isn't reasonable without an absolutely massive amount of testing, and inevitable issues along the way.

Apple can afford to make compatibility breaks, Windows really can’t because of their market share and customer base.
 
Upvote
3 (5 / -2)

steelcobra

Ars Tribunus Angusticlavius
9,774
Nope. The EU required Microsoft (and Apple) to give AV vendors access to the same API as they use. Apple wrote a system so that AV didn't have to be ring0, Microsoft didn't.
Apple also isn't as big of a target for malware creators; Microsoft likely saw critical value in being able to run Defender tasks in Ring 0 as a defense against that.
 
Upvote
-5 (0 / -5)

Great_Scott

Ars Tribunus Militum
2,266
Subscriptor
They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.

You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.

The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what Crowdstrike were missing with their so-called "Rapid Response" updates, and what they pledge to do in the future.
When I claimed that "they weren't doing any testing" in my previous post, I was implying that relying entirely on an automated testing routine isn't effective on its own.

From what I recall of all the articles I've read on the update/outage, CrowdStrike only ever directly referenced their one utility when going over the test procedures. Hopefully I'm wrong and they're being vague for security reasons, and I'd bet there's more of a focus on test procedures now.

As you mentioned, at the very least they should have been doing incremental updates. Even something as simple as "updating our own corporate resources before rolling out to customers" would help.
 
Upvote
4 (4 / 0)