CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

elf-stone

Smack-Fu Master, in training
97
Swiss cheese theory of holes aligning, a bad template instance shouldn't be able to hose the core sensor, but a second bug might make that possible (especially if broken content is typically weeded out at the validation phase).

https://en.m.wikipedia.org/wiki/Swiss_cheese_model
Nah, it was a comically trivial failure: just a bad file, broken validation code, and a lack of testing. Swiss cheese model? There are more holes than cheese.
 
Upvote
56 (56 / 0)
Are there already organisations that will pivot to an A/B setup with two different providers for endpoint security?

Considering that this software runs at kernel level, considering that the operational impact can be huge, and considering that this isn't the first time such a thing has happened?

A/B might also cover the case where the security software doesn’t protect (yet) but competing software does.
 
Upvote
7 (7 / 0)

Necranom

Wise, Aged Ars Veteran
130
Subscriptor++
At some point, there was an MBA (or 12) in the decision making matrix. I promise you they determined it would save a few bucks to ONLY use automated unit tests and cut out the actual test deployments to actual systems.

That lab and the staff to run and maintain it would have cost them a few hundred thousand dollars a year!!! Can't absorb those kinds of operating costs in a multi billion dollar company.... what would the almighty shareholders say???
 
Upvote
56 (58 / -2)

azazel1024

Ars Legatus Legionis
15,020
Subscriptor
So they admit they deployed worldwide all at once. That has been against best practices for large scale deployments for more than two decades. Sue them into oblivion.
Fun little anecdote about CrowdStrike's CEO George Kurtz.

In October 2009, McAfee promoted him to chief technology officer and executive vice president.[13] Six months later, McAfee accidentally disrupted its customers' operations around the world when it pushed out a software update that deleted critical Windows XP system files and caused affected systems to bluescreen and enter a boot loop. "I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today," Ed Bott wrote at ZDNet.[6]

Pulled from the Wiki article about him, but verified through ZDNet article and a couple of others I poked at.

So, not his first rodeo of insufficient testing and bad practices in a group he is leading...
 
Upvote
94 (94 / 0)

Dragonmaster Lou

Ars Scholae Palatinae
661
Subscriptor
With respect to a couple of comments here, is it even possible to have written this in a memory-safe language? This runs on Windows as a kernel driver, and the Windows kernel is pretty much entirely written in C. I haven't written Windows kernel code in 20+ years, so I could be way out of date, but I don't think you can write a Windows kernel driver in Rust or any other memory-safe language, at least not yet.

Second, while the crash was caused by dereferencing a bad pointer (from the excellent crash dump analysis done by David Plummer, it looks like they were adding an offset to a null pointer and dereferencing that), I'm not sure a memory safe language would've necessarily solved the problem here. It's entirely possible that since the file was invalid/corrupt, it may have triggered some other bad behavior even if a memory safe language prevented them from messing with the bad pointer.

Also, everyone, including CrowdStrike, seems to be missing the most important part: your driver itself should validate its input before doing anything with it. The kernel mode component that loads this should do its own validation prior to doing anything else with the file. I don't care what kinds of validation tools you run back in the home base before it's distributed. Any number of bad things could go wrong between when you tested it and when it gets out to a customer's system. In the end, it's up to the software running on the customer's system to validate the data and do something like log the error instead of blue screening the system. Now, their response says that they plan on "improving" the validation done in the "Content Interpreter." Frankly, it looks like whatever validation they did there was nowhere near up to snuff, and I'm not going to hold my breath that they're going to do it right this time.
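Dragonmaster Lou's point about in-driver validation can be sketched concretely. Below is a minimal Python stand-in for what the kernel-side C code should do; the `CHNL` magic bytes and header layout are invented for illustration and are not CrowdStrike's actual channel-file format:

```python
import struct

MAGIC = b"CHNL"                    # hypothetical magic bytes for a channel file
HEADER = struct.Struct("<4sI")     # magic + declared payload length (8 bytes total)

def load_channel_file(blob: bytes):
    """Validate a content blob before interpreting it; never trust the pipeline.

    Returns the payload on success, or None (log-and-continue) on any
    malformed input, instead of letting a parser walk off a bad pointer.
    """
    if len(blob) < HEADER.size:
        return None                # truncated: too short to even hold a header
    magic, length = HEADER.unpack_from(blob)
    if magic != MAGIC:
        return None                # wrong file type or corrupted header
    if length != len(blob) - HEADER.size:
        return None                # declared size disagrees with reality
    return blob[HEADER.size:]
```

The key property: a file of all zeros, or any corrupted blob, comes back as a logged rejection rather than a dereference of garbage.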
 
Upvote
65 (67 / -2)
Are there already organisations that will pivot to an A/B setup with two different providers for endpoint security?

Considering that this software runs at kernel level, considering that the operational impact can be huge, and considering that this isn't the first time such a thing has happened?

A/B might also cover the case where the security software doesn’t protect (yet) but competing software does.

I wouldn't be at all surprised to see some vendor switching, and perhaps some segregation of systems along functional lines (so that you don't, say, lose your check-in kiosks and your SQL servers to the same issue); but (at least if one takes the claims of EDR vendors remotely seriously) just slotting in a mix of systems is going to be really, really tricky.

While they do typically do classical antivirus stuff, quarantining the low-effort known malicious stuff that comes in, the fancy special sauce you are paying for is mostly anomaly detection and correlation (across endpoints, and often with other sources, e.g. Palo Alto's integration of Cortex endpoint with Prisma Access IDS) that intends to detect threatening or suspect behavior even in the absence of anything for which detection signatures exist.

If you start going with different systems on different endpoints you cut down the pool across which any one can detect anomalies; and, while any credible product will support SIEM integration, it won't necessarily support it at the granularity of "literally every fiddly little thing the internal model chews on"; and even if it does it's still then up to you/your SIEM vendor to draw the correlations across signals coming in from different endpoint sensors; which is not a trivial operation.
 
Upvote
12 (12 / 0)

sword_9mm

Ars Legatus Legionis
25,726
Subscriptor
At some point, there was an MBA (or 12) in the decision making matrix. I promise you they determined it would save a few bucks to ONLY use automated unit tests and cut out the actual test deployments to actual systems.

That lab and the staff to run and maintain it would have cost them a few hundred thousand dollars a year!!! Can't absorb those kinds of operating costs in a multi billion dollar company.... what would the almighty shareholders say???
We have a customer now that's trying to automate all QA to save money.

I chuckle. Whatever. At least I'm not dealing with their idiot asses.
 
Upvote
35 (35 / 0)
At some point, there was an MBA (or 12) in the decision making matrix. I promise you they determined it would save a few bucks to ONLY use automated unit tests and cut out the actual test deployments to actual systems.

That lab and the staff to run and maintain it would have cost them a few hundred thousand dollars a year!!! Can't absorb those kinds of operating costs in a multi billion dollar company.... what would the almighty shareholders say???
QA/QC doesn't return tangible value on the quarterly report so it must be a waste.
 
Upvote
44 (44 / 0)

afidel

Ars Legatus Legionis
18,164
Subscriptor
I'd be happy if they allowed you to control which update gets applied to a system, so rapid response updates go immediately to QA and then to prod a day later. Applying updates at midnight to QA and 11:59 PM to prod might work to minimize the chance of bugs impacting most systems, but with prod spread over almost every time zone, it might not.
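That ring-delay idea can be stated concretely; a minimal sketch, with the ring names and the 24-hour soak period as assumptions rather than anything CrowdStrike offers:

```python
from datetime import datetime, timedelta

# Hypothetical ring delays: QA machines get rapid-response content
# immediately, prod follows after a soak period.
RING_DELAYS = {"qa": timedelta(0), "prod": timedelta(hours=24)}

def deploy_time(published: datetime, ring: str) -> datetime:
    """When a given ring should apply an update published at `published`."""
    return published + RING_DELAYS[ring]
```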
 
Upvote
6 (6 / 0)

Ganz

Ars Scholae Palatinae
757
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either one of these things would've saved millions or billions of customer dollars.
From the response:

Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
They should have been doing this already. This is what the lawsuits should hinge on.
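For reference, the canary pattern that quote describes is usually implemented by hashing each device into a stable percentage bucket, then raising the cutoff as the rollout ramps. A minimal sketch; the hashing scheme and ramp schedule are assumptions, not CrowdStrike's implementation:

```python
import hashlib

def rollout_bucket(device_id: str) -> int:
    """Map a device to a stable bucket in [0, 100) via a hash, so the
    same machines always land in the same ring across ramp steps."""
    digest = hashlib.sha256(device_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def should_receive(device_id: str, rollout_percent: int) -> bool:
    """A device gets the update once the staged rollout reaches its bucket."""
    return rollout_bucket(device_id) < rollout_percent

# Typical ramp: 1% canary -> 10% -> 50% -> 100%, halting the ramp
# if the canary cohort starts reporting crashes.
```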
 
Upvote
26 (26 / 0)

Tomcat From Mars

Ars Centurion
273
Subscriptor
tldr "we'll prevent this by using bog standard industry practices that are literally taught in schools."

Then they give out $10 Uber Eats gift cards as a "thank you" and then even those don't work because they themselves cancelled them after issuance. Can't even roll out a fucking gift card.

Just astonishing.


Edit: a little anecdote from me, I recently interviewed there for an infrastructure/platform engineering position. Development pipeline was literally something my team would own (for that specific area/team not whole company). I went through the full round and got no offer, but it was a huge red flag to me that the hiring manager, when asked why he liked working at CrowdStrike, wouldn't shut up about the "stock price." He seemed not very interested in much beyond the fact that it was "fast growing." I suspect many managers were there for the IPO and don't give a fuck as long as they get their payout. To be fair it was an overall good interview process with no BS and people I know do like working there. But still...
I'm waiting to get some more details on the Uber Eats thing. So far there is no official statement, just a few people reporting it on Twitter. To me it smells like a joke or a scam.
 
Upvote
15 (15 / 0)

gruberduber

Wise, Aged Ars Veteran
149
Fun little anecdote about CrowdStrike's CEO George Kurtz.

In October 2009, McAfee promoted him to chief technology officer and executive vice president.[13] Six months later, McAfee accidentally disrupted its customers' operations around the world when it pushed out a software update that deleted critical Windows XP system files and caused affected systems to bluescreen and enter a boot loop. "I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today," Ed Bott wrote at ZDNet.[6]

Pulled from the Wiki article about him, but verified through ZDNet article and a couple of others I poked at.

So, not his first rodeo of insufficient testing and bad practices in a group he is leading...
Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?

If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another c-suite job at a different company, and do it again. Repeat until you retire.

Look at the resumes of half these CEOs etc... and it's a trail of failure. But it's never them that suffers for it. In any sane world the consequence of failure should be higher if you get paid millions because you're supposed to be so special and important.

How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.
 
Upvote
58 (58 / 0)

drewcoo

Wise, Aged Ars Veteran
134
Yea, you can't just rely on unit tests and validators and sub-module checks, etc. They are good practice but are insufficient to ship. Seems like someone didn't learn that very important lesson in their coding bootcamp.

The embarrassing thing ? I'm a fucking product manager who doesn't write a line of code and I know this. It's so basic that the "dumb product guys" who don't understand all the details of engineering devops get it.
I read this more as "they're blaming the testing teams."
Which makes sense, considering they're generally hired to take the blame when something goes wrong.

So we have situations like this where the single point of failure is clearly the team hired to take the blame for failures. /s
 
Upvote
8 (8 / 0)

Dark Pumpkin

Ars Scholae Palatinae
1,187
to allow its software to "gather telemetry on possible novel threat techniques."
deployed a broadening data collection update at midnight to all devices... this deserves a deeper dive as well.

Here's the Deep Dive analysis of what that means:

This is something anti-virus programs already do to help them detect new threats that haven't been entered into a virus database yet. This wasn't an update to add that capability to CrowdStrike, but rather to update the code involved in that capability.
 
Upvote
8 (9 / -1)

GrumpyExSpaceDude

Smack-Fu Master, in training
93
Ok so is the company liable for any of the downstream damages? (I feel like I know the answer to this.)
This very interesting post goes into the terms of service and basically concludes that "we told you not to use this software on critical systems and if you did, it's on you"

"THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."

https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html
 
Upvote
18 (19 / -1)

SGJ

Ars Praetorian
519
Subscriptor++
...

Even the most impeccable testing can only assure you that your inputs won't cause your driver to misbehave; they can't assure you that you will always remain in control of which inputs your driver ends up chewing on.
True, but fuzzing would have greatly increased the likelihood of finding the problem before it caused global chaos. They have accepted this, as they are now promising to do it in the future.
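Even a crude mutation fuzzer tends to surface crash-on-malformed-input bugs quickly. A toy sketch, where `parse` is a made-up stand-in for the real content interpreter:

```python
import random

def parse(blob: bytes) -> bool:
    """Stand-in for the interpreter under test: accepts blobs that start
    with a 4-byte magic and are at least 8 bytes long."""
    return len(blob) >= 8 and blob[:4] == b"CHNL"

def fuzz(parse_fn, seed_blob: bytes, iterations: int = 1000) -> int:
    """Dumb mutation fuzzer: flip random bytes in a known-good input and
    verify the parser rejects or accepts without raising. Returns the
    number of inputs that crashed the parser (should be 0)."""
    rng = random.Random(0)                          # deterministic for repro
    crashes = 0
    for _ in range(iterations):
        blob = bytearray(seed_blob)
        for _ in range(rng.randint(1, 8)):          # mutate a few bytes
            blob[rng.randrange(len(blob))] = rng.randrange(256)
        try:
            parse_fn(bytes(blob))
        except Exception:
            crashes += 1                            # a real harness would save the input
    return crashes
```

A parser that blindly indexes into its input (the analogue of dereferencing an offset from a bad pointer) shows up immediately as a nonzero crash count.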
 
Upvote
13 (13 / 0)

DistinctivelyCanuck

Ars Tribunus Militum
2,677
Subscriptor
We have a customer now that's trying to automate all QA to save money.

I chuckle. Whatever. At least I'm not dealing with their idiot asses.
I worked at a large "northern European" HQ'ed telecoms company: (not saying which one...)

In several of their divisions one of their big pushes is for every single test/QA person to be capable of writing test automation code, and if you're not capable of writing test automation code, you're on the layoff list. (or already gone)

Because "manual" QA is too slow.

The problem is: (I hope to hell all of us realize this) some of the best QA and test people I've ever worked with couldn't write a line of code to literally save their lives (or jobs) but can find bugs, can describe them and can advocate for getting them fixed, and can find stuff that the worlds best automation would never ever find.

And those were the people getting turfed, despite their creativity in finding problems and their effectiveness as advocates for customer-facing issues. "Can't write an automated test case? Buh-bye."

This is for software that runs the complex networks of the world, where a misapplied CI/CD-pipelined blob of code will knock major infrastructure offline. (Just ask Rogers in Canada...)
You want eyeballs and a brain on some aspects of that test cycle...
 
Upvote
45 (45 / 0)

steelcobra

Ars Tribunus Angusticlavius
9,775
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either one of these things would've saved millions or billions of customer dollars.
This exactly. If they'd tested it on even a single VM or bare-metal Windows machine, they'd have noticed the BSODs.

But I think the bigger sin is still that they think it's OK to have a patch flag that ignores client-deployed staging rings and pushes a patch to all devices, so it can't be isolated on the customer side either.
 
Upvote
16 (16 / 0)

evan_s

Ars Tribunus Angusticlavius
7,314
Subscriptor
These content configuration updates sound like they are basically virus definition files. If your software is so crappy that a bad data file like that can crash the system, then your software doesn't sound very good. I wonder if this same crash would be exploitable as a denial-of-service attack by crashing the machines, or possibly even root-level code execution. I won't be at all surprised if this is followed up by one or both of those things.
 
Upvote
14 (16 / -2)

steelcobra

Ars Tribunus Angusticlavius
9,775
Upvote
-6 (8 / -14)

ranthog

Ars Legatus Legionis
15,240
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

This is sloppy DevOps on two major counts. Either one of these things would've saved millions or billions of customer dollars.
The worst part is you could very easily automate rollout to the test farm. You just need a test stage that deploys the thing to a bunch of common configurations in VMs, and maybe some on bare metal.

Once automated testing is done, you can then test the rollback mechanism.
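That gate is straightforward to sketch; the function names and health-check contract here are hypothetical:

```python
def gate_release(update, configs, deploy_fn, rollback_fn) -> bool:
    """Deploy `update` to every test configuration; promote only if all
    stay healthy, otherwise roll the whole farm back.

    `deploy_fn(update, config)` returns True if the config boots and
    passes health checks; `rollback_fn(update, config)` undoes it.
    """
    deployed = []
    for config in configs:
        deployed.append(config)
        if not deploy_fn(update, config):
            for c in deployed:           # any failure: unwind the farm,
                rollback_fn(update, c)   # exercising the rollback path too
            return False                 # do not promote to customers
    return True
```

This also means the rollback mechanism gets tested on every failed run, not just when it's needed in production.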
 
Upvote
9 (9 / 0)

steelcobra

Ars Tribunus Angusticlavius
9,775
From the response:
Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.


They should have been doing this already. This is what the lawsuits should hinge on.
The option was in the standard deployment management console to have staged deployments of all patches.

CrowdStrike hid that they could flag a patch to ignore that and deploy to all anyway.
 
Upvote
10 (12 / -2)