CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

willdude · Jul 24, 2024

Robscura said:
Who Validates the Validator?

I dunno - Coast Guard?

Dzov · Jul 24, 2024

Maybe having, oh I don't know, maybe a file hash to make sure the file isn't corrupt would be helpful to their Content Validator?

mexaly · Jul 24, 2024

So CS Anti-virus scans itself?

elf-stone · Jul 24, 2024

grimmm said:
Swiss cheese theory of holes aligning, a bad template instance shouldn't be able to hose the core sensor, but a second bug might make that possible (especially if broken content is typically weeded out at the validation phase).

https://en.m.wikipedia.org/wiki/Swiss_cheese_model

Nah, it was a comically trivial failure: just a bad file, broken validation code and a lack of testing. Swiss cheese model? There's more holes than cheese.

barich · Jul 24, 2024

When I was in IT, I put more care than this in rolling out updates to a small (<100 device) organization.

Maarten · Jul 24, 2024

It seems that they still have an unchecked NULL pointer dereference in the main code. Even with the faulty "content configuration update" that code should not crash.

Power_Struggle · Jul 24, 2024

Are there already organisations that will pivot to an A/B setup with two different providers for endpoint security?

Considering that this software runs at kernel level, considering that the operational impact can be huge and considering that this isn’t the first time such a thing happens?

A/B might also cover the case where the security software doesn’t protect (yet) but competing software does.

c94 · Jul 24, 2024

So what they’re saying is they need a Content Validator Validator.

Emon · Jul 24, 2024

Robscura said:
Who Validates the Validator?

An underpaid contractor

Necranom · Jul 24, 2024

At some point, there was an MBA (or 12) in the decision making matrix. I promise you they determined it would save a few bucks to ONLY use automated unit tests and cut out the actual test deployments to actual systems.

That lab and the staff to run and maintain it would have cost them a few hundred thousand dollars a year!!! Can't absorb those kinds of operating costs in a multi billion dollar company.... what would the almighty shareholders say???

Dzov · Jul 24, 2024

Emon said:
An underpaid contractor

in some other country making $1/day.

Modus_Derperandi · Jul 24, 2024

Customers will also be able to subscribe to release notes about these updates.

We can get patch notes now??

When can I finger a plan file?

azazel1024 · Jul 24, 2024

markgo said:
So they admit they deployed worldwide all at once. That has been against best practices for large scale deployments for more than two decades. Sue them into oblivion.

Fun little anecdote about Crowdstrikes CEO George Kurtz.

In October 2009, McAfee promoted him to chief technology officer and executive vice president.[13] Six months later, McAfee accidentally disrupted its customers' operations around the world when it pushed out a software update that deleted critical Windows XP system files and caused affected systems to bluescreen and enter a boot loop. "I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today," Ed Bott wrote at ZDNet.[6]

Pulled from the Wiki article about him, but verified through ZDNet article and a couple of others I poked at.

So, not his first rodeo of insufficient testing and bad practices in a group he is leading...

Dragonmaster Lou · Jul 24, 2024

With respect to a couple of comments here, is it even possible to have written this in a memory-safe language? This runs as Windows as a kernel driver and the Windows kernel is pretty much entirely written in C. I haven't written Windows kernel code in 20+ years, so I could be way out of date, but I don't think you can write a Windows kernel driver in Rust or any other memory-safe language, at least not yet.

Second, while the crash was caused by dereferencing a bad pointer (from the excellent crash dump analysis done by David Plummer, it looks like they were adding an offset to a null pointer and dereferencing that), I'm not sure a memory safe language would've necessarily solved the problem here. It's entirely possible that since the file was invalid/corrupt, it may have triggered some other bad behavior even if a memory safe language prevented them from messing with the bad pointer.

Also, everyone, including CrowdStrike, seems to be missing the most important part: your driver itself should validate its input before doing anything with it. The kernel mode component that loads this should do its own validation prior to doing anything else with the file. I don't care what kinds of validation tools you run back in the home base before it's distributed. Any number of bad things could go wrong between when you tested it and when it gets out to a customer's system. In the end, it's up to the software running on the customer's system to validate the data and do something like log the error instead of blue screening the system. Now, their response says that they plan in "improving" the validation done in the "Content Interpreter." Frankly, it looks like whatever validation they did there was nowhere near up to snuff and I'm not going to hold my breath that they're going to do it right this time.

fuzzyfuzzyfungus · Jul 24, 2024

Power_Struggle said:
Are there already organisations that will pivot to an A/B setup with two different providers for endpoint security?

Considering that this software runs at kernel level, considering that the operational impact can be huge and considering that this isn’t the first time such a thing happens?

A/B might also cover the case where the security software doesn’t protect (yet) but competing software does.

I wouldn't be at all surprised to seem some vendor switching; and perhaps some segregation of systems along functional lines(so that you don't, say, lose your check-in kiosks and your SQL servers to the same issue); but (at least if one takes the claims of EDR vendors remotely seriously) just slotting in a mix of systems is going to be really, really, tricky.

While they do typically do classical antivirus stuff, quarantining the low-effort known malicious stuff that comes in; the fancy special sauce you are paying for is mostly anomaly detection and correlation(across endpoints; and often with other sources, eg. Palo Alto's integration of Cortex endpoint with Prisma Access IDS) that intends to detect threatening or suspect behavior even in absence of anything for which detection signatures exist.

If you start going with different systems on different endpoints you cut down the pool across which any one can detect anomalies; and, while any credible product will support SIEM integration, it won't necessarily support it at the granularity of "literally every fiddly little thing the internal model chews on"; and even if it does it's still then up to you/your SIEM vendor to draw the correlations across signals coming in from different endpoint sensors; which is not a trivial operation.

sword_9mm · Jul 24, 2024

Necranom said:
At some point, there was an MBA (or 12) in the decision making matrix. I promise you they determined it would save a few bucks to ONLY use automated unit tests and cut out the actual test deployments to actual systems.

That lab and the staff to run and maintain it would have cost them a few hundred thousand dollars a year!!! Can't absorb those kinds of operating costs in a multi billion dollar company.... what would the almighty shareholders say???

We have a customer now that's trying to automate all QA to save money.

I chuckle. Whatever. At least I'm not dealing with their idiot asses.

Mechjaz · Jul 24, 2024

cyberfunk said:
this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

And literally likely some lives, too.

mir-teiwaz · Jul 24, 2024

To be fair, not testing is a bug.

MedicinalGoat · Jul 24, 2024

Necranom said:
At some point, there was an MBA (or 12) in the decision making matrix. I promise you they determined it would save a few bucks to ONLY use automated unit tests and cut out the actual test deployments to actual systems.

That lab and the staff to run and maintain it would have cost them a few hundred thousand dollars a year!!! Can't absorb those kinds of operating costs in a multi billion dollar company.... what would the almighty shareholders say???

QA/QC doesn't return tangible value on the quarterly report so it must be a waste.

afidel · Jul 24, 2024

I'd be happy if they allowed you to specify which update was applied to a system, so rapid response updates go immediately to QA and then a day later go to prod, applying updates at midnight to QA and 11:59PM to prod might work to minimize the chance of bugs impacting most systems but with prod spread over almost every time zone it might not as well.

RickRoyLeonPrisZhoraRacha · Jul 24, 2024

Microsoft is passing blame at EU for its policies...from 2009! Touche mon ami!

lordcheeto · Jul 24, 2024

They should be bringing in independent auditors.

mgdvt · Jul 24, 2024

They could have tested on ANY computer they owned...they could have purchased one at any store. One computer is all they needed...but no....

Claverhouse · Jul 24, 2024

CrowdStrike says it is making changes to its testing and deployment processes to prevent something like this from happening again.

That seems a very good idea.

Cloudgazer · Jul 24, 2024

It's always the testers fault - it's the engineer's job to put bugs into the code and it's their job to find them.

/s. (sort of).

Ganz · Jul 24, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

From the response:

Implement a staggered deployment strategy for Rapid Response Content in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.

They should have been doing this already. This is what the lawsuits should hinge on.

Tomcat From Mars · Jul 24, 2024

Emon said:
tldr "we'll prevent this by using bog standard industry practices that are literally taught in schools."

Then they give out $10 Uber Eats gift cards as a "thank you" and then even those don't work because they themselves cancelled them after issuance. Can't even roll out a fucking gift card.

Just astonishing.

Edit: a little anecdote from me, I recently interviewed there for an infrastructure/platform engineering position. Development pipeline was literally something my team would own (for that specific area/team not whole company). I went through the full round and got no offer, but it was a huge red flag to me that the hiring manager, when asked why he liked working at CrowdStrike, wouldn't shut up about the "stock price." He seemed not very interested in much beyond the fact that it was "fast growing." I suspect many managers were there for the IPO and don't give a fuck as long as they get their payout. To be fair it was an overall good interview process with no BS and people I know do like working there. But still...

I'm waiting to get some more details on the Uber Eats thing. So far is there is no official statement and just a few people reporting it on twitter. To me it smells like a joke or a scam.

gruberduber · Jul 24, 2024

azazel1024 said:
Fun little anecdote about Crowdstrikes CEO George Kurtz.

In October 2009, McAfee promoted him to chief technology officer and executive vice president.[13] Six months later, McAfee accidentally disrupted its customers' operations around the world when it pushed out a software update that deleted critical Windows XP system files and caused affected systems to bluescreen and enter a boot loop. "I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today," Ed Bott wrote at ZDNet.[6]

Pulled from the Wiki article about him, but verified through ZDNet article and a couple of others I poked at.

So, not his first rodeo of insufficient testing and bad practices in a group he is leading...

Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?

If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another c-suite job at a different company, and do it again. Repeat until you retire.

Look at the resumes of half these CEOs etc... and it's a trail of failure. But it's never them that suffers for it. In any sane world the consequence of failure should be higher if you get paid millions because you're supposed to be so special and important.

How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.

drewcoo · Jul 24, 2024

cyberfunk said:
Yea, you can't just rely on unit tests and validators and sub-module checks, etc. They are good practice but are insufficient to ship. Seems like someone didn't learn that very important lesson in their coding bootcamp.

The embarrassing thing ? I'm a fucking product manager who doesn't write a line of code and I know this. It's so basic that the "dumb product guys" who don't understand all the details of engineering devops get it.

I read this more as "they're blaming the testing teams."
Which makes sense, considering they're generally hired to take the blame when something goes wrong.

So we have situations like this where the single point of failure is clearly the team hired to take the blame for failures. /s

Dark Pumpkin · Jul 24, 2024

drlava said:
<quote>to allow its software to "gather telemetry on possible novel threat techniques."</quote>
deployed a broadening data collection update at midnight to all devices.. this deserves a deeper dive as well.

Here's the Deep Dive analysis of what that means:

This is something anti-virus programs already do to help them detect new threats that haven't been entered into a virus database yet. This wasn't an update to add that capability to CrowdStrike, but rather to update the code involved in that capability.

GrumpyExSpaceDude · Jul 24, 2024

rbirling said:
Ok so is the company liable for any of the downstream damages? (I feel like I know the answer to this.)

This very interesting post goes into the terms of service and basically concludes that "we told you not to use this software on critical systems and if you did, it's on you"

"THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."

https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html

SGJ · Jul 24, 2024

fuzzyfuzzyfungus said:
...

Even the most impeccable testing can only assure you that your inputs won't cause your driver to misbehave; they can't assure you that you will always remain in control of which inputs your driver ends up chewing on.

True, but fuzzing would have greatly increased the likelihood of finding the problem before it caused global chaos. The have accepted this as they are now promising to do this in the future.

Gandoron · Jul 24, 2024

No SDP is just crazy. This wasn't an emergency patch for a zero-day.

DistinctivelyCanuck · Jul 24, 2024

sword_9mm said:
We have a customer now that's trying to automate all QA to save money.

I chuckle. Whatever. At least I'm not dealing with their idiot asses.

I worked at a large "northern European" HQ'ed telecoms company: (not saying which one...)

In several of their divisions one of their big pushes is for every single test/QA person to be capable of writing test automation code, and if you're not capable of writing test automation code, you're on the layoff list. (or already gone)

Because "manual" QA is too slow.

The problem is: (I hope to hell all of us realize this) some of the best QA and test people I've ever worked with couldn't write a line of code to literally save their lives (or jobs) but can find bugs, can describe them and can advocate for getting them fixed, and can find stuff that the worlds best automation would never ever find.

And those were the people getting turfed. Despite creativity in finding problems, being effective advocates for customer-facing issues. "Can't write an automated test case? buh-bye"

This is for software that runs the complex networks of the world: where a mis-applied CICD pipelined blob of code will knock major infrastructure offline. (Just ask Rogers in Canada...)
You want eyeballs and a brain on some aspects of that test cycle...

steelcobra · Jul 24, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

This exactly, if they'd tested it on even a single VM/metal windows machine they'd have noticed the BSODs.

But I think the bigger sin is still that they think it's OK to have a patch flag that ignores client-deployed staging rings and pushes a patch to all devices so that it can't be isolated at the company side too.

evan_s · Jul 24, 2024

These content configuration updates sound like they are basically virus definition files. If your software is so crappy that a bad data file like that can crash the system then your software doesn't sound very good. I wonder if this same crash would be exploitable as a denial of service attack by crashing the machines or possibly even a Root level code execution? I won't be at all surprised if this is followed up by one or both of those things.

rizzo420 · Jul 24, 2024

Yeah like 'Installing it on live computers before we deploy'...you know, the simplest possible solution.

steelcobra · Jul 24, 2024

Sheep Disorder said:
As been addressed far too many times, it's laziness and incompetence not a necessary evil. Some that should also be attributed to Microsoft especially OS design.

No, on this Microsoft is stuck, and beholden to their EU anti-trust agreements from the oughts forcing them to allow 3rd party software to operate like this.
https://www.tomshardware.com/softwa...oid-crowdstrike-like-calamities-in-the-future

ranthog · Jul 24, 2024

cyberfunk said:
How on earth did they not:

1) Have staggered rollout for such mission critical stuff.
2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit test content validator thing that's not actually a real End-to-End test ?

It sounds like they test their content updates with a parser. Fine.. that's great.. but it's insufficient. Proper End-to-End systems testing is absolutely table stakes for this type of stuff.. this isn't just some random nodeJS module where you aren't necessarily culpable for downstream effects of breaking changes.

this is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.

The worst part is you could very easily automate roll out to the test farm. You just need a test side to deploy the thing to a bunch of common configurations in VM's and maybe some on bare metal.

Once automated testing is done, you can then test the rollback mechanism.

steelcobra · Jul 24, 2024

Ganz said:
From the response:

They should have been doing this already. This is what the lawsuits should hinge on.

The option was in the standard deployment management console to have staged deployments of all patches.

Crowdstrike hid that they could flag a patch to ignore that and deploy to all anyways.

CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Legatus Legionis

Ars Tribunus Militum

Ars Praetorian

Smack-Fu Master, in training

Account Banned

Wise, Aged Ars Veteran

Ars Legatus Legionis

Smack-Fu Master, in training

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Legatus Legionis

Ars Praefectus

Ars Praetorian

Ars Centurion

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Praefectus

Seniorius Lurkius

Ars Centurion

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Centurion

Wise, Aged Ars Veteran

Wise, Aged Ars Veteran

Ars Scholae Palatinae

Smack-Fu Master, in training

Ars Praetorian

Wise, Aged Ars Veteran

Ars Tribunus Militum

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Centurion

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Tribunus Angusticlavius