CrowdStrike fixes start at “reboot up to 15 times” and get more complex from there

Peevester · Jul 20, 2024

stormcrash said:
A design choice that every OS makes in one form or another. Certain things need to run at a privileged level to work

And malware monitoring is one of those, either by kernel APIs or direct memory access under a system privilege. You can't monitor what an operating system's internals are doing from userspace.

That's why third-party antivirus and similar tools are a bad idea - there is always going to be some disconnect between the host OS's software and what the AV THINKS the host's OS is. Of course, that doesn't stop FIRST party AV from causing problems either, but at least there's some chance of things being on the same page.

A small part of crowdstrike's concept is decent - a tool that uses a huge base of users to monitor and set alarms for bad behaviors seen out in the world has a good chance of seeing an attack early on. But detection and direct action are two different things.

EDIT: Others brought up a good point that only the minimum functionality that requires it should be in kernel space, and that's absolutely right. Since this thing crashed doing nothing that couldn't be done in userspace, there's really no excuse other than laziness and the lack of good integration testing.

schnackenpfefferhausen · Jul 20, 2024

Emon said:
Sure they are, because hiring people to do things is expensive, so they buy a "turnkey solution" from a vendor like CrowdStrike who assure them that everything will work flawlessly. The CrowdStrike salesperson convinces the ignorant execs to put their eggs in one basket. The product checks all the boxes and the ignorant execs are too stupid to realize the sales people are exploiting their ignorance.

BigFire · Jul 20, 2024

tackhouse1 said:
I'm wondering how CrowdStrike as a company fares from this issue? Stock is currently down 10%

Got to see who shorted them just prior.

dragongoddess · Jul 20, 2024

adamsc said:
This bypassed that mechanism. The problem for security products is that attackers can adjust far more rapidly than most IT departments so there’s a bias to ship updates to the definitions (not the code, the patterns it looks for) as quickly as possible. Unfortunately, while their customers are expecting them to have rigorous testing and robust code, Crowdstrike let them down badly. This should have been caught before it shipped, the code should have validated it better, and they should have followed the best practices of the 1980s to disable something causing repeated failures.

They do not do a real world test of the code?
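
For what it's worth, the "disable something causing repeated failures" idea from the quoted post can be sketched in a few lines. This is only an illustration under assumed names (`LoadGuard`, `MAX_ATTEMPTS`, and the file name are all hypothetical), not how any CrowdStrike component works. The trick is to record the attempt before loading and clear it only after success, so a crash mid-load leaves the counter behind. In a real system that state would have to survive a reboot; here it just lives in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

MAX_ATTEMPTS = 3  # hypothetical threshold, not a real product setting


@dataclass
class LoadGuard:
    """Counts unfinished load attempts per file and quarantines repeat offenders."""
    pending: Dict[str, int] = field(default_factory=dict)
    quarantined: Set[str] = field(default_factory=set)

    def begin(self, name: str) -> bool:
        """Record an attempt up front; return False if the file must be skipped."""
        if name in self.quarantined:
            return False
        attempts = self.pending.get(name, 0) + 1
        if attempts > MAX_ATTEMPTS:
            self.quarantined.add(name)
            self.pending.pop(name, None)
            return False
        self.pending[name] = attempts
        return True

    def succeeded(self, name: str) -> None:
        """Clear the counter once the file has loaded cleanly."""
        self.pending.pop(name, None)


def simulate_boots(guard: LoadGuard, name: str, crashes: bool, boots: int) -> List[str]:
    """Simulate several boots and report what happened on each one."""
    outcomes = []
    for _ in range(boots):
        if not guard.begin(name):
            outcomes.append("skipped")
        elif crashes:
            outcomes.append("crashed")  # succeeded() is never reached
        else:
            guard.succeeded(name)
            outcomes.append("loaded")
    return outcomes
```

With a file that always crashes, the simulated machine crashes three times and then boots with the file skipped, instead of looping forever.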

BigFire · Jul 20, 2024

mygeek911 said:
Awaiting the sequel: The Crowd Strikes Back

This IS the sequel. Crowdstrike's CEO's previous rodeo was at McAfee, and it went so poorly the company was sold to Intel.

Fuzzypiggy · Jul 20, 2024

jasonridesabike said:
I feel for all those poor souls responsible for fixing this on their network, many of whom will be personally blamed by non-technical higher ups (at least at first). I've been there.

Comes with the territory and that's part of why we in IT get paid so well. We get paid sometimes to simply take all the shit, shut up and with any luck, dream of early retirement.

kkono · Jul 20, 2024

Microsoft needs to come down hard on CrowdStrike - clean up your Q/A or we ban your binaries from our systems.

Computing has become so integrated into critical systems (hospitals, transportation, utilities, banking, 911) that if this sort of thing keeps happening, the government will be pressured to start regulating software like they do with drug approval, building codes, environmental regulations, fcc, etc; and I don’t think anyone wants that.

fuzzyfuzzyfungus · Jul 20, 2024

kkono said:
Microsoft needs to come down hard on CrowdStrike - clean up your Q/A or we ban your binaries from our systems.

Computing has become so integrated into critical systems (hospitals, transportation, utilities, banking, 911) that if this sort of thing keeps happening, the government will be pressured to start regulating software like they do with drug approval, building codes, environmental regulations, fcc, etc; and I don’t think anyone wants that.

That's an...awkward...proposal when Microsoft has a directly competing suite of various Defender for X (endpoint, cloud, "defender XDR", etc.)

Not to say that Crowdstrike doesn't deserve a considerable beating for this episode; but when MS runs their own 'security' products group as a distinct, 20-billion-odd-dollar enterprise (not just an internal team that is integrated with their various other product groups in an attempt to do more secure development), it would be a fairly alarming trip to antitrust town if they just started going nuclear on their competitors using their control of authenticode signing requirements, even if the competitor in question is really asking for it at a given moment.

SeanJW · Jul 20, 2024

Stern said:
If Hector Martin's analysis is correct, Crowdstrike do file parsing in kernel space, and the driver shat itself on a malformed update file. I thought Tavis Ormandy shamed security companies into not doing stupid shit like that years ago.

You have to do some parsing in kernel space to validate what comes from the user side of things. Of course, it should be as simple and bullet-proof as possible to avoid shitting the bed - if you're dealing with untrusted input, user-land should be crunching it down to something sane to throw across the fence and the kernel side just goes "yeah, that won't make me explode" or rejecting it if it would.
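
A rough sketch of what "yeah, that won't make me explode" could look like, written in Python for readability rather than as kernel code. The file layout here (a `DEF1` magic, a record count, length-prefixed records) and the limits are invented for illustration and have nothing to do with CrowdStrike's real format. The point is only that every length and count is checked against the buffer before it is used, and anything odd is rejected instead of dereferenced.

```python
import struct
from typing import List

MAGIC = b"DEF1"        # hypothetical magic bytes
MAX_RECORDS = 1024     # hypothetical sanity limits
MAX_RECORD_LEN = 4096


class RejectedUpdate(ValueError):
    """Raised when an update blob fails validation."""


def validate(blob: bytes) -> List[bytes]:
    """Return the records in blob, or raise RejectedUpdate if anything is off."""
    if len(blob) < 6 or blob[:4] != MAGIC:
        raise RejectedUpdate("bad header")
    (count,) = struct.unpack_from("<H", blob, 4)
    if count == 0 or count > MAX_RECORDS:
        raise RejectedUpdate("implausible record count")
    offset = 6
    records = []
    for _ in range(count):
        if offset + 2 > len(blob):
            raise RejectedUpdate("truncated record header")
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        # Check the claimed length against what is actually in the buffer.
        if length == 0 or length > MAX_RECORD_LEN or offset + length > len(blob):
            raise RejectedUpdate("record length out of bounds")
        records.append(blob[offset:offset + length])
        offset += length
    if offset != len(blob):
        raise RejectedUpdate("trailing bytes")
    return records
```

A blob of all zeros fails at the very first check and is rejected; nothing downstream ever sees it.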

aapis · Jul 20, 2024

I don’t know about you, but I require 15 reboots to start and stop all the software I write. I’m reminded of the immortal words of John ComputerScience, the founder of computers, “if at first you don’t succeed, turn it off and on again”.

Lee L · Jul 20, 2024

wigry said:
Seems like an easy backdoor straight to the world's critical infrastructure. This time it exploded but next time it might be a keylogger or ransomware or whatever sneaky code remaining hidden and doing its thing in the background. Classical supply chain vulnerability and that's quite a scary proposition.

Isn’t that the way the Solarwinds hack worked?

xoe · Jul 20, 2024

Not a perfect use of

www.xkcd.com/2347

but still pretty apt.

JinxOnYou · Jul 20, 2024

I actually applied for a QA position at Crowdstrike a little while ago. Red flags in the interview, where the hiring manager said they were 'badly under-resourced', and had just been through a bad couple of months where they were doing tons of overtime.

Very glad now that I didn't end up in that job.

LieutenantLefse · Jul 20, 2024

Davidoff said:
Well, they are already known as 'ClownStrike' due to their propensity for releasing showstopper updates. The last one was back in April when they borked Linux systems with a defective Falcon sensor update, quite similar to this one.

Do you have a link or more info on that? I'm finding it impossible to search for amid the hundreds of articles on the current debacle.

Navalia Vigilate · Jul 20, 2024

In the early days of a vulnerability scanning company that is popular today, we scanned the R&D, Research, Lab, and Corporate networks up to six times per day to ensure that nothing would break. The Lab was specifically full of N, N-1, N-2 versions of major operating systems and tier 1 applications and then down to as low as N versions of 2nd tier and some 3rd tier OSes and applications. The Internet side was scanned twice per day. The IT group workstations and a few servers had some destructive plugins turned on, including printer plugins.

We did not have customer problems and we pushed out scanning plugin updates ad-hoc. This outage was not necessary. It was a choice.

TechBumbler · Jul 20, 2024

I'm not in any way a security or developer expert. But it just occurred to me that security software companies should have a step in the middle if they use push updates to customer computers. Like a preview that happens with Browserstack. For Crowdstrike it could be one or a small set of computers that get the update first.

If that first update step causes no problems, then the next step can happen to push to the other computers at the organization. I know not all computers are the same, so as with Browserstack, the organization can configure a "push test" OS virtual machine for each OS that they manage. One each for the client machines, servers, etc.

I also know the Devil is in the details, but anything like this would go a long way towards stopping the worst disasters like the one that just happened. My basic point is that testing at the source is not enough. There needs to be a guinea pig computer in between Crowdstrike-style security apps and the customer computer population.

Is there anything like this already in use?
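
The "guinea pig computer" idea above is usually called a staged or ring-based rollout. A minimal sketch, assuming a hypothetical `apply_update` callback that pushes the update to one host and returns True only if the host is still healthy afterwards (the host names are made up as well):

```python
from typing import Callable, Dict, List


def staged_rollout(rings: List[List[str]],
                   apply_update: Callable[[str], bool]) -> Dict[str, object]:
    """Push an update ring by ring and stop at the first ring with a failure."""
    updated: List[str] = []
    for index, ring in enumerate(rings):
        failed = [host for host in ring if not apply_update(host)]
        updated.extend(host for host in ring if host not in failed)
        if failed:
            # Later rings never receive the update.
            return {"halted_at_ring": index, "failed": failed, "updated": updated}
    return {"halted_at_ring": None, "failed": [], "updated": updated}


# Ring 0 is the test VM, then a small pilot group, then everyone else.
rings = [["canary-vm"], ["pc-1", "pc-2"], ["srv-1", "srv-2"]]
```

If the update breaks the canary, the rollout halts at ring 0 and no production machine is touched. The obvious tradeoff is delay, which matters when the update is a response to an active attack.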

audincli9 · Jul 20, 2024

50me12 said:
Pic stolen from reddit of those poor souls fixing things:
View: https://co.reddit.com/r/delta/comments/1e73d0r/manual_bitlocker_recovery_on_every_machine/

Ouch!

richgroot · Jul 20, 2024

I can't begin to estimate the number of "This can't ever happen again" emails that are being drafted right now by CxOs all over the world.

Trondal · Jul 20, 2024

markgo said:
“I have returned my bonus from last year and have dedicated the company’s financial reserves to make all affected customers whole again”, he continued.

/jk

Not that he won’t stay rich no matter what happens from here but he almost certainly has millions of wealth tied up in their stock.

So he did indeed feel this, at least a bit.

marsilies · Jul 20, 2024

gru said:
Can somebody tell me why most of the corporate world installed an AV (which is basically a spyware) from a company established by a russki?

Maybe because Dmitri Alperovitch, the one co-founder you seem to be referring to, came over to the US in 1994, when he was 14, and is a naturalized citizen of the US, so there's no reason to think he's a Russian agent.

https://en.wikipedia.org/wiki/Dmitri_Alperovitch

Decoherent · Jul 20, 2024

arcite · Jul 20, 2024

CrowdStrike outage affected 8.5 million Windows devices, Microsoft says

adamsc · Jul 20, 2024

dragongoddess said:
They do not do a real world test of the code?

That’s the $65,536 question: they should test that heavily before shipping it but this instantly failed on so many systems that it calls into question whether they actually do. Simply launching a clean Windows build should have caught such a gross failure.

philipjohnstephens · Jul 20, 2024

drkstar82 said:
As someone who has used Crowdstrike at two jobs now for about 8 years, this is the first and only major issue I have seen from them. Unfortunately it's a massive issue.

It makes me wonder how long their faulty parser code has been in their kernel driver. It sounds like it was a time bomb just waiting to go off. It also makes me wonder how CrowdStrike can be trusted given that they couldn't even write a hardened parser designed to run in kernel mode. Did they not write any unit tests to verify that the parser wouldn't fall over when faced with malformed input? This is Computer Science 101 stuff, and CrowdStrike failed the course.
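
The kind of test I mean looks roughly like this. It is a toy, with a made-up one-record format and hypothetical names throughout: throw a handful of edge cases plus random bytes at the parser and insist that it either returns normally or raises its own `ParseError`, never anything else.

```python
import random
from typing import Callable, List


class ParseError(ValueError):
    """The only failure a well-behaved parser is allowed to raise."""


def parse_record(blob: bytes) -> bytes:
    """Toy parser: first byte is the payload length, the rest must match it."""
    if not blob:
        raise ParseError("empty input")
    if len(blob) - 1 != blob[0]:
        raise ParseError("length mismatch")
    return blob[1:]


def naive_parse(blob: bytes) -> bytes:
    """Same format with no checks: indexes blob[0] even when blob is empty."""
    length = blob[0]
    return blob[1:1 + length]


def fuzz(parser: Callable[[bytes], bytes], iterations: int = 2000, seed: int = 0) -> List[bytes]:
    """Return every input that made parser fail with something other than ParseError."""
    rng = random.Random(seed)
    cases = [b"", b"\x00", b"\xff"]  # fixed edge cases first
    for _ in range(iterations):
        size = rng.randrange(0, 32)
        cases.append(bytes(rng.randrange(256) for _ in range(size)))
    offenders = []
    for blob in cases:
        try:
            parser(blob)
        except ParseError:
            pass  # controlled rejection is fine
        except Exception:
            offenders.append(blob)
    return offenders
```

Running `fuzz(parse_record)` comes back empty, while `fuzz(naive_parse)` flags the empty input straight away. In user space that is an unhandled exception; the kernel-mode equivalent is a bugcheck.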

sjl · Jul 20, 2024

mohnish82 said:
Why do they allow their OS to be crippled by a defective driver?

It's a universal problem, not limited to Microsoft. A bad kernel module will kill a Linux install as well. A driver, pretty much by definition, has to run with kernel level privileges; and at that level, a mistake in the code cannot be trapped - it's going to bring the system down.

Some things that are currently kernel modules can be moved to userspace - but some things cannot. (And doing so does bring certain tradeoffs - for example, GPU drivers can be in userspace, but there is a performance hit in doing so. Given how complex GPU drivers are these days, that's a worthwhile tradeoff IMO; but it is a tradeoff.)

CrowdStrike made the choice - rightly or wrongly - to implement their code as a kernel level driver. Their code caused these crashes. Ergo, CrowdStrike is wholly to blame for this. Microsoft might be able to implement improvements that allow more stuff that's currently in the kernel to move into userspace - but that's a separate issue.

If you write stuff that runs with kernel-level privileges, it is on you to make sure your code is robust. There is only so much that the OS vendor can do to limit the damage in that scenario.

philipjohnstephens · Jul 20, 2024

dwsdwsdws said:
Minimizing the amount of kernel-mode code is a well-known security best practice, but evidently not at CrowdStrike.

It is quite frightening that we don't know just how badly written these systems are until a major breakage occurs. But it's unfortunately also not surprising in the least.

Nexus · Jul 20, 2024

Koga Onutalepo said:
If you can get into recovery/safe mode command prompt and into your C drive:

Code:

C:
cd .\Windows\System32\Drivers\CrowdStrike
del C-00000291*.sys

Great, unless you have company required bitlocker or winmagic or other drive encryption across all your servers and have to put in the 64 character security key across 500+ servers.

Decoherent · Jul 20, 2024

NetMage said:
Turns out you can bypass needing the recovery key by going into Windows RE, skipping the Bitlocker prompts, and then using bcdedit to turn on safe boot before letting it reboot. You will boot into safe mode, where you can log in with a local admin account and delete the file.

No. This is wrong. Nation-states and gigantic tech companies use Bitlocker, and it would never, ever have seen the light of day if there was such a trivial work-around.

alansh42 · Jul 20, 2024

Decoherent said:
No. This is wrong. Nation-states and gigantic tech companies use Bitlocker, and it would never, ever have seen the light of day if there was such a trivial work-around.

This still requires authentication with a local administrator account. The goal of Bitlocker is to deny access to someone without credentials, like attaching the drive to another system or using a boot thumb drive.

sjl · Jul 20, 2024

peppeddu said:
Who the eff allows automatic updates on live production systems?
Lots of sysadmins need to be fired.

Speaking as a former sysadmin - you’re grossly ignorant.

Software updates? Sure, put them through testing; but some updates (major security fixes) have to be expedited.

But this wasn’t a software update. It was a definitions file update. That’s a very low risk, and it needs to be done quickly.

The fact that the data was malformed? If the driver had been properly written, it would have been rejected. But the driver is buggy, so it crashed.

No sysadmin could have expected or reasonably defended against this. You have to assume a basic level of competence from the vendor, and that assumption is what broke. And when the software does the update automatically, with no option to delay, the responsibility falls squarely upon the vendor. This was the case here.

You’ve obviously never had responsibility for a decent number of production servers. I have. And there’s no way that I’d blame the sysadmins for this.

Deleted member 388703 · Jul 20, 2024

Nexus said:
Great, unless you have company required bitlocker or winmagic or other drive encryption across all your servers and have to put in the 64 character security key across 500+ servers.

Yeah. Luckily, we had Active Directory able to recover most keys for those with permissions to them.

Extra fun if the PC has multiple stored bitlocker keys for us to hand-type.

alansh42 · Jul 20, 2024

I assume someone will analyze the crash dumps and see what the actual bug was. I'm kinda curious.

Chuckstar · Jul 20, 2024

philipjohnstephens said:
It makes me wonder how long their faulty parser code has been in their kernel driver. It sounds like it was a time bomb just waiting to go off. It also makes me wonder how CrowdStrike can be trusted given that they couldn't even write a hardened parser designed to run in kernel mode. Did they not write any unit tests to verify that the parser wouldn't fall over when faced with malformed input? This is Computer Science 101 stuff, and CrowdStrike failed the course.

I keep going back and forth in my mind on this. I think ultimately my feelings on the matter would depend on what the QA process actually was. On the one hand, best-in-class QA for drivers even theoretically cannot 100% ensure against kernel panics. On the other hand, shitty QA almost guarantees kernel panics. The existence of this kernel panic cannot by itself tell us whether it was a fluke or inevitable.

There certainly does seem to be a problem with their testing of the definitions files, though, since one would imagine anti-virus companies testing definitions files before rolling them out. Did they have no system in place to identify that new definitions files were crashing machines?

EDIT: If the kernel panics only occurred with some relatively low-percentage type of configuration, I might have some sympathy for CrowdStrike, as you can only do so much testing on definitions before rolling them out, especially given the time-sensitive nature. But it sounds like every Windows machine with that driver version crashes on these definitions. How could that not get picked up in testing?

odikweos · Jul 20, 2024

perholmes said:
This underscores what a terrifying responsibility it is to push out updates. I'm basically shaking when we push out updates to our product, especially because iOS/Android deployments are essentially impossible to debug. At least on desktop, we can get people to go delete a file. We can't even do that on mobile. We rely on a witches' brew of safe modes.

I can't tell if CrowdStrike were sloppy in their testing. But in all likelihood, they just tested on systems that were a little too perfectly configured, and when it hit the real world, it exploded. And maybe their rollout wasn't tiered enough.

My sympathies. Having your code be a core driver on many of the world's systems is as awesome as that responsibility can get.

Second this. Anyone who's ever worked on a kernel mode driver sympathizes with these folks, but at the end of the day that's what they get paid for.

el_oscuro · Jul 20, 2024

Civitello said:
Not a perfect use of

www.xkcd.com/2347

but still pretty apt.

The irony is that about 10 years ago, ImageMagick did break and brought our website to its knees, affecting hundreds of thousands of users.
