CrowdStrike fixes start at “reboot up to 15 times” and get more complex from there

A design choice that every OS makes in one form or another. Certain things need to run at a privileged level to work.
And malware monitoring is one of those, either by kernel APIs or direct memory access under a system privilege. You can't monitor what an operating system's internals are doing from userspace.

That's why third-party antivirus and similar tools are a bad idea - there is always going to be some disconnect between the host OS's software and what the AV THINKS the host's OS is. Of course, that doesn't stop FIRST party AV from causing problems either, but at least there's some chance of things being on the same page.

A small part of CrowdStrike's concept is decent - a tool that uses a huge base of users to monitor and set alarms for bad behaviors seen out in the world has a good chance of seeing an attack early on. But detection and direct action are two different things.

EDIT: Others brought up a good point that only the minimum functionality that requires it should be in kernel space, and that's absolutely right. Since this thing crashed doing nothing that couldn't be done in userspace, there's really no excuse other than laziness and the lack of good integration testing.
 
Sure they are, because hiring people to do things is expensive, so they buy a "turnkey solution" from a vendor like CrowdStrike who assure them that everything will work flawlessly. The CrowdStrike salesperson convinces the ignorant execs to put their eggs in one basket. The product checks all the boxes and the ignorant execs are too stupid to realize the sales people are exploiting their ignorance.
 
This bypassed that mechanism. The problem for security products is that attackers can adjust far more rapidly than most IT departments, so there’s a bias to ship updates to the definitions (not the code, the patterns it looks for) as quickly as possible. Unfortunately, while their customers are expecting them to have rigorous testing and robust code, CrowdStrike let them down badly. This should have been caught before it shipped, the code should have validated it better, and they should have followed the best practices of the 1980s to disable something causing repeated failures.
They do not do a real world test of the code?
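For anyone wondering what "disable something causing repeated failures" might look like, here is a rough Python sketch. It is not CrowdStrike's mechanism; the class name, the threshold, the state file and the `defs.bin` file name are all made up for illustration. The idea is to record each load attempt on disk before touching the file, so that a crash in the middle of loading still leaves a trace for the next boot.

```python
import json
import tempfile
from pathlib import Path

MAX_FAILED_LOADS = 3  # hypothetical threshold, not taken from any real product


class ContentLoader:
    """Tracks unfinished load attempts per content file across restarts."""

    def __init__(self, state_path: Path):
        self.state_path = state_path
        # State survives reboots because it lives on disk, not in memory.
        self.state = json.loads(state_path.read_text()) if state_path.exists() else {}

    def _save(self) -> None:
        self.state_path.write_text(json.dumps(self.state))

    def should_load(self, name: str) -> bool:
        # Skip a file once it has failed to finish loading too many times in a row.
        return self.state.get(name, 0) < MAX_FAILED_LOADS

    def begin_load(self, name: str) -> None:
        # Count the attempt BEFORE loading so a crash mid-load is still recorded.
        self.state[name] = self.state.get(name, 0) + 1
        self._save()

    def load_succeeded(self, name: str) -> None:
        # A clean load resets the counter.
        self.state[name] = 0
        self._save()


def simulate_boots(state_path: Path, name: str, boots: int, crashes: bool) -> int:
    """Simulate `boots` restarts and return how many load attempts were made."""
    attempts = 0
    for _ in range(boots):
        loader = ContentLoader(state_path)  # fresh "process" after each reboot
        if not loader.should_load(name):
            continue  # file is quarantined; boot carries on without it
        loader.begin_load(name)
        attempts += 1
        if crashes:
            continue  # simulated crash: load_succeeded() never runs
        loader.load_succeeded(name)
    return attempts


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(simulate_boots(Path(d) / "state.json", "defs.bin", 10, crashes=True))
```

With a file that crashes on every load, the simulated machine stops trying after `MAX_FAILED_LOADS` attempts instead of looping forever; a healthy file keeps loading on every boot.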
 

Fuzzypiggy

Ars Scholae Palatinae
1,108
I feel for all those poor souls responsible for fixing this on their network, many of whom will be personally blamed by non-technical higher ups (at least at first). I've been there.

Comes with the territory and that's part of why we in IT get paid so well. We get paid sometimes to simply take all the shit, shut up and with any luck, dream of early retirement.
 

kkono

Smack-Fu Master, in training
50
Microsoft needs to come down hard on CrowdStrike - clean up your QA or we ban your binaries from our systems.

Computing has become so integrated into critical systems (hospitals, transportation, utilities, banking, 911) that if this sort of thing keeps happening, the government will be pressured to start regulating software like they do with drug approval, building codes, environmental regulations, FCC, etc.; and I don’t think anyone wants that.
 
Microsoft needs to come down hard on CrowdStrike - clean up your QA or we ban your binaries from our systems.

Computing has become so integrated into critical systems (hospitals, transportation, utilities, banking, 911) that if this sort of thing keeps happening, the government will be pressured to start regulating software like they do with drug approval, building codes, environmental regulations, FCC, etc.; and I don’t think anyone wants that.

That's an...awkward...proposal when Microsoft has a directly competing suite of various Defender for X products (endpoint, cloud, "Defender XDR", etc.).

Not to say that CrowdStrike doesn't deserve a considerable beating for this episode; but when MS runs their own 'security' products group as a distinct, 20-billion-odd-dollar enterprise (not just an internal team that is integrated with their various other product groups in an attempt to do more secure development), it would be a fairly alarming trip to antitrust town if they just started going nuclear on their competitors using their control of Authenticode signing requirements, even if the competitor in question is really asking for it at a given moment.
 

SeanJW

Ars Legatus Legionis
11,979
Subscriptor++
If Hector Martin's analysis is correct, CrowdStrike do file parsing in kernel space, and the driver shat itself on a malformed update file. I thought Tavis Ormandy shamed security companies into not doing stupid shit like that years ago.
You have to do some parsing in kernel space to validate what comes from the user side of things. Of course, it should be as simple and bullet-proof as possible to avoid shitting the bed - if you're dealing with untrusted input, user-land should be crunching it down to something sane to throw across the fence, and the kernel side just goes "yeah, that won't make me explode" or rejects it if it would.
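To make that split concrete, here is a toy sketch in Python (real driver code would be C or similar, and this says nothing about how Falcon actually works). The `DEF1` magic, the record layout and the limits are invented for the example. The user-space side does the messy conversion into fixed-size records; the privileged side only checks lengths and ranges and returns a yes/no without ever raising.

```python
import struct

MAGIC = b"DEF1"                # hypothetical format marker
RECORD = struct.Struct("<HH")  # hypothetical record: (rule_id, action), little-endian
HEADER_LEN = len(MAGIC) + 4    # magic + 32-bit record count
MAX_RECORDS = 1024             # hypothetical upper bound
VALID_ACTIONS = {0, 1, 2}      # hypothetical action codes


def userland_pack(rules):
    """User-space side: crunch loosely structured input into a simple fixed layout."""
    body = b"".join(RECORD.pack(int(r["id"]), int(r["action"])) for r in rules)
    return MAGIC + struct.pack("<I", len(rules)) + body


def kernel_accepts(blob: bytes) -> bool:
    """Privileged side: cheap, exhaustive checks; returns False instead of raising."""
    if len(blob) < HEADER_LEN or blob[:4] != MAGIC:
        return False
    (count,) = struct.unpack_from("<I", blob, 4)
    # The declared count must match the actual size exactly, so no read can run past the end.
    if count > MAX_RECORDS or len(blob) != HEADER_LEN + count * RECORD.size:
        return False
    for i in range(count):
        _rule_id, action = RECORD.unpack_from(blob, HEADER_LEN + i * RECORD.size)
        if action not in VALID_ACTIONS:
            return False
    return True


if __name__ == "__main__":
    good = userland_pack([{"id": 1, "action": 0}, {"id": 2, "action": 2}])
    print(kernel_accepts(good), kernel_accepts(good[:-1]), kernel_accepts(bytes(len(good))))
```

The point of the design is that the privileged check is short enough to review line by line: every offset it reads is covered by the length test that precedes it.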
 

Lee L

Ars Praefectus
3,572
Subscriptor++
Seems like an easy backdoor straight to the world's critical infrastructure. This time it exploded but next time it might be a keylogger or ransomware or whatever sneaky code remaining hidden and doing its thing in the background. Classic supply chain vulnerability, and that's quite a scary proposition.
Isn’t that the way the SolarWinds hack worked?
 

JinxOnYou

Smack-Fu Master, in training
12
I actually applied for a QA position at CrowdStrike a little while ago. There were red flags in the interview: the hiring manager said they were 'badly under-resourced', and had just been through a bad couple of months where they were doing tons of overtime.

Very glad now that I didn't end up in that job.
 

LieutenantLefse

Ars Scholae Palatinae
1,166
Subscriptor++
Well, they are already known as 'ClownStrike' due to their propensity for releasing showstopper updates. The last one was back in April when they borked Linux systems with a defective Falcon sensor update, quite similar to this one.

Do you have a link or more info on that? I'm finding it impossible to search for amid the hundreds of articles on the current debacle.
 

Navalia Vigilate

Ars Praefectus
3,180
Subscriptor++
In the early days of a vulnerability scanning company that is popular today, we scanned the R&D, Research, Lab, and Corporate networks up to six times per day to ensure that nothing would break. The Lab was specifically full of N, N-1, N-2 versions of major operating systems and tier 1 applications, and then down to as low as N versions of 2nd tier and some 3rd tier OSes and applications. The Internet side was scanned twice per day. The IT group workstations and a few servers had some destructive plugins turned on, including printer plugins.

We did not have customer problems and we pushed out scanning plugin updates ad-hoc. This outage was not necessary. It was a choice.
 
I'm not in any way a security or developer expert. But it just occurred to me that security software companies should have a step in the middle if they use push updates to customer computers. Like a preview that happens with BrowserStack. For CrowdStrike it could be one or a small set of computers that get the update first.

If that first update step causes no problems, then the next step can happen to push to the other computers at the organization. I know not all computers are the same, so as with BrowserStack, the organization can configure a "push test" OS virtual machine for each OS that they manage. One each for the client machines, servers, etc.

I also know the Devil is in the details, but anything like this would go a long way towards stopping the worst disasters like the one that just happened. My basic point is that testing at the source is not enough. There needs to be a guinea pig computer in between CrowdStrike-style security apps and the customer computer population.

Is there anything like this already in use?
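The "guinea pig first" idea described above is usually called a staged or ring-based rollout. A minimal Python sketch of the control flow, assuming a hypothetical `apply_update` callback that pushes the update to one machine and a hypothetical `is_healthy` check (say, "did it come back up and report in?"); the host names are made up too:

```python
from typing import Callable, List


def staged_rollout(
    rings: List[List[str]],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Push an update ring by ring and stop after the first ring that looks unhealthy.

    Returns the machines that actually received the update.
    """
    updated: List[str] = []
    for ring in rings:
        for host in ring:
            apply_update(host)
            updated.append(host)
        # Gate: only move on to the next, larger ring if every machine in this one is fine.
        if not all(is_healthy(host) for host in ring):
            break
    return updated


if __name__ == "__main__":
    rings = [["canary-vm"], ["pc-1", "pc-2"], ["srv-1"]]
    # An update that breaks every machine it touches never gets past the canary.
    print(staged_rollout(rings, lambda host: None, lambda host: False))
```

With a bad update only the first ring is affected, which is the whole point: the blast radius is one test VM rather than the whole fleet. The cost is delay, which is the tradeoff for time-sensitive definitions updates.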
 

Trondal

Ars Scholae Palatinae
961
Subscriptor
“I have returned my bonus from last year and have dedicated the company’s financial reserves to make all affected customers whole again”, he continued.

/jk
Not that he won’t stay rich no matter what happens from here but he almost certainly has millions of wealth tied up in their stock.

So he did indeed feel this, at least a bit.
 

marsilies

Ars Legatus Legionis
24,539
Subscriptor++
Can somebody tell me why most of the corporate world installed an AV (which is basically a spyware) from a company established by a russki?
Maybe because Dmitri Alperovitch, the one co-founder you seem to be referring to, came over to the US in 1994, when he was 14, and is a naturalized citizen of the US, so there's no reason to think he's a Russian agent.

https://en.wikipedia.org/wiki/Dmitri_Alperovitch
 

adamsc

Ars Praefectus
4,303
Subscriptor++
They do not do a real world test of the code?

That’s the $65,536 question: they should test that heavily before shipping it but this instantly failed on so many systems that it calls into question whether they actually do. Simply launching a clean Windows build should have caught such a gross failure.
 
As someone who has used CrowdStrike at two jobs now for about 8 years, this is the first and only major issue I have seen from them. Unfortunately it's a massive issue.
It makes me wonder how long their faulty parser code has been in their kernel driver. It sounds like it was a time bomb just waiting to go off. It also makes me wonder how CrowdStrike can be trusted given that they couldn't even write a hardened parser designed to run in kernel mode. Did they not write any unit tests to verify that the parser wouldn't fall over when faced with malformed input? This is Computer Science 101 stuff, and CrowdStrike failed the course.
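As an illustration of the kind of test being asked about, here is a small Python fuzz harness. The file format (a 2-byte count followed by length-prefixed entries) and all the names are invented; the technique is what matters: throw truncated, bit-flipped and random inputs at the parser and accept only a clean, deliberate rejection.

```python
import random
import struct


class MalformedInput(Exception):
    """The only failure a hardened parser is allowed to produce."""


def parse_entries(blob: bytes):
    """Parse a hypothetical format: 2-byte count, then entries with a 1-byte length prefix."""
    if len(blob) < 2:
        raise MalformedInput("missing header")
    (count,) = struct.unpack_from("<H", blob, 0)
    pos, entries = 2, []
    for _ in range(count):
        if pos >= len(blob):
            raise MalformedInput("truncated entry length")
        size = blob[pos]
        pos += 1
        if pos + size > len(blob):
            raise MalformedInput("entry overruns buffer")
        entries.append(blob[pos:pos + size])
        pos += size
    if pos != len(blob):
        raise MalformedInput("trailing bytes")
    return entries


def survives_fuzzing(parser, valid: bytes, rounds: int = 2000, seed: int = 0) -> bool:
    """Return True if the parser only ever succeeds or raises MalformedInput."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    cases = [valid[:i] for i in range(len(valid))]  # every truncation
    cases.append(bytes(len(valid)))                 # a file of all zeros
    for _ in range(rounds):
        mutated = bytearray(valid)
        mutated[rng.randrange(len(mutated))] ^= 1 << rng.randrange(8)  # flip one bit
        cases.append(bytes(mutated))
        cases.append(bytes(rng.randrange(256) for _ in range(rng.randrange(32))))  # pure noise
    for case in cases:
        try:
            parser(case)
        except MalformedInput:
            pass          # controlled rejection is fine
        except Exception:
            return False  # anything else is the bug we are hunting
    return True


if __name__ == "__main__":
    valid = struct.pack("<H", 2) + b"\x03abc" + b"\x01z"
    print(parse_entries(valid), survives_fuzzing(parse_entries, valid))
```

In Python an unchecked read just raises an exception; in a kernel driver the equivalent mistake is an out-of-bounds memory access and a crash, which is why this sort of test belongs in the build pipeline long before a file ships.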
 
Why do they allow their OS to be crippled by a defective driver?
It's a universal problem, not limited to Microsoft. A bad kernel module will kill a Linux install as well. A driver, pretty much by definition, has to run with kernel level privileges; and at that level, a mistake in the code cannot be trapped - it's going to bring the system down.

Some things that are currently kernel modules can be moved to userspace - but some things cannot. (And doing so does bring certain tradeoffs - for example, GPU drivers can be in userspace, but there is a performance hit in doing so. Given how complex GPU drivers are these days, that's a worthwhile tradeoff IMO; but it is a tradeoff.)

CrowdStrike made the choice - rightly or wrongly - to implement their code as a kernel level driver. Their code caused these crashes. Ergo, CrowdStrike is wholly to blame for this. Microsoft might be able to implement improvements that allow more stuff that's currently in the kernel to move into userspace - but that's a separate issue.

If you write stuff that runs with kernel-level privileges, it is on you to make sure your code is robust. There is only so much that the OS vendor can do to limit the damage in that scenario.
 
Minimizing the amount of kernel-mode code is a well-known security best practice, but evidently not at CrowdStrike.
It is quite frightening that we don't know just how badly written these systems are until a major breakage occurs. But it's unfortunately also not surprising in the least.
 

Nexus

Ars Tribunus Angusticlavius
9,398
If you can get into recovery/safe mode command prompt and into your C drive:

Code:
C:
cd .\Windows\System32\Drivers\CrowdStrike
del C-00000291*.sys
Great, unless you have company-required BitLocker or WinMagic or other drive encryption across all your servers and have to put in the 64-character security key across 500+ servers.
 

Decoherent

Ars Tribunus Angusticlavius
7,805
Subscriptor++
Turns out you can bypass needing the recovery key by going into Windows RE, skipping the BitLocker prompts, and then using bcdedit to turn on safe boot and letting it reboot. You will boot into safe mode, where you can log in with a local admin account and delete the file.
No. This is wrong. Nation-states and gigantic tech companies use BitLocker, and it would never, ever have seen the light of day if there was such a trivial work-around.
 

alansh42

Ars Praefectus
3,672
Subscriptor++
No. This is wrong. Nation-states and gigantic tech companies use BitLocker, and it would never, ever have seen the light of day if there was such a trivial work-around.
This still requires authentication with a local administrator account. The goal of BitLocker is to deny access to someone without credentials, for example someone attaching the drive to another system or using a boot thumb drive.
 
Who the eff allows automatic updates on live production systems?
Lots of sysadmins need to be fired.
Speaking as a former sysadmin - you’re grossly ignorant.

Software updates? Sure, put them through testing; but some updates (major security fixes) have to be expedited.

But this wasn’t a software update. It was a definitions file update. That’s a very low risk, and it needs to be done quickly.

The fact that the data was malformed? If the driver had been properly written, it would have been rejected. But the driver is buggy, so it crashed.

No sysadmin could have expected or reasonably defended against this. You have to assume a basic level of competence from the vendor, and that assumption is what broke. And when the software does the update automatically, with no option to delay, the responsibility falls squarely upon the vendor. This was the case here.

You’ve obviously never had responsibility for a decent number of production servers. I have. And there’s no way that I’d blame the sysadmins for this.
 

Deleted member 388703

Guest
Great, unless you have company-required BitLocker or WinMagic or other drive encryption across all your servers and have to put in the 64-character security key across 500+ servers.
Yeah. Luckily, we had Active Directory able to recover most keys for those with permissions to them.

Extra fun if the PC has multiple stored BitLocker keys for us to hand-type.
 

Chuckstar

Ars Legatus Legionis
37,478
Subscriptor
It makes me wonder how long their faulty parser code has been in their kernel driver. It sounds like it was a time bomb just waiting to go off. It also makes me wonder how CrowdStrike can be trusted given that they couldn't even write a hardened parser designed to run in kernel mode. Did they not write any unit tests to verify that the parser wouldn't fall over when faced with malformed input? This is Computer Science 101 stuff, and CrowdStrike failed the course.
I keep going back and forth in my mind on this. I think ultimately my feelings on the matter would depend on what the QA process actually was. On the one hand, best-in-class QA for drivers even theoretically cannot 100% ensure against kernel panics. On the other hand, shitty QA almost guarantees kernel panics. The existence of this kernel panic cannot by itself tell us whether it was a fluke or inevitable.

There certainly does seem to be a problem with their testing of the definitions files, though, since one would imagine anti-virus companies testing definitions files before rolling them out. Did they have no system in place to identify that new definitions files were crashing machines?

EDIT: If the kernel panics only occurred with some relatively low-percentage type of configuration, I might have some sympathy for CrowdStrike, as you can only do so much testing on definitions before rolling them out, especially given the time-sensitive nature. But it sounds like every Windows machine with that driver version crashes on these definitions. How could that not get picked up in testing?
 
This underscores what a terrifying responsibility it is to push out updates. I'm basically shaking when we push out updates to our product, especially because iOS/Android deployments are essentially impossible to debug. At least on desktop, we can get people to go delete a file. We can't even do that on mobile. We rely on a witches' brew of safe modes.

I can't tell if CrowdStrike were sloppy in their testing. But in all likelihood, they just tested on systems that were a little too perfectly configured, and when it hit the real world, it exploded. And maybe their rollout wasn't tiered enough.

My sympathies. Having your code be a core driver on many of the world's systems is as awesome as that responsibility can get.
Second this. Anyone who's ever worked on a kernel mode driver sympathizes with these folks, but at the end of the day that's what they get paid for.
 