CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

Rosyna · Jul 24, 2024

cbreak said:
Well, obviously it didn't check correctly, because the debugger clearly shows that a pointer in the null page is attempted to be dereferenced. Either that, or the debugger is lying.

The crash dump shows that it was trying to jump to an invalid pointer address and many people incorrectly assumed, because the address was so low, that it was an offset to a NULL pointer. However, CSAgent.sys clearly checks for NULL pointers before jumping to offsets.

Which means the small value that looked like an offset was loaded directly from something else, like a serialized instance.

Rosyna · Jul 24, 2024

cbreak said:
Can a normal user really kill system level / other user's processes on windows? That OS is even more garbage than I thought...

You really don’t want to know the answer to this…

A. Windows has a hardcoded list of executable names you’re never allowed to kill.

B. In a completely separate part of the code, Windows makes sure only Microsoft executables can have those names on disk and launch.

So Windows has no way to identify if a process is special based on metadata or data inside the process/executable, it has to look at metadata in the file system to do A.

steelcobra · Jul 24, 2024

cbreak said:
It was absolutely a choice. And they chose poorly. Try reading next time.

Maybe you should read better, they were forced into it as part of the EU anti-trust agreements.

steelcobra · Jul 24, 2024

granolagumbo said:
Please pardon my complete ignorance.... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission critical machines before allowing the update to be applied to the entire organization? We used to do that with any of the Windows updates and had them cause problems on a few occasions.

Officially? Yes, they had deployment rings the customer could set so updates like this could be tested and staged into larger parts of an org.

In reality, they hid that CS could flag an update to just ignore that and deploy to all systems anyways.

Glorified Desktop Support · Jul 24, 2024

As far as lists of things that were done wrong go, am I the only one to have a "no system updates on a Friday unless it's an emergency" rule? Who wants to stress out on a Friday, let alone work all weekend because something went wrong?

I like Tuesday mornings for updates and changes. Monday is over and you had it to prepare. Everybody is back in the swing of work. If something goes wrong, you and the team have all day to work on it and up to the end of the week before ruining their weekend.

cbreak · Jul 24, 2024

Rosyna said:
The crash dump shows that it was trying to jump to an invalid pointer address and many people incorrectly assumed, because the address was so low, that it was an offset to a NULL pointer. However, CSAgent.sys clearly checks for NULL pointers before jumping to offsets.

Which means the small value that looked like an offset was loaded directly from something else, like a serialized instance.

... or by computing an offset from null... and checking afterwards.

Regardless on how that value was arrived at, I'd classify anything that tries to dereference anything in the null page a null pointer dereferencing error.

cbreak · Jul 24, 2024

steelcobra said:
Maybe you should read better, they were forced into it as part of the EU anti-trust agreements.

No.

They were forced to not abuse their position as OS manufacturer unfairly. They could have just provided an API AND USED IT THEMSELVES TOO. That way everyone would have the same access.

But no, they didn't.

And crowdstrike isn't even bound by any agreement MS made, they could have created such an API too with a small kernel module, and used it from userland to more safely handle untrusted data.

Power_Struggle · Jul 24, 2024

EU says that MS can’t block kernel access if MS’s own Defender still gets kernel access.
They want a level playing field between MS’s and 3-rd party AV.

If MS would make APIs and have defender use those, all would be fine.

It takes two to tango…

Rosyna · Jul 24, 2024

steelcobra said:
Maybe you should read better, they were forced into it as part of the EU anti-trust agreements.

No, Microsoft was forced to give third-party AV the same access as Defender*. In response, Microsoft created a new kernel-level API (which has callbacks kernel drivers register for) so EDR/AV vendors no longer had to patch kernel structures.

There was absolutely zero mandate for Microsoft to vend those callbacks inside the kernel to kernel drivers. Microsoft could have made user-space callbacks and had Defender use those, but that goes against Microsoft’s philosophy of not having a security boundary between admin and kernel.

*It wasn’t called Defender at the time, but Defender is the current name.

cbreak · Jul 24, 2024

Rosyna said:
You really don’t want to know the answer to this…

A. Windows has a hardcoded list of executable names you’re never allowed to kill.

B. In a completely separate part of the code, Windows makes sure only Microsoft executables can have those names on disk and launch.

So Windows has no way to identify if a process is special based on metadata or data inside the process/executable, it has to look at metadata in the file system to do A.

Interesting. So if users can just manipulate the processes of other users, what's even the point of logging in with different users?

steveftoth · Jul 24, 2024

SplatMan_DK said:
Some analysts online have shown debugging data from crash dumps and minimal reverse engineering. By their account it's a null reference to a pointer in a system driver. That's something unit testing should have easily caught ... if used.

So here is what we know.

Trivial error in the software, running as a system driver.

Insufficient testing.

Insufficient control over large scale rollouts.

Not previously sharing release notes with customers.

Not previously allowing customers to control timing of rollouts.

Not previously allowing customers to use automated staged rollouts.

As someone working with governance in Enterprise IT, I am astonished they got this big without their customers challenging these things.

It's truly a WTF moment for the industry.

TBH I'm not surprised because the whole reason most companies bought their software was because the client didn't want to worry about any of these things. You can bet that the CIOs of these client companies were willing to break their own systems rather than get IN WRITING that CrowdStrike was actually doing best practices.

Software is basically unregulated and this failure is just a taste of how fragile our ecosystem really is.

Rosyna · Jul 24, 2024

cbreak said:
Interesting. So if users can just manipulate the processes of other users, what's even the point of logging in with different users?

Because Microsoft doesn’t believe there should be a security boundary between admin and kernel, part of Microsoft is trying to convince people to only log in as non-admin* users. However, Microsoft simultaneously says the majority of users log in as admin users.

*Microsoft does believe there should be a security boundary between different low privileged users, but not between processes running as the same user.

Microsoft Security Boundaries

ranthog · Jul 24, 2024

steveftoth said:
TBH I'm not surprised because the whole reason most companies bought their software was because the client didn't want to worry about any of these things. You can bet that the CIOs of these client companies were willing to break their own systems rather than get IN WRITING that CrowdStrike was actually doing best practices.

Software is basically unregulated and this failure is just a taste of how fragile our ecosystem really is.

The question is if it is gross negligence. In most jurisdictions, gross negligence can't be shifted by contract. The fact they never ran their update before pushing it live is quite frankly amazing.

I think that it isn't too hard to argue that not running an update on any computers is gross negligence. This is such a basic step in testing that it is how students test their very first programs for programming homework, where they just run it and see if it crashes.

It is just that fundamental. Nor is it a step that I'd ever willingly skip in development, even after my software has passed automated testing. I always want to test the deployment and verify that it actually works, and I've caught errors in doing that.

While software is not heavily regulated, there are best practices and standards for this stuff, just as much as there are in any other engineering field.

Edit: You could use a very simple test setup VM's that get the update first. You verify the machines can reboot. Then you roll back the update and reboot. Then you verify that it is working on the right version after reboot

You would probably want some diversity to this, including some VM's on cloud providers, but this would have been simple and cheap to implement, before the staged roll-out hits customers. I'd guess that setting up these very basic automated deployment checks wouldn't even take that much time or money.

MrTom · Jul 24, 2024

This software is installed on 8.5M machines? Now let's get a calculator and see just how much of a subscription profit they make from this every month.

SportivoA · Jul 24, 2024

GrumpyExSpaceDude said:
This very interesting post goes into the terms of service and basically concludes that "we told you not to use this software on critical systems and if you did, it's on you"

"THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."

https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html

There's a huge difference between "don't hook this up to something that keeps a person alive or a turbine from exploding" and "day-to-day business can't happen if 100% of our customer-facing employees can't boot their device". If 1% or even 5% of the computers in hospitals and airports stopped working, it'd hurt, but wouldn't be an impossible-to-mitigate disaster. That's unacceptable for "hazardous" and "fail-safe" type applications in general. When none of the customer-facing employees can do their job duties to manage the customers in their location because an uncontrolled, improperly, and incompletely checked update took out EVERYTHING in the building it's different. And then you do get an impossible-to-mitigate disaster because your software vendor screwed up and never gave you control of limiting their screw-up's effect on your systems!

cbreak · Jul 24, 2024

Rosyna said:
Because Microsoft doesn’t believe there should be a security boundary between admin and kernel, part of Microsoft is trying to convince people to only log in as non-admin* users. However, Microsoft simultaneously says the majority of users log in as admin users.

*Microsoft does believe there should be a security boundary between different low privileged users, but not between processes running as the same user.

Microsoft Security Boundaries

So presumably crowdstrike would run as some system user, and not as the logged-in user. Who therefore should not have the permissions to kill it. Unless the logged-in user is an admin, who could presumably also unload kernel modules.

markgo · Jul 24, 2024

cbreak said:
Well, obviously it didn't check correctly, because the debugger clearly shows that a pointer in the null page is attempted to be dereferenced. Either that, or the debugger is lying.

One of the blogs said it was a NULL + offset dereference, so the null check wouldn’t work. But it would still crash as all those low-value addresses are illegal.

ranthog · Jul 24, 2024

markgo said:
One of the blogs said it was a NULL + offset dereference, so the null check wouldn’t work. But it would still crash as all those low-value addresses are illegal.

This is also why you usually at least do a test deployment to run the program and do some basic sanity checks, even when your automated tests came back green.

johnd_2 · Jul 24, 2024

Still struggling with the concept that Crowdstrike has aspects designed to get around the security built into kernel-level code deployment.

Deleted member 330960 · Jul 24, 2024

cbreak said:
Well, obviously it didn't check correctly, because the debugger clearly shows that a pointer in the null page is attempted to be dereferenced. Either that, or the debugger is lying.

Rosyna is correct. It was an OOB memory read.
https://x.com/patrickwardle/status/1816051422716203416

I'm still shocked that there is no end to end testing. I guess when you think you are hot shit deploying several times a day, ain't nobody got time for that.

Cyphase · Jul 24, 2024

starglider said:
...

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.

It's a bit worse than that. This wasn't an update to 8.5 million machines. That's just the number that downloaded the update in question in the 78-minute window that it was live, and experienced the error. Does anyone know how many endpoints are running the sensor client?

philipjohnstephens · Jul 24, 2024

Ummm, shouldn't they be testing the KERNEL DRIVER to ensure that it correctly rejects malformed content? It's simply not good enough to have a "Content Validator" that is intended to spare the kernel driver from being exposed to malformed content that would cause it to crash. Maybe this validator is based on the same code as used in their driver, but if so, I hope they are planning on FIXING the driver and not just the external validator.

NetMage · Jul 24, 2024

Glorified Desktop Support said:
As far as lists of things that were done wrong go, am I the only one to have a "no system updates on a Friday unless it's an emergency" rule? Who wants to stress out on a Friday, let alone work all weekend because something went wrong?

People that don't want to open a 48 hour window for ransomware to run on their systems every weekend.

NetMage · Jul 24, 2024

MrTom said:
This software is installed on 8.5M machines?

No, that's how many downloaded the update in the 1.5 hours it was published. (Yes, CrowdStrike is slow.) We had about 20% of PCs and 3% of servers affected, so much more than 8.5M machines run CrowdStrike.

Ganz · Jul 24, 2024

steelcobra said:
The option was in the standard deployment management console to have staged deployments of all patches.

Crowdstrike hid that they could flag a patch to ignore that and deploy to all anyways.

No, this was a so-called "Template" or "Rapid Response" update - a type of update that does not honor patch update controls. Customers did not have control over when their endpoints picked up this update. Their response includes an enhancement to introduce such controls in the future.

Ganz · Jul 24, 2024

Great_Scott said:
I find the Crowdstrike admission odd.

How can there be testing bugs when they aren't doing any testing?

They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.

You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.

The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what Crowdstrike were missing with their so-called "Rapid Response" updates, and what they pledge to do in the future.

davidsco27 · Jul 24, 2024

How about they blame their incompetence?

ranthog · Jul 24, 2024

Ganz said:
They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.

You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.

The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what Crowdstrike were missing with their so-called "Rapid Response" updates, and what they pledge to do in the future.

Yes, it is impossible to prove you can cover all possible case. However, this was not some weird edge condition. This was something that most every reasonable test case was going to demonstrate the flaw. Even if they only set up 10 test VM's to automate this on with the most common configurations they'd have likely replicated the flaw in test.

The problem in the testing is that they don't even run the actual update before deploying. It appears they just test on millions of customers PC's instead of doing that internally.

Things like sending out updates in waves is what saves you from weird edge cases and odd configurations. This should have never gotten out of the test environment, as it should have caused all of their test beds to get caught in looping blue screen loop. Automated testing could have easily caught this problem in a few minutes, and would not add significant delay to the update.

ranthog · Jul 24, 2024

davidsco27 said:
How about they blame their incompetence?

I'd blame gross negligence myself.

sarusa · Jul 24, 2024

This is bullcrap. You can't blame the faulty patch on the automated testing. Automated testing is supposed to be an additional line of defense, not your only check of 'okay, it works'. What happened to any sort of functional testing on any machines before the dev pushed it live?

I have heard terrible things about Crowdstrike's engineers and management, including about their wild cowboy deployment (push live from your dev machine after you compile, Compiles On My Machine

shrug shrug shrug) apparently all true.

Nilt · Jul 24, 2024

granolagumbo said:
Please pardon my complete ignorance.... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission critical machines before allowing the update to be applied to the entire organization? We used to do that with any of the Windows updates and had them cause problems on a few occasions.

The real problem is even in the instances where this sort of configuration was set up, either this sort of update bypasses that entirely or a bug allowed that to happen. I've seen conflicting reports both ways and to my knowledge, this wasn't explicitly covered by CrowdStrike yet, either. The fact that they have admitted to not following other pretty standard update-pushing test practices leads me to think it's the former but that's speculation at this stage.

henryhbk · Jul 24, 2024

fuzzyfuzzyfungus said:
The aspect that seems particularly alarming(and which they are not talking about) is that the kernel driver component was apparently willing to accept a malformed update on the basis of nothing but a header and then keel over and die.

Certainly having an actual testing process would be nice; but (especially when the whole point of your software is that there might be adversarial activity on the system) it seems like a deep and fundamental problem that such a high-privilege/high-criticality component is so brittle against malformed input.

Even the most impeccable testing can only assure you that your inputs won't cause your driver to misbehave; they can't assure you that you will always remain in control of which inputs your driver ends up chewing on.

Seriously:
try:
read file
except:
Read last file that worked, throw alert to someone

sparselogic · Jul 24, 2024

gruberduber said:
Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?

If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another c-suite job at a different company, and do it again. Repeat until you retire.

Look at the resumes of half these CEOs etc... and it's a trail of failure. But it's never them that suffers for it. In any sane world the consequence of failure should be higher if you get paid millions because you're supposed to be so special and important.

How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.

I mistakenly thought The World held adults responsible for their screwups. The amount of people I’ve encountered in professional roles who just clown their way through life is… both startling and reassuring.

(And take a look at the amount of one-man trade/construction outfits who leave a mess everywhere they go and still have customers)

henryhbk · Jul 24, 2024

DistinctivelyCanuck said:
I worked at a large "northern European" HQ'ed telecoms company: (not saying which one...)

In several of their divisions one of their big pushes is for every single test/QA person to be capable of writing test automation code, and if you're not capable of writing test automation code, you're on the layoff list. (or already gone)

Because "manual" QA is too slow.

The problem is: (I hope to hell all of us realize this) some of the best QA and test people I've ever worked with couldn't write a line of code to literally save their lives (or jobs) but can find bugs, can describe them and can advocate for getting them fixed, and can find stuff that the worlds best automation would never ever find.

And those were the people getting turfed. Despite creativity in finding problems, being effective advocates for customer-facing issues. "Can't write an automated test case? buh-bye"

This is for software that runs the complex networks of the world: where a mis-applied CICD pipelined blob of code will knock major infrastructure offline. (Just ask Rogers in Canada...)
You want eyeballs and a brain on some aspects of that test cycle...

I worked at Microsoft in the 80s on the multi plan team, and we had a great tester team, and we had what the time was pretty good test automation (dos batch files mostly) but one day the tester runs something and says ”huh, I seem to remember that recall being faster last release” which was true since the x87 flag got set to off so we were emulating the math coprocessor in the library. No automation suite is figuring that out as everything worked, but you had to remember from last year that tests were faster.

vinyl1 · Jul 24, 2024

As I understand it, the channel files contained a bunch of named pipe address locations that pointed the kernel driver to places to look for suspicious activity. Unfortunately, the last set of addresses was all zeroes. Surely, any sort of validator should have noticed and rejected that one.

henryhbk · Jul 24, 2024

Maltz said:
Pilots are actually trained about "accident chains" and to keep them in mind when making decisions to try to break such chains. Any really big mistake is just the culmination of a trail of little ones. Details matter.

Well 0f course they’re in the front of the fast moving tube, so unlike the CEO they have incentives

Putrid Polecat · Jul 24, 2024

1000 MBAs, tech bros, and marketing droids: "See that devops engineer over there, uh, Gary? It was him. We've since executed Gary and the problem has been addressed. Your computers are once again safe with Crowdstrike."

Chuckstar · Jul 25, 2024

ranthog said:
Yes, it is impossible to prove you can cover all possible case. However, this was not some weird edge condition. This was something that most every reasonable test case was going to demonstrate the flaw. Even if they only set up 10 test VM's to automate this on with the most common configurations they'd have likely replicated the flaw in test.

The problem in the testing is that they don't even run the actual update before deploying. It appears they just test on millions of customers PC's instead of doing that internally.

Things like sending out updates in waves is what saves you from weird edge cases and odd configurations. This should have never gotten out of the test environment, as it should have caused all of their test beds to get caught in looping blue screen loop. Automated testing could have easily caught this problem in a few minutes, and would not add significant delay to the update.

I look at this like not having a test of Hubble’s mirror independent of the grinding rig. You’d have to have a pretty stupid error to need such a test.

It’s the classic “if we do it right, the test will be unnecessary”. But that’s true of most tests, so why not just do the easy test to make sure?

ranthog · Jul 25, 2024

Chuckstar said:
I look at this like not having a test of Hubble’s mirror independent of the grinding rig. You’d have to have a pretty stupid error to need such a test.

It’s the classic “if we do it right, the test will be unnecessary”. But that’s true of most tests, so why not just do the easy test to make sure?

With that, making sure a mirror grinding rig works correctly is much easier.

Ganz · Jul 25, 2024

ranthog said:
Yes, it is impossible to prove you can cover all possible case. However, this was not some weird edge condition. This was something that most every reasonable test case was going to demonstrate the flaw.

It's not always impossible. It's often infeasible (read: not as profitable). But anyway ...

I'm definitely not in a position to roll my eyes and say that it's obvious that tests should have caught this - I have no idea how their system works. You appear to know for sure, which is great - they should fix their dumb tests. I'm down.

All I personally can roll my eyes at is their "everything, everywhere, all at once" deployment model for these Template updates, and fixing that would certainly have caught this problem before it got to paying customers.

This is not an either/or thing, so your assertion that they should fix their obviously stupid tests does not nullify my assertion that their deployment model is bonkers crazy-go-nuts. Because it's bonkers crazy-go-nuts.

CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Ars Praefectus

Ars Praefectus

Ars Praetorian

Ars Tribunus Angusticlavius

Ars Praefectus

Ars Scholae Palatinae

Ars Tribunus Angusticlavius

Ars Legatus Legionis

Ars Tribunus Militum

Ars Tribunus Militum

Ars Praefectus

Ars Praefectus

Ars Legatus Legionis

Ars Centurion

Deleted member 330960

Guest

Seniorius Lurkius

Wise, Aged Ars Veteran

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Ars Scholae Palatinae

Ars Praetorian

Ars Legatus Legionis

Ars Legatus Legionis

Ars Praefectus

Ars Legatus Legionis

Ars Tribunus Militum

Seniorius Lurkius

Ars Tribunus Militum

Ars Praetorian

Ars Tribunus Militum

Ars Tribunus Militum

Ars Legatus Legionis

Ars Legatus Legionis

Ars Scholae Palatinae