CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

Rosyna

Ars Tribunus Angusticlavius
6,966
Well, obviously it didn't check correctly, because the debugger clearly shows an attempt to dereference a pointer in the null page. Either that, or the debugger is lying.
The crash dump shows that it was trying to jump to an invalid pointer address and many people incorrectly assumed, because the address was so low, that it was an offset to a NULL pointer. However, CSAgent.sys clearly checks for NULL pointers before jumping to offsets.

Which means the small value that looked like an offset was loaded directly from something else, like a serialized instance.
 
Upvote
13 (13 / 0)

Rosyna

Ars Tribunus Angusticlavius
6,966
Can a normal user really kill system-level / other users' processes on Windows? That OS is even more garbage than I thought...
You really don’t want to know the answer to this…

A. Windows has a hardcoded list of executable names you’re never allowed to kill.

B. In a completely separate part of the code, Windows makes sure only Microsoft executables can have those names on disk and launch.

So Windows has no way to identify if a process is special based on metadata or data inside the process/executable, it has to look at metadata in the file system to do A.
 
Upvote
10 (10 / 0)

steelcobra

Ars Tribunus Angusticlavius
9,775
Please pardon my complete ignorance.... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission critical machines before allowing the update to be applied to the entire organization? We used to do that with any of the Windows updates and had them cause problems on a few occasions.
Officially? Yes, they had deployment rings the customer could set so updates like this could be tested and staged into larger parts of an org.

In reality, they hid the fact that CrowdStrike could flag an update to ignore those rings and deploy to all systems anyway.
 
Upvote
16 (17 / -1)
As far as lists of things that were done wrong go, am I the only one to have a "no system updates on a Friday unless it's an emergency" rule? Who wants to stress out on a Friday, let alone work all weekend because something went wrong?

I like Tuesday mornings for updates and changes. Monday is over, and you had it to prepare. Everybody is back in the swing of work. If something goes wrong, you and the team have all day to work on it, and up to the end of the week before anyone's weekend is ruined.
 
Upvote
2 (4 / -2)

cbreak

Ars Praefectus
5,922
Subscriptor++
The crash dump shows that it was trying to jump to an invalid pointer address and many people incorrectly assumed, because the address was so low, that it was an offset to a NULL pointer. However, CSAgent.sys clearly checks for NULL pointers before jumping to offsets.

Which means the small value that looked like an offset was loaded directly from something else, like a serialized instance.
... or by computing an offset from null... and checking afterwards.

Regardless of how that value was arrived at, I'd classify anything that tries to dereference an address in the null page as a null pointer dereference error.
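The distinction the two posts are circling can be sketched in a few lines of C. This is purely illustrative, not CrowdStrike's actual code; the function name and the 0x1000 first-page size are assumptions for the sketch:

```c
#include <stdint.h>
#include <stddef.h>

/* Why a NULL check on the base pointer alone doesn't protect
   base + offset when the offset comes from untrusted data. */
int is_safe_target(const uint8_t *base, uint32_t offset) {
    if (base == NULL)                    /* the classic NULL check */
        return 0;
    uintptr_t target = (uintptr_t)base + offset;
    /* A small bogus base or offset can still land in the null page,
       so the computed address is what actually needs checking. */
    return target >= 0x1000;
}
```

In other words, the check has to happen after the arithmetic, on the final address, or a tiny serialized value can still send execution into the first page.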
 
Upvote
3 (4 / -1)

cbreak

Ars Praefectus
5,922
Subscriptor++
Maybe you should read better, they were forced into it as part of the EU anti-trust agreements.
No.

They were forced to not abuse their position as OS manufacturer unfairly. They could have just provided an API AND USED IT THEMSELVES TOO. That way everyone would have the same access.

But no, they didn't.

And CrowdStrike isn't even bound by any agreement MS made; they could have created such an API too with a small kernel module, and used it from userland to more safely handle untrusted data.
 
Upvote
10 (14 / -4)

Rosyna

Ars Tribunus Angusticlavius
6,966
Maybe you should read better, they were forced into it as part of the EU anti-trust agreements.
No, Microsoft was forced to give third-party AV the same access as Defender*. In response, Microsoft created a new kernel-level API (which has callbacks kernel drivers register for) so EDR/AV vendors no longer had to patch kernel structures.

There was absolutely zero mandate for Microsoft to vend those callbacks inside the kernel to kernel drivers. Microsoft could have made user-space callbacks and had Defender use those, but that goes against Microsoft’s philosophy of not having a security boundary between admin and kernel.

*It wasn’t called Defender at the time, but Defender is the current name.
 
Upvote
7 (11 / -4)

cbreak

Ars Praefectus
5,922
Subscriptor++
You really don’t want to know the answer to this…

A. Windows has a hardcoded list of executable names you’re never allowed to kill.

B. In a completely separate part of the code, Windows makes sure only Microsoft executables can have those names on disk and launch.

So Windows has no way to identify if a process is special based on metadata or data inside the process/executable, it has to look at metadata in the file system to do A.
Interesting. So if users can just manipulate the processes of other users, what's even the point of logging in with different users?
 
Upvote
-6 (1 / -7)

steveftoth

Ars Scholae Palatinae
1,182
Some analysts online have shown debugging data from crash dumps and minimal reverse engineering. By their account it's a null pointer dereference in a system driver. That's something unit testing should have easily caught ... if used.

So here is what we know.

  • Trivial error in the software, running as a system driver.
  • Insufficient testing.
  • Insufficient control over large scale rollouts.
  • Not previously sharing release notes with customers.
  • Not previously allowing customers to control timing of rollouts.
  • Not previously allowing customers to use automated staged rollouts.

As someone working with governance in Enterprise IT, I am astonished they got this big without their customers challenging these things.

It's truly a WTF moment for the industry.
TBH I'm not surprised, because the whole reason most companies bought their software was that the client didn't want to worry about any of these things. You can bet that the CIOs of these client companies were more willing to risk breaking their own systems than to get IN WRITING that CrowdStrike was actually following best practices.

Software is basically unregulated and this failure is just a taste of how fragile our ecosystem really is.
 
Upvote
7 (7 / 0)

Rosyna

Ars Tribunus Angusticlavius
6,966
Interesting. So if users can just manipulate the processes of other users, what's even the point of logging in with different users?
Because Microsoft doesn’t believe there should be a security boundary between admin and kernel, part of Microsoft is trying to convince people to only log in as non-admin* users. However, Microsoft simultaneously says the majority of users log in as admin users.

*Microsoft does believe there should be a security boundary between different low privileged users, but not between processes running as the same user.

Microsoft Security Boundaries
 
Upvote
11 (11 / 0)

ranthog

Ars Legatus Legionis
15,240
TBH I'm not surprised because the whole reason most companies bought their software was because the client didn't want to worry about any of these things. You can bet that the CIOs of these client companies were willing to break their own systems rather than get IN WRITING that CrowdStrike was actually doing best practices.

Software is basically unregulated and this failure is just a taste of how fragile our ecosystem really is.
The question is whether it rises to gross negligence. In most jurisdictions, liability for gross negligence can't be shifted by contract. The fact that they never ran their update before pushing it live is, quite frankly, amazing.

I think that it isn't too hard to argue that not running an update on any computers is gross negligence. This is such a basic step in testing that it is how students test their very first programs for programming homework, where they just run it and see if it crashes.

It is just that fundamental. Nor is it a step that I'd ever willingly skip in development, even after my software has passed automated testing. I always want to test the deployment and verify that it actually works, and I've caught errors in doing that.

While software is not heavily regulated, there are best practices and standards for this stuff, just as much as there are in any other engineering field.

Edit: You could use a very simple test setup: VMs that get the update first. You verify the machines can reboot. Then you roll back the update and reboot. Then you verify that the machines are running the right version after the rollback.

You would probably want some diversity in this, including some VMs on cloud providers, but it would have been simple and cheap to implement before the staged roll-out hits customers. I'd guess that setting up these very basic automated deployment checks wouldn't take much time or money.
 
Last edited:
Upvote
7 (7 / 0)

SportivoA

Ars Tribunus Militum
1,529
This very interesting post goes into the terms of service and basically concludes that "we told you not to use this software on critical systems and if you did, it's on you"

"THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."

https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html
There's a huge difference between "don't hook this up to something that keeps a person alive or a turbine from exploding" and "day-to-day business can't happen if 100% of our customer-facing employees can't boot their devices". If 1% or even 5% of the computers in hospitals and airports stopped working, it would hurt, but it wouldn't be an impossible-to-mitigate disaster (though even that would be unacceptable for "hazardous" and "fail-safe" type applications). When none of the customer-facing employees can do their jobs because an uncontrolled, improperly and incompletely checked update took out EVERYTHING in the building, it's different. Then you do get an impossible-to-mitigate disaster, because your software vendor screwed up and never gave you any control to limit the blast radius of their screw-up on your systems!
 
Upvote
10 (10 / 0)

cbreak

Ars Praefectus
5,922
Subscriptor++
Because Microsoft doesn’t believe there should be a security boundary between admin and kernel, part of Microsoft is trying to convince people to only log in as non-admin* users. However, Microsoft simultaneously says the majority of users log in as admin users.

*Microsoft does believe there should be a security boundary between different low privileged users, but not between processes running as the same user.

Microsoft Security Boundaries
So presumably crowdstrike would run as some system user, and not as the logged-in user. Who therefore should not have the permissions to kill it. Unless the logged-in user is an admin, who could presumably also unload kernel modules.
 
Upvote
3 (3 / 0)

markgo

Ars Praefectus
3,776
Subscriptor++
Well, obviously it didn't check correctly, because the debugger clearly shows an attempt to dereference a pointer in the null page. Either that, or the debugger is lying.
One of the blogs said it was a NULL + offset dereference, so the null check wouldn't catch it. But it would still crash, as all those low-value addresses are illegal.
 
Upvote
1 (1 / 0)

ranthog

Ars Legatus Legionis
15,240
One of the blogs said it was a NULL + offset dereference, so the null check wouldn’t work. But it would still crash as all those low-value addresses are illegal.
This is also why you usually at least do a test deployment to run the program and do some basic sanity checks, even when your automated tests came back green.
 
Upvote
7 (7 / 0)
D

Deleted member 330960

Guest
Well, obviously it didn't check correctly, because the debugger clearly shows an attempt to dereference a pointer in the null page. Either that, or the debugger is lying.
Rosyna is correct. It was an OOB memory read.
https://x.com/patrickwardle/status/1816051422716203416

I'm still shocked that there is no end-to-end testing. I guess when you think you are hot shit deploying several times a day, ain't nobody got time for that.
 
Upvote
5 (5 / 0)

Cyphase

Seniorius Lurkius
12
...

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
It's a bit worse than that. This wasn't an update to 8.5 million machines. That's just the number that downloaded the update in question in the 78-minute window that it was live, and experienced the error. Does anyone know how many endpoints are running the sensor client?
 
Upvote
12 (12 / 0)
Ummm, shouldn't they be testing the KERNEL DRIVER to ensure that it correctly rejects malformed content? It's simply not good enough to have a "Content Validator" that is intended to spare the kernel driver from being exposed to malformed content that would cause it to crash. Maybe this validator is based on the same code as used in their driver, but if so, I hope they are planning on FIXING the driver and not just the external validator.
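A driver with that property would have to treat every field it reads from the content file as hostile. This hypothetical C sketch shows the shape of such a check; the header layout, magic value, and entry size are invented for illustration and are not CrowdStrike's actual file format:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative channel-file header: every length/count field that
   comes from the file is untrusted until proven in-bounds. */
typedef struct {
    uint32_t magic;
    uint32_t entry_count;     /* untrusted: read from the file */
} file_header_t;

#define EXPECTED_MAGIC 0xAA44u /* invented value for this sketch */
#define ENTRY_SIZE     16u

/* Returns 1 only if every entry the header claims actually fits
   inside the buffer we were handed. */
int validate_channel_file(const uint8_t *buf, size_t len) {
    if (buf == NULL || len < sizeof(file_header_t))
        return 0;
    const file_header_t *hdr = (const file_header_t *)buf;
    if (hdr->magic != EXPECTED_MAGIC)
        return 0;
    /* 64-bit math so a huge entry_count can't overflow the check. */
    uint64_t needed = sizeof(file_header_t)
                    + (uint64_t)hdr->entry_count * ENTRY_SIZE;
    return needed <= len;
}
```

The point is that this validation belongs in the driver itself, on every load, so that a separate "Content Validator" running at build time is a second line of defense rather than the only one.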
 
Upvote
15 (15 / 0)

NetMage

Ars Tribunus Angusticlavius
9,741
Subscriptor
As far as lists of things that were done wrong go, am I the only one to have a "no system updates on a Friday unless it's an emergency" rule? Who wants to stress out on a Friday, let alone work all weekend because something went wrong?
People that don't want to open a 48 hour window for ransomware to run on their systems every weekend.
 
Upvote
14 (14 / 0)

Ganz

Ars Scholae Palatinae
757
The option was in the standard deployment management console to have staged deployments of all patches.

Crowdstrike hid that they could flag a patch to ignore that and deploy to all anyway.
No, this was a so-called "Template" or "Rapid Response" update - a type of update that does not honor patch update controls. Customers did not have control over when their endpoints picked up this update. Their response includes an enhancement to introduce such controls in the future.
 
Upvote
10 (10 / 0)

Ganz

Ars Scholae Palatinae
757
I find the Crowdstrike admission odd.

How can there be testing bugs when they aren't doing any testing?
They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.

You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.

The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what Crowdstrike were missing with their so-called "Rapid Response" updates, and what they pledge to do in the future.
 
Upvote
-3 (3 / -6)

ranthog

Ars Legatus Legionis
15,240
They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.

You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.

The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what Crowdstrike were missing with their so-called "Rapid Response" updates, and what they pledge to do in the future.
Yes, it is impossible to prove you can cover every possible case. However, this was not some weird edge condition. This was something that almost every reasonable test case would have demonstrated. Even if they had only set up 10 test VMs to automate this on, with the most common configurations, they'd likely have replicated the flaw in test.

The problem with their testing is that they don't even run the actual update before deploying. It appears they just test on millions of customers' PCs instead of doing it internally.

Things like sending out updates in waves are what save you from weird edge cases and odd configurations. This should never have gotten out of the test environment, as it should have caught all of their test beds in a blue screen boot loop. Automated testing could have caught this problem in a few minutes and would not have added significant delay to the update.
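The wave idea is simple enough to sketch. This hypothetical C fragment halts a rollout as soon as a wave exceeds a failure threshold, so a bad update stops at the canary ring instead of reaching the whole fleet; the function, wave sizes, and threshold are all illustrative:

```c
/* health[i] is 1 if host i passes its post-update check (e.g. it
   rebooted cleanly), 0 if it did not. Returns the number of hosts
   that received the update before the rollout halted. */
int staged_rollout(int n_hosts, const int *health,
                   const int *wave_sizes, int n_waves,
                   double max_failure_rate) {
    int updated = 0;
    for (int w = 0; w < n_waves && updated < n_hosts; w++) {
        int end = updated + wave_sizes[w];
        if (end > n_hosts) end = n_hosts;
        int failures = 0;
        for (int h = updated; h < end; h++)
            if (!health[h]) failures++;      /* "deploy", then check */
        int wave_count = end - updated;
        updated = end;
        if ((double)failures / wave_count > max_failure_rate)
            return updated;   /* halt: blast radius = this wave only */
    }
    return updated;
}
```

With wave sizes like {2, 8, 90}, an update that bluescreens every machine stops after the first two canaries rather than after all 8.5 million.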
 
Upvote
18 (18 / 0)

sarusa

Ars Praefectus
3,258
Subscriptor++
This is bullcrap. You can't blame the faulty patch on the automated testing. Automated testing is supposed to be an additional line of defense, not your only check of 'okay, it works'. What happened to any sort of functional testing on any machines before the dev pushed it live?

I have heard terrible things about CrowdStrike's engineers and management, including about their wild cowboy deployments (push live from your dev machine right after you compile, Compiles On My Machine™, shrug shrug shrug). Apparently it's all true.
 
Upvote
10 (10 / 0)

Nilt

Ars Legatus Legionis
21,810
Subscriptor++
Please pardon my complete ignorance.... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission critical machines before allowing the update to be applied to the entire organization? We used to do that with any of the Windows updates and had them cause problems on a few occasions.
The real problem is that even in the instances where this sort of configuration was set up, either this type of update bypasses it entirely or a bug allowed that to happen. I've seen conflicting reports both ways, and to my knowledge this hasn't been explicitly covered by CrowdStrike yet, either. The fact that they have admitted to not following other pretty standard update-testing practices leads me to think it's the former, but that's speculation at this stage.
 
Upvote
0 (1 / -1)

henryhbk

Ars Tribunus Militum
1,952
Subscriptor++
The aspect that seems particularly alarming(and which they are not talking about) is that the kernel driver component was apparently willing to accept a malformed update on the basis of nothing but a header and then keel over and die.

Certainly having an actual testing process would be nice; but (especially when the whole point of your software is that there might be adversarial activity on the system) it seems like a deep and fundamental problem that such a high-privilege/high-criticality component is so brittle against malformed input.

Even the most impeccable testing can only assure you that your inputs won't cause your driver to misbehave; they can't assure you that you will always remain in control of which inputs your driver ends up chewing on.
Seriously:

    try:
        config = read_file(new_update)
    except Exception:
        config = read_file(last_file_that_worked)
        alert_someone("new update failed to parse")
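That "last known good" pattern can be sketched a bit more concretely. This C fragment is a toy illustration with invented names and an invented validity check, not the driver's real logic:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { const uint8_t *data; size_t len; } blob_t;

/* Stand-in for whatever real validation the driver would do;
   here "valid" just means non-empty with an expected first byte. */
int parse_blob(const blob_t *b) {
    return b->data != NULL && b->len >= 4 && b->data[0] == 0xAA;
}

/* Try the fresh update; on failure, fall back to the previous
   known-good blob and flag that an alert should be raised,
   instead of letting bad input take the machine down. */
const blob_t *activate(const blob_t *fresh, const blob_t *last_good,
                       int *alerted) {
    *alerted = 0;
    if (parse_blob(fresh))
        return fresh;
    *alerted = 1;            /* throw alert to someone */
    return last_good;
}
```

A kernel component that degrades to yesterday's rules and pages a human is in a very different failure mode than one that bluescreens the host.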
 
Upvote
6 (7 / -1)

sparselogic

Seniorius Lurkius
16
Subscriptor++
Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?

If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another c-suite job at a different company, and do it again. Repeat until you retire.

Look at the resumes of half these CEOs etc... and it's a trail of failure. But it's never them that suffers for it. In any sane world the consequence of failure should be higher if you get paid millions because you're supposed to be so special and important.

How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.
I mistakenly thought The World held adults responsible for their screwups. The number of people I’ve encountered in professional roles who just clown their way through life is… both startling and reassuring.

(And take a look at the number of one-man trade/construction outfits who leave a mess everywhere they go and still have customers) :)
 
Upvote
8 (8 / 0)

henryhbk

Ars Tribunus Militum
1,952
Subscriptor++
I worked at a large "northern European"-HQ'ed telecoms company (not saying which one...):

In several of their divisions one of their big pushes is for every single test/QA person to be capable of writing test automation code, and if you're not capable of writing test automation code, you're on the layoff list. (or already gone)

Because "manual" QA is too slow.

The problem is: (I hope to hell all of us realize this) some of the best QA and test people I've ever worked with couldn't write a line of code to literally save their lives (or jobs) but can find bugs, can describe them and can advocate for getting them fixed, and can find stuff that the worlds best automation would never ever find.

And those were the people getting turfed. Despite creativity in finding problems, being effective advocates for customer-facing issues. "Can't write an automated test case? buh-bye"

This is for software that runs the complex networks of the world, where a misapplied CI/CD-pipelined blob of code will knock major infrastructure offline. (Just ask Rogers in Canada...)
You want eyeballs and a brain on some aspects of that test cycle...
I worked at Microsoft in the 80s on the Multiplan team, and we had a great tester team with what was, for the time, pretty good test automation (DOS batch files, mostly). But one day a tester runs something and says "huh, I seem to remember that recalc being faster last release", which was true: the x87 flag had gotten set to off, so we were emulating the math coprocessor in a library. No automation suite is figuring that out, since everything still worked; you just had to remember from the previous release that the tests had been faster.
 
Upvote
10 (10 / 0)

henryhbk

Ars Tribunus Militum
1,952
Subscriptor++
Pilots are actually trained about "accident chains" and to keep them in mind when making decisions to try to break such chains. Any really big mistake is just the culmination of a trail of little ones. Details matter.
Well, of course: they're at the front of the fast-moving tube, so unlike the CEO they have incentives.
 
Upvote
7 (7 / 0)

Chuckstar

Ars Legatus Legionis
37,249
Subscriptor
Yes, it is impossible to prove you can cover all possible case. However, this was not some weird edge condition. This was something that most every reasonable test case was going to demonstrate the flaw. Even if they only set up 10 test VM's to automate this on with the most common configurations they'd have likely replicated the flaw in test.

The problem in the testing is that they don't even run the actual update before deploying. It appears they just test on millions of customers PC's instead of doing that internally.

Things like sending out updates in waves is what saves you from weird edge cases and odd configurations. This should have never gotten out of the test environment, as it should have caused all of their test beds to get caught in looping blue screen loop. Automated testing could have easily caught this problem in a few minutes, and would not add significant delay to the update.
I look at this like not having a test of Hubble’s mirror independent of the grinding rig. You’d have to have a pretty stupid error to need such a test. ;)

It’s the classic “if we do it right, the test will be unnecessary”. But that’s true of most tests, so why not just do the easy test to make sure?
 
Upvote
8 (8 / 0)

ranthog

Ars Legatus Legionis
15,240
I look at this like not having a test of Hubble’s mirror independent of the grinding rig. You’d have to have a pretty stupid error to need such a test. ;)

It’s the classic “if we do it right, the test will be unnecessary”. But that’s true of most tests, so why not just do the easy test to make sure?
With that, making sure a mirror grinding rig works correctly is much easier.
 
Upvote
4 (4 / 0)

Ganz

Ars Scholae Palatinae
757
Yes, it is impossible to prove you can cover all possible case. However, this was not some weird edge condition. This was something that most every reasonable test case was going to demonstrate the flaw.
It's not always impossible. It's often infeasible (read: not as profitable). But anyway ...

I'm definitely not in a position to roll my eyes and say that it's obvious that tests should have caught this - I have no idea how their system works. You appear to know for sure, which is great - they should fix their dumb tests. I'm down.

All I personally can roll my eyes at is their "everything, everywhere, all at once" deployment model for these Template updates, and fixing that would certainly have caught this problem before it got to paying customers.

This is not an either/or thing, so your assertion that they should fix their obviously stupid tests does not nullify my assertion that their deployment model is bonkers crazy-go-nuts. Because it's bonkers crazy-go-nuts.
 
Upvote
7 (8 / -1)