> Well, obviously it didn't check correctly, because the debugger clearly shows an attempt to dereference a pointer in the null page. Either that, or the debugger is lying.

The crash dump shows that it was trying to jump to an invalid pointer address, and many people incorrectly assumed, because the address was so low, that it was an offset to a NULL pointer. However, CSAgent.sys clearly checks for NULL pointers before jumping to offsets.

> Can a normal user really kill system-level / other users' processes on Windows? That OS is even more garbage than I thought...

You really don’t want to know the answer to this…

> It was absolutely a choice. And they chose poorly. Try reading next time.

Maybe you should read better; they were forced into it as part of the EU anti-trust agreements.

> Please pardon my complete ignorance.... I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission-critical machines before allowing it to be applied to the entire organization? We used to do that with Windows updates and saw them cause problems on a few occasions.

Officially? Yes, they had deployment rings the customer could set, so updates like this could be tested and staged into larger parts of an org.

> The crash dump shows that it was trying to jump to an invalid pointer address, and many people incorrectly assumed, because the address was so low, that it was an offset to a NULL pointer. However, CSAgent.sys clearly checks for NULL pointers before jumping to offsets.
>
> Which means the small value that looked like an offset was loaded directly from something else, like a serialized instance.

... or by computing an offset from null... and checking afterwards.

> Maybe you should read better; they were forced into it as part of the EU anti-trust agreements.

No.

> Maybe you should read better; they were forced into it as part of the EU anti-trust agreements.

No, Microsoft was forced to give third-party AV the same access as Defender*. In response, Microsoft created a new kernel-level API (with callbacks that kernel drivers register for) so EDR/AV vendors no longer had to patch kernel structures.

> You really don’t want to know the answer to this…
>
> A. Windows has a hardcoded list of executable names you’re never allowed to kill.
>
> B. In a completely separate part of the code, Windows makes sure only Microsoft executables can have those names on disk and launch.
>
> So Windows has no way to identify whether a process is special based on metadata or data inside the process/executable; it has to look at metadata in the file system to do A.

Interesting. So if users can just manipulate the processes of other users, what's even the point of logging in with different users?

> Some analysts online have shown debugging data from crash dumps and minimal reverse engineering. By their account it's a null reference to a pointer in a system driver. That's something unit testing should have easily caught ... if used.
>
> So here is what we know:
>
> - Trivial error in the software, running as a system driver.
> - Insufficient testing.
> - Insufficient control over large-scale rollouts.
> - Not previously sharing release notes with customers.
> - Not previously allowing customers to control the timing of rollouts.
> - Not previously allowing customers to use automated staged rollouts.
>
> As someone working with governance in Enterprise IT, I am astonished they got this big without their customers challenging these things. It's truly a WTF moment for the industry.

TBH I'm not surprised, because the whole reason most companies bought their software was that the client didn't want to worry about any of these things. You can bet that the CIOs of these client companies were willing to break their own systems rather than get IN WRITING that CrowdStrike was actually doing best practices.

> Interesting. So if users can just manipulate the processes of other users, what's even the point of logging in with different users?

Because Microsoft doesn’t believe there should be a security boundary between admin and kernel, part of Microsoft is trying to convince people to only log in as non-admin* users. However, Microsoft simultaneously says the majority of users log in as admin users.

> TBH I'm not surprised, because the whole reason most companies bought their software was that the client didn't want to worry about any of these things. You can bet that the CIOs of these client companies were willing to break their own systems rather than get IN WRITING that CrowdStrike was actually doing best practices.
>
> Software is basically unregulated, and this failure is just a taste of how fragile our ecosystem really is.

The question is whether it is gross negligence. In most jurisdictions, gross negligence can't be shifted by contract. The fact that they never ran their update before pushing it live is quite frankly amazing.

> This very interesting post goes into the terms of service and basically concludes "we told you not to use this software on critical systems and if you did, it's on you":
>
> "THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."
>
> https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html

There's a huge difference between "don't hook this up to something that keeps a person alive or a turbine from exploding" and "day-to-day business can't happen if 100% of our customer-facing employees can't boot their devices". If 1% or even 5% of the computers in hospitals and airports stopped working, it would hurt, but it wouldn't be an impossible-to-mitigate disaster; that level of failure is what's unacceptable for "hazardous" and "fail-safe" type applications in general. When none of the customer-facing employees can do their jobs because an uncontrolled, improperly and incompletely checked update took out EVERYTHING in the building, it's different. And then you do get an impossible-to-mitigate disaster, because your software vendor screwed up and never gave you control to limit their screw-up's effect on your systems!

> Because Microsoft doesn’t believe there should be a security boundary between admin and kernel, part of Microsoft is trying to convince people to only log in as non-admin* users. However, Microsoft simultaneously says the majority of users log in as admin users.
>
> *Microsoft does believe there should be a security boundary between different low-privileged users, but not between processes running as the same user.
>
> Microsoft Security Boundaries

So presumably CrowdStrike would run as some system user, and not as the logged-in user, who therefore should not have the permissions to kill it. Unless the logged-in user is an admin, who could presumably also unload kernel modules.

> Well, obviously it didn't check correctly, because the debugger clearly shows an attempt to dereference a pointer in the null page. Either that, or the debugger is lying.

One of the blogs said it was a NULL + offset dereference, so the null check wouldn’t work. But it would still crash, as all those low-value addresses are illegal.

> One of the blogs said it was a NULL + offset dereference, so the null check wouldn’t work. But it would still crash, as all those low-value addresses are illegal.

This is also why you usually at least do a test deployment to run the program and do some basic sanity checks, even when your automated tests came back green.

> Well, obviously it didn't check correctly, because the debugger clearly shows an attempt to dereference a pointer in the null page. Either that, or the debugger is lying.

Rosyna is correct. It was an OOB memory read.

> Does anyone know how many endpoints are running the sensor client?...
>
> This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.

It's a bit worse than that. This wasn't an update to 8.5 million machines. That's just the number that downloaded the update in question in the 78-minute window that it was live, and experienced the error.

> As far as lists of things that were done wrong go, am I the only one to have a "no system updates on a Friday unless it's an emergency" rule? Who wants to stress out on a Friday, let alone work all weekend because something went wrong?

People that don't want to open a 48-hour window for ransomware to run on their systems every weekend.

> This software is installed on 8.5M machines?

No, that's how many downloaded the update in the 1.5 hours it was published. (Yes, CrowdStrike is slow.) We had about 20% of PCs and 3% of servers affected, so far more than 8.5M machines run CrowdStrike.

> The option was in the standard deployment management console to have staged deployments of all patches.
>
> CrowdStrike hid that they could flag a patch to ignore that and deploy to all anyway.

No, this was a so-called "Template" or "Rapid Response" update: a type of update that does not honor patch-update controls. Customers did not have control over when their endpoints picked up this update. CrowdStrike's response includes an enhancement to introduce such controls in the future.

> I find the CrowdStrike admission odd. How can there be testing bugs when they aren't doing any testing?

They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.

> They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.
>
> You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.
>
> The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what CrowdStrike was missing with its so-called "Rapid Response" updates, and what it pledges to do in the future.

Yes, it is impossible to prove you can cover all possible cases. However, this was not some weird edge condition. This was something that almost every reasonable test case would have demonstrated. Even if they had set up only 10 test VMs to automate this on, with the most common configurations, they'd likely have replicated the flaw in test.

> How about they blame their incompetence?

I'd blame gross negligence myself.

> Please pardon my complete ignorance.... I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission-critical machines before allowing it to be applied to the entire organization? We used to do that with Windows updates and saw them cause problems on a few occasions.

The real problem is that even where this sort of configuration was set up, either this type of update bypassed it entirely or a bug allowed that to happen. I've seen conflicting reports both ways, and to my knowledge this hasn't been explicitly covered by CrowdStrike yet either. The fact that they have admitted to not following other pretty standard update-testing practices leads me to think it's the former, but that's speculation at this stage.

> The aspect that seems particularly alarming (and which they are not talking about) is that the kernel driver component was apparently willing to accept a malformed update on the basis of nothing but a header, and then keel over and die.
>
> Certainly having an actual testing process would be nice; but (especially when the whole point of your software is that there might be adversarial activity on the system) it seems like a deep and fundamental problem that such a high-privilege/high-criticality component is so brittle against malformed input.
>
> Even the most impeccable testing can only assure you that your inputs won't cause your driver to misbehave; it can't assure you that you will always remain in control of which inputs your driver ends up chewing on.

Seriously.

> Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?
>
> If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another C-suite job at a different company, and do it again. Repeat until you retire.
>
> Look at the resumes of half these CEOs and it's a trail of failure. But it's never them that suffers for it. In any sane world the consequence of failure would be higher if you get paid millions, because you're supposed to be so special and important.
>
> How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.

I mistakenly thought the world held adults responsible for their screwups. The number of people I've encountered in professional roles who just clown their way through life is… both startling and reassuring.

> I worked at a large "northern European"-HQ'ed telecoms company (not saying which one...).
>
> In several of their divisions, one of the big pushes is for every single test/QA person to be capable of writing test automation code; if you're not, you're on the layoff list (or already gone). Because "manual" QA is too slow.
>
> The problem is (I hope to hell all of us realize this): some of the best QA and test people I've ever worked with couldn't write a line of code to literally save their lives (or jobs), but they can find bugs, describe them, advocate for getting them fixed, and find stuff that the world's best automation would never, ever find.
>
> And those were the people getting turfed, despite their creativity in finding problems and their effectiveness as advocates for customer-facing issues. "Can't write an automated test case? Buh-bye."
>
> This is for software that runs the complex networks of the world, where a mis-applied CI/CD-pipelined blob of code will knock major infrastructure offline. (Just ask Rogers in Canada...)
>
> You want eyeballs and a brain on some aspects of that test cycle...

I worked at Microsoft in the '80s on the Multiplan team. We had a great tester team, and we had what was at the time pretty good test automation (DOS batch files, mostly). But one day a tester ran something and said, "Huh, I seem to remember that recalc being faster last release," which was true, since the x87 flag had gotten set to off and we were emulating the math coprocessor in the library. No automation suite is figuring that out, as everything worked; you had to remember from last year that the tests were faster.

> Pilots are actually trained about "accident chains" and to keep them in mind when making decisions, to try to break such chains. Any really big mistake is just the culmination of a trail of little ones. Details matter.

Well, of course: they're at the front of the fast-moving tube, so unlike the CEO they have incentives.

> Yes, it is impossible to prove you can cover all possible cases. However, this was not some weird edge condition. This was something that almost every reasonable test case would have demonstrated. Even if they had set up only 10 test VMs to automate this on, with the most common configurations, they'd likely have replicated the flaw in test.
>
> The problem in the testing is that they don't even run the actual update before deploying. It appears they just test on millions of customers' PCs instead of doing that internally.
>
> Sending out updates in waves is what saves you from weird edge cases and odd configurations. This should never have gotten out of the test environment; it should have caught all of their test beds in a blue-screen boot loop. Automated testing could have caught this problem in a few minutes and would not have added significant delay to the update.

I look at this like not having a test of Hubble's mirror independent of the grinding rig. You'd have to have a pretty stupid error to need such a test.

> I look at this like not having a test of Hubble's mirror independent of the grinding rig. You'd have to have a pretty stupid error to need such a test.
>
> It's the classic "if we do it right, the test will be unnecessary". But that's true of most tests, so why not just do the easy test to make sure?

With that, making sure a mirror grinding rig works correctly is much easier.

> Yes, it is impossible to prove you can cover all possible cases. However, this was not some weird edge condition. This was something that almost every reasonable test case would have demonstrated.

It's not always impossible. It's often infeasible (read: not as profitable). But anyway ...
