> "full Root Cause Analysis". Well, one of the problems was that their software is running as root.

As has been addressed previously, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do its job. It's an unfortunate but necessary evil.
> The company is specifically including "additional validation checks to the Content Validator".

They need to go beyond this: switch the kernel modules to a memory-safe language and implementation, and/or move to better-standardized APIs, so that no validation-check bug can ever again cause a crash loop like this.
> Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems:
>
> "On March 05, 2024, a stress test of the IPC Template Type was executed in our staging environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use."
>
> ...but not for the update that caused this problem:
>
> "On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data."
>
> Which at least strongly implies that content updates are not routinely tested in staging, but rather solely through the validator tool. I'm sure there are reasons, probably involving speed of deployment, since they previously mentioned deploying these content updates multiple times per day.
>
> I hope that one of the lessons learned from this is that there is no substitute for testing in a pre-production environment that represents prod with some degree of accuracy.

Yeah, you can't just rely on unit tests, validators, sub-module checks, etc. They are good practice but are insufficient to ship. Seems like someone didn't learn that very important lesson in their coding bootcamp.
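The gap described above (a validator pass that still bricked real systems) can be shown in miniature. Everything below is a toy sketch with hypothetical names, rule format, and field count, not CrowdStrike's actual code; it just illustrates how a validator that checks structure can approve content violating an invariant the consumer indexes on (per the RCA, a mismatch in the number of input fields led to an out-of-bounds read).

```python
# Toy sketch (hypothetical rule format and names, not CrowdStrike's code) of
# how a structural validator can pass content that still crashes the consumer:
# the validator checks shape, but not the invariant the consumer indexes on.

FIELD_COUNT = 20  # fields the sensor actually supplies at runtime (assumed)

def validate(rule: dict) -> bool:
    """Structural check only: the right keys with the right types."""
    return isinstance(rule.get("pattern"), str) and isinstance(rule.get("field_index"), int)

def sensor_evaluate(rule: dict, fields: list) -> bool:
    """The consumer trusts validated input and indexes without a bounds check."""
    # In this userspace sketch that raises IndexError; in C kernel code the
    # same mistake is an out-of-bounds read that can crash the machine.
    return rule["pattern"] in fields[rule["field_index"]]

fields = ["value"] * FIELD_COUNT
good = {"pattern": "value", "field_index": 3}
bad = {"pattern": "value", "field_index": 20}  # one past the end

assert validate(good) and validate(bad)  # both look fine to the validator
assert sensor_evaluate(good, fields)     # works on a real field
try:
    sensor_evaluate(bad, fields)         # "validated" content still crashes
except IndexError:
    print("validator passed it; consumer crashed anyway")
```

This is exactly why a staging run on representative systems catches what a parser-level validator cannot: the crash only manifests when the consumer actually executes the content.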
> I went through the full round and got no offer, but it was a huge red flag to me that the hiring manager, when asked why he liked working at CrowdStrike, wouldn't shut up about the "stock price." He seemed not very interested in much beyond the fact that it was "fast growing." I suspect many managers were there for the IPO and don't give a fuck as long as they get their payout.

Send him an email, and ask him how he likes the stock price now.

> How on earth did they not: 1) Have staggered rollout for such mission-critical stuff. 2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content-validator thing that's not actually a real end-to-end test? [...]

Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that leaves me with some degree of "there but for the grace of God go I." The most recent Microsoft hack was the result of an ultra-improbable cascade: a crash dump from the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine being compromised, and the attackers getting ridiculously lucky as they combed through it. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.
> Typically in a case like this, the root cause analysis is kind of crazy to read. [...] it's always a chain of improbable stuff, even if there were serious mistakes along the way.

Yeah, I've been part of things that have gone through root cause analysis and come out as "OK, that's a one-in-a-million chance of all those bad things lining up like this; it shouldn't have happened, but it's understandable how it didn't get caught."

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
> The biggest change will probably be "a staggered deployment strategy for Rapid Response Content" going forward. In a staggered deployment system, updates are initially released to a small group of PCs, and then availability is slowly expanded once it becomes clear that the update isn't causing major problems.

How is it possible, in the year 2024, that this wasn't already in place? Pure incompetence.
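The staggered deployment described in the quote above fits in a few lines. This is a minimal sketch; the wave fractions, deploy hook, and health check are hypothetical stand-ins, not any vendor's actual pipeline.

```python
# Minimal sketch of a staggered (canary) rollout: deploy to progressively
# larger fractions of the fleet, halting before the blast radius grows if
# any already-updated host reports unhealthy. All names are hypothetical.
import math

def staggered_rollout(hosts, deploy, healthy, waves=(0.01, 0.10, 0.50, 1.0)):
    """Returns ("complete", n) or ("halted", n), where n is hosts reached."""
    done = 0
    for frac in waves:
        target = math.ceil(len(hosts) * frac)
        for host in hosts[done:target]:  # only the new slice for this wave
            deploy(host)
        done = target
        if not all(healthy(h) for h in hosts[:done]):
            return ("halted", done)  # stop and roll back before going wider
    return ("complete", done)

# A bad update that bricks its hosts is caught after the 1% wave:
fleet = [f"host-{i}" for i in range(1000)]
installed = set()
status, reached = staggered_rollout(fleet, installed.add, lambda h: h != "host-0")
assert (status, reached) == ("halted", 10)
```

With 8.5 million machines, a 1% first wave would have capped the damage at roughly 85,000 hosts instead of the whole fleet at once.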
> Interesting that they detail testing some releases in a staging environment containing multiple types of sample systems [...] but not for the update that caused this problem [...]

Swiss cheese theory of holes aligning: a bad template instance shouldn't be able to hose the core sensor, but a second bug might make that possible (especially if broken content is typically weeded out at the validation phase).
> Yeah, I've been part of things that have gone through root cause analysis and come out as "OK, that's a one-in-a-million chance of all those bad things lining up like this" [...]
>
> This was absolutely not that. It was clearly one of those very preventable things.

It's the Swiss cheese model: all the holes lined up and the problem got through. This is big in aviation, where one little mistake, if it lines up with other little problems, can literally kill you.
> Typically in a case like this, the root cause analysis is kind of crazy to read. [...] The most recent Microsoft hack was a result of this ultra-improbable cascade of a crash dump in the super-secure system ending up on a less-secure dev's machine for debugging [...]

I believe no hard evidence for the crash dump theory was ever provided, and it was later retracted.
> How is it possible, in the year 2024, that this wasn't already in place? Pure incompetence.

Money.
> How on earth did they not: 1) Have staggered rollout for such mission-critical stuff. 2) Test it on LIVE FREAKING WINDOWS SYSTEMS [...]

Because the above would cost money and require some degree of forethought and caution, and that's a crime for a CEO! Don't you know that a CEO's highest duty is to make as much money as possible? That duty explains and totally excuses literally putting lives at risk, abusing employees, breaking laws, having no morals, and assisting others with crimes. A CEO doesn't want to torture small puppies and take food from the mouths of infants; their fiduciary duty requires them to do so. That's why they pay themselves so much, and make everyone but themselves pay the price for their failures: because of how much being a leader is an imposition on them! /s
> They need to go beyond this and either switch the kernel modules to a memory-safe language and implementation and/or switch to better-standardized APIs so that no validation check issue can result in a crash loop like this ever again.

A memory-safe language would not fix the issue of trusting/executing external files.

Edit: on more recent Linux kernels, CrowdStrike uses eBPF-sandboxed code; something similar would be needed on Windows.
> So are they saying that they relied on a content validator instead of pushing to an actual system? The fact that this happened to every Windows system it touched is damning.

I think that's code for "unit tests".
> As has been addressed far too many times, it's laziness and incompetence, not a necessary evil. Some of that should also be attributed to Microsoft, especially the OS design.

I would like to know how you propose to run malware detection and analysis from an unprivileged position that's not allowed to do things like arbitrary memory, process, and file inspection?
> Yeah, I've been part of things that have gone through root cause analysis and come out as "OK, that's a one-in-a-million chance of all those bad things lining up like this" [...]
>
> This was absolutely not that. It was clearly one of those very preventable things.

Many of the aircraft investigations I've seen have a strong flavor of "how well the holes in the Swiss cheese slices lined up" to them. Somehow, I don't expect that in this case, but it's possible.
> How on earth did they not: 1) Have staggered rollout for such mission-critical stuff. 2) Test it on LIVE FREAKING WINDOWS SYSTEMS [...]

Exactly. This mistake should bankrupt them, because they are clearly MAJORLY deficient in their testing and controls of something so vital.
Send him an email, and ask him how he likes the stock price now.
How on earth did they not:

1) Have a staggered rollout for such mission-critical stuff?

2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content-validator thing that's not actually a real end-to-end test?

It sounds like they test their content updates with a parser. Fine, that's great, but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff; this isn't just some random Node.js module where you aren't necessarily culpable for downstream effects of breaking changes.

This is sloppy devops on two major counts. Either one of these things would've saved millions or billions of customer dollars.
Another worrying issue is that if their agent will accept a malformed channel file, how hard would it be to get it to load a phony channel file? We already know its validation is broken.
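An authenticity gate addresses the phony-file worry specifically, though it would not have stopped this outage, since the bad channel file was legitimately produced and shipped by the vendor. Here is a minimal sketch using stdlib HMAC purely for illustration; a real agent would verify an asymmetric signature (e.g. Ed25519) so endpoints never hold a signing secret, and every name below is hypothetical.

```python
# Sketch of an authenticity check for content files, using stdlib HMAC for
# illustration only. A real product would use asymmetric signatures so the
# verification key on endpoints cannot be used to forge files.
import hashlib
import hmac

SIGNING_KEY = b"demo-key-not-a-real-secret"  # hypothetical; never hardcode keys

def sign_channel_file(data: bytes) -> bytes:
    return hmac.new(SIGNING_KEY, data, hashlib.sha256).digest()

def load_channel_file(data: bytes, tag: bytes) -> bytes:
    # Reject anything whose tag does not verify: a phony or tampered file
    # never reaches the parser. Parsing/validation would happen after this gate.
    if not hmac.compare_digest(sign_channel_file(data), tag):
        raise ValueError("channel file rejected: bad signature")
    return data

good = b"rule: example"
tag = sign_channel_file(good)
assert load_channel_file(good, tag) == good
try:
    load_channel_file(b"phony content", tag)  # forged payload, stale tag
except ValueError as exc:
    print(exc)  # the phony file is refused before parsing
```

The design point: signature verification and content validation are separate layers. The first keeps attackers from feeding the agent fake files; only the second (the layer that failed here) protects against malformed files from the legitimate source.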