I suspect their work involves a lot of things where testing is limited due to the extremely high number of combinations possible.

> It's not always impossible. It's often infeasible (read: not as profitable). But anyway ...
> I'm definitely not in a position to roll my eyes and say that it's obvious that tests should have caught this - I have no idea how their system works. You appear to know for sure, which is great - they should fix their dumb tests. I'm down.
> All I personally can roll my eyes at is their "everything, everywhere, all at once" deployment model for these Template updates, and fixing that would certainly have caught this problem before it got to paying customers.
> This is not an either/or thing, so your assertion that they should fix their obviously stupid tests does not nullify my assertion that their deployment model is bonkers crazy-go-nuts. Because it's bonkers crazy-go-nuts.
Or stagger your deployment, like responsible developers have been doing since the late 90s.

> This is also why you usually at least do a test deployment to run the program and do some basic sanity checks, even when your automated tests came back green.
When you know who sails the ship over there, you'll know. Proven track record of wreaking havoc.

> How on earth did they not:
> 1) Have a staggered rollout for such mission-critical stuff.
> 2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content validator thing that's not actually a real end-to-end test?
> It sounds like they test their content updates with a parser. Fine... that's great... but it's insufficient. Proper end-to-end systems testing is absolutely table stakes for this type of stuff... this isn't just some random Node.js module where you aren't necessarily culpable for downstream effects of breaking changes.
> This is sloppy DevOps on two major counts. Either one of these things would've saved millions or billions of customer dollars.
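The point above about parser-only checks being insufficient can be sketched concretely. This is a hypothetical illustration, not CrowdStrike's code: the file format, names, and the 20-slot table are all invented (the off-by-one loosely echoes the reported field-count mismatch). The failure mode is that a standalone validator which doesn't share the real consumer's code path can bless input the consumer cannot survive.

```python
# Invented example: a lenient offline validator vs. the real consumer.

def validate_content(record: dict) -> bool:
    """Lenient offline validator: only checks that required keys exist."""
    return "template_id" in record and "field_indexes" in record

def kernel_parse(record: dict) -> list:
    """Stand-in for the real parser: indexes a fixed 20-slot table."""
    table = ["rule"] * 20
    return [table[i] for i in record["field_indexes"]]  # IndexError if i >= 20

update = {"template_id": 291, "field_indexes": [0, 5, 20]}  # 21st slot: absent
assert validate_content(update)  # validator: green, ship it everywhere
try:
    kernel_parse(update)
    crashed = False
except IndexError:
    crashed = True
assert crashed  # the component that actually mattered still blew up
```

The only check that proves anything is running the content through the exact code path production uses, on a real system - which is what the end-to-end testing the commenter asks for amounts to.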
Let the CEO bleed $, too!

> So they admit they deployed worldwide all at once. That has been against best practices for large-scale deployments for more than two decades. Sue them into oblivion.
> After this public FU they deserve to lose all their customers - even if it takes time for them to move to a competitor.

>> Let me get this straight. Ignoring the boneheaded deployment issue, a company that sells cybersecurity software has a sloppy verification process? That seems to belie the very essence of software testing.
>> Why would I ever trust them?
All it took was a paint chip.

> With that, making sure a mirror grinding rig works correctly is much easier.
Better question: how has this never happened before? I mean, yes, networking is very reliable and most download code does a good job of recovering bad downloads, but just the sheer volume means someone must have gotten a corrupted download in the past. Did their machine BSOD and they simply ignored it?

> Ummm, shouldn't they be testing the KERNEL DRIVER to ensure that it correctly rejects malformed content? It's simply not good enough to have a "Content Validator" that is intended to spare the kernel driver from being exposed to malformed content that would cause it to crash. Maybe this validator is based on the same code as used in their driver, but if so, I hope they are planning on FIXING the driver and not just the external validator.
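The corrupted-download case the commenter raises has a standard defense: publish a digest alongside the file and refuse to install anything that doesn't match byte-for-byte. A minimal sketch of that workflow (assumed, not CrowdStrike's actual update pipeline):

```python
# Verify a downloaded payload against a known-good SHA-256 digest
# before ever handing it to the consumer.
import hashlib

def is_intact(payload: bytes, expected_sha256: str) -> bool:
    return hashlib.sha256(payload).hexdigest() == expected_sha256

good = b"channel-file-contents"
expected = hashlib.sha256(good).hexdigest()  # published with the update

assert is_intact(good, expected)
# Simulate the truncated / zero-filled file many affected machines ended up with:
corrupted = good[:8] + b"\x00" * (len(good) - 8)
assert not is_intact(corrupted, expected)
```

With a check like this at load time, a machine that got a bad copy would simply re-download instead of feeding garbage to a kernel driver.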
This has nothing to do with Windows updates, though...

> Good thing I disabled Windows updates since Windows 7, never had an incident.
Ok so is the company liable for any of the downstream damages? (I feel like I know the answer to this.)
As much as it sucks, it's not "reasonable" to expect damages greater than the costs of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence; the biggest liability is that all their customers will leave, which will terminate its existence.
I look at this like not having a test of Hubble’s mirror independent of the grinding rig. You’d have to have a pretty stupid error to need such a test.
It’s the classic “if we do it right, the test will be unnecessary”. But that’s true of most tests, so why not just do the easy test to make sure?
Don't really blame Microsoft here, they tried implementing an API that would harden the kernel against this very issue, and the EU blocked MS from implementing it because they were afraid it would give Microsoft an advantage in the security software department.

> Microsoft is passing blame at the EU for its policies...from 2009! Touché mon ami!
If Microsoft had implemented an API that all players would have to use, including Microsoft, then the EU would have been satisfied. Because that would have levelled the playing field and no one would have an advantage over another.

> Don't really blame Microsoft here, they tried implementing an API that would harden the kernel against this very issue, and the EU blocked MS from implementing it because they were afraid it would give Microsoft an advantage in the security software department.
This just explains why the file on disk may be all zeroes or be partly written, causing the file to be different on different systems. I think they're saying everybody got the same corrupt update, but it was further corrupted due to the crash.

> CrowdStrike has released their report on why the 291*.*32.sys file may contain NULL bytes.
> TL;DR it's an artifact of Windows/NTFS that occurs after the computer crashed. This explains why so many affected people had such different copies of that definition file.
Agreed. Due to the Windows architecture, it's required for this kind of software.

> As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.
> What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.
Like with most of these cases, it's because they did something that does an end run around how the API is supposed to normally work.

> Agreed. Due to the Windows architecture, it's required for this kind of software.
What's especially indefensible is that their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.
Frankly, I'm surprised this behavior is even allowed for a Ring 0 driver.
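The input validation being asked for here is cheap. The sketch below is illustrative only: the file format, magic number, and limits are invented, and a real Ring 0 driver would do these same checks in C before dereferencing anything - but the shape of the defense is the same in any language.

```python
# Bounds-check an untrusted update blob's header before trusting any of it.
import struct

MAGIC = 0xAA44                   # hypothetical magic number
HEADER = struct.Struct("<HHI")   # magic, entry_count, payload_len

def parse_update(blob: bytes):
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, entry_count, payload_len = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("bad magic (corrupt or foreign file)")
    if payload_len != len(blob) - HEADER.size:
        raise ValueError("length field disagrees with actual size")
    if entry_count == 0 or entry_count > 1024:
        raise ValueError("entry count out of sane range")
    return entry_count, blob[HEADER.size:]

# An all-zero file (like the post-crash channel files) is rejected up front
# instead of being dereferenced:
try:
    parse_update(b"\x00" * 64)
except ValueError as e:
    print("rejected:", e)  # prints: rejected: bad magic (corrupt or foreign file)
```

A driver that rejects malformed content this way fails safe (skip the update, log, keep running) instead of taking the whole machine down.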
What's Money exactly? It's time. It's resources. It's a measurement of risk.

> Money.
They're not actually saying anywhere that the update as deployed was corrupt. The phrase they're using is "problematic content data" that "passed validation" due to a bug in the validator. The "problematic" part could be corruption, but it could equally be that it was deployed as intended but that particular configuration causes the parser to crash. As others have noted, the report singles out the bug in the validator but glosses over what is presumably a major bug in the parser.

Putting these pieces together, my read is that the file that was deployed "should have worked" from the perspective of whoever set it up, or at worst contained what should've been a benign misconfiguration, and the real failures in the intended process were in 1) the parser failing to handle it and 2) the stress test not properly testing all possible values. (You'll note that the report claims the stress test process "match[es] against any possible value of the associated data fields to identify adverse system interactions", which obviously didn't actually happen here; if it'd matched against every possible data value, it would've caught this issue. Or, possibly, the stress test caught it but the validator wasn't properly updated to detect the issue; in that case, though, they're deploying content to a template with a known kernel crash bug, so there's really no excuse for not being exceptionally careful with every aspect of that template until the crash bug is fixed.)

> This just explains why the file on disk may be all zeroes or be partly written, causing the file to be different on different systems. I think they're saying everybody got the same corrupt update, but it was further corrupted due to the crash.
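Worth noting: for a narrow field, "match against any possible value" is actually cheap to do literally. The sketch below is entirely invented (a hypothetical one-byte field and a toy parser with a deliberate gap standing in for the wildcard case), but it shows how an exhaustive sweep finds exactly the kind of unhandled value the report describes.

```python
# Exhaustively test a parser against every possible value of a 1-byte field.

def parse_field(value: int) -> str:
    # Invented parser with a deliberate gap: it "crashes" on one input,
    # standing in for the wildcard case the real stress test missed.
    if value == 0xFF:
        raise RuntimeError("unhandled wildcard")
    return f"rule-{value}"

def crashes(fn, v) -> bool:
    try:
        fn(v)
        return False
    except RuntimeError:
        return True

# Drive every possible one-byte value through the parser:
failures = [v for v in range(256) if crashes(parse_field, v)]
assert failures == [0xFF]  # the exhaustive sweep finds the gap immediately
```

For wider fields you'd fall back to boundary values plus fuzzing, but a test that genuinely covers "any possible value" of a small domain would have caught this class of bug by construction.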
Even apart from corruption issues, tampering with the channel files seems an obvious attack for malware. The software should be validating the files' contents and that all expected files are present. Removing the bad file fixes this issue, but it means malware can disable detection by removing good files.
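A sketch of the protection the commenter is describing (an assumed design, not CrowdStrike's; filenames and the HMAC key are invented, and a real product would use a proper signature scheme): a keyed manifest listing every channel file and its digest, so both tampering and deletion are detected.

```python
# Detect both modified and missing content files via a keyed manifest.
import hashlib, hmac

KEY = b"vendor-signing-key"  # stand-in for a real code-signing key

def digest(data: bytes) -> str:
    return hmac.new(KEY, data, hashlib.sha256).hexdigest()

def verify(manifest: dict, files: dict) -> list:
    """Return a list of problems; empty means the file set is intact."""
    problems = []
    for name, expected in manifest.items():
        if name not in files:
            problems.append(f"missing: {name}")   # deletion is an attack too
        elif digest(files[name]) != expected:
            problems.append(f"tampered: {name}")
    return problems

files = {"chan-291.sys": b"rules-a", "chan-292.sys": b"rules-b"}
manifest = {name: digest(data) for name, data in files.items()}

assert verify(manifest, files) == []
del files["chan-291.sys"]  # malware removes a detection file
assert verify(manifest, files) == ["missing: chan-291.sys"]
```

With presence checking in the manifest, "just delete the file" stops being a silent way to blind the sensor.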
This is what software companies want you to believe. But when gross negligence enters the chat, I believe disclaimers of fitness for purpose are rendered ineffective as a defense.

> As much as it sucks, it's not "reasonable" to expect damages greater than the costs of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence; the biggest liability is that all their customers will leave, which will terminate its existence.
How do you say OS in Latin? But I suppose once you're at ring 0 you are the OS. Still, why not ring -1?

> Quis custodiet ipsos custodes?
Just to be clear - this seems to be a validation issue, not a verification issue.

> Let me get this straight. Ignoring the boneheaded deployment issue, a company that sells cybersecurity software has a sloppy verification process? That seems to belie the very essence of software testing.
> Why would I ever trust them?
It's been said for years: the game is not about the money per se; the money is just a way of keeping score.

> What's Money exactly? It's time. It's resources. It's a measurement of risk.
I'm not saying you're wrong, but that's like saying money is why it takes me an hour to get to work instead of 5 minutes or 2 days. On one level it's true, but on another, what you don't know is: am I only taking an hour because that's the most efficient way to get to work? Or am I taking advantage of work by getting there slower than I could? Or taking advantage of home by getting home slower than I could?
My point is that money is a metric. Implying that any time someone does less or spends less than they otherwise could, the reason is nefarious, is unfair. It could also be incompetence. It could just as easily be that the way it was done really was the most efficient way possible.
Now in this case it really does seem like incompetence. It also seems like CrowdStrike was using its privileged position on its clients' OSes as a lever to improve its product - and so taking risks that arguably benefited CrowdStrike as much as or more than the client. But just saying "money" doesn't tell the whole story.
Yes, because we all know Contracts Are Law.

> As much as it sucks, it's not "reasonable" to expect damages greater than the costs of the tool. The lawyer writing the contract would never agree to limitless liability. The biggest thing at stake is the company's reputation. They won't be sued out of existence; the biggest liability is that all their customers will leave, which will terminate its existence.
This.

> How on earth did they not:
> 1) Have a staggered rollout for such mission-critical stuff.
> 2) Test it on LIVE FREAKING WINDOWS SYSTEMS instead of trusting some unit-test content validator thing that's not actually a real end-to-end test?
100% this. Dave is great at putting this into terms a non-dev can understand.

> What's especially indefensible is that their driver isn't doing ANY input validation on these updates. Dave's Garage has a good explainer.
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.
I've just heard that Microsoft wanted to provide such APIs for AV companies, and apparently the EU blocked the idea as it could harm competition.
Maybe technically correct but pragmatically useless. You do realize that Windows is in the business of supporting lots of legacy compatibility things, right?

> You missed two words: "on Windows".
Apple had the same directive from the EU and, rather than taking the path of least work, developed a mechanism so that AV clients don't have to run at ring 0.
Microsoft could have done so but chose not to.
Apple also isn't as big of a target for malware creators; Microsoft likely saw critical value in being able to run Defender tasks in Ring 0 as a defense against that.

> Nope. The EU required Microsoft (and Apple) to give AV vendors access to the same API as they use. Apple wrote a system so that AV didn't have to be ring 0; Microsoft didn't.
When I claimed that "they weren't doing any testing" in my previous post, I was implying that relying entirely on an automated testing routine isn't effective on its own.

> They said they did and do testing. The admission makes a distinction between testing (running code that challenges the new functions in various ways, ensuring the results are the ones expected) and incremental deployments with canaries.
> You can think of deploying in waves as a test, but since you're deploying to production environments, a distinction is often (always, IME) made.
> The problem with just testing is what we saw here: their tests were insufficient. Most organizations find it infeasible to prove mathematically that their testing is sufficient, so they fall back on sending updates out in waves in order to control the blast radius of a mistake. This was what CrowdStrike were missing with their so-called "Rapid Response" updates, and what they pledge to do in the future.
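The waved-rollout idea described above can be sketched in a few lines. Everything here is illustrative - the wave count, error budget, and names are invented - but it shows the core mechanism: hosts are hashed into stable waves, and each wave ships only if the previous one stayed under an error budget.

```python
# Minimal staged-rollout sketch: deterministic waves plus a crash budget.
import hashlib

WAVES = 4

def wave_of(host_id: str) -> int:
    """Deterministically assign a host to a rollout wave."""
    h = int(hashlib.sha256(host_id.encode()).hexdigest(), 16)
    return h % WAVES

def rollout(hosts, crash_reports, max_crash_rate=0.01):
    """Ship wave by wave; halt the moment a wave exceeds the error budget."""
    shipped = []
    for wave in range(WAVES):
        batch = [h for h in hosts if wave_of(h) == wave]
        shipped.extend(batch)
        crashed = sum(1 for h in batch if crash_reports(h))
        if batch and crashed / len(batch) > max_crash_rate:
            return shipped, f"halted after wave {wave}"
    return shipped, "completed"

hosts = [f"host-{i}" for i in range(1000)]
# Simulated bad update: every machine that receives it crashes.
shipped, status = rollout(hosts, crash_reports=lambda h: True)
assert status == "halted after wave 0"
assert len(shipped) < len(hosts)  # blast radius limited to the first wave
```

Even this crude version turns "every Windows host on Earth BSODs at once" into "a fraction of wave 0 BSODs and the rollout stops", which is the entire argument for staging.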