Microsoft says 8.5M systems hit by CrowdStrike BSOD, releases USB recovery tool

heyitry

Seniorius Lurkius
38
It's worth pointing out that you can use any old Windows install bootable drive/disk, and press Shift+F10 (F8 on custom WinPE bootable drives) to get a command prompt window. That doesn't automatically delete the file, of course, but if (like some of our systems) newer hardware isn't supported by the default WinPE/WinRE images you have, it's possible to load the needed drivers and then delete the file yourself, just as an example.
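For anyone going this route, the inside of that command prompt looks roughly like the following (the drive letters and driver path are placeholders for your own setup; the channel-file name is the one from CrowdStrike's own guidance):

```
rem Load a storage or NIC driver the stock WinPE image is missing (example path)
drvload X:\drivers\my_raid_controller.inf

rem If the volume is BitLocker-protected, unlock it with the recovery password first
manage-bde -unlock C: -RecoveryPassword <48-digit recovery key>

rem Remove the bad CrowdStrike channel file, then reboot normally
del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
```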
 
Upvote
101 (101 / 0)

preeefix

Wise, Aged Ars Veteran
181
Subscriptor++
Credit where credit is due, Microsoft is kind-of coming in clutch here. I was at Whole Foods yesterday and saw a tech manually reflashing a Self-Checkout PoS and I was thinking that automating this by using a flash-drive would probably be the best in-the-middle approach.

The only "better" way would be if they could just automate the entire bitlocker section as well, considering that they control the secureboot signing keys.
 
Upvote
128 (133 / -5)

stormcrash

Ars Legatus Legionis
10,808
It's worth pointing out that you can use any old Windows install bootable drive/disk, and press Shift+F10 (F8 on custom WinPE bootable drives) to get a command prompt window. That doesn't automatically delete the file, of course, but if (like some of our systems) newer hardware isn't supported by the default WinPE/WinRE images you have, it's possible to load the needed drivers and then delete the file yourself, just as an example.
I think the problem there is you still need the bitlocker key to access the encrypted drive contents, otherwise you're going to be doing a nuke and pave reinstall
 
Upvote
107 (108 / -1)
David Plummer, former Microsoft programmer, has an interesting top-level review of the incident for those who want a bit more information.

The crux of it is that Crowdstrike uses a kernel-level driver that parses the definition file - which is essentially a script file - at ring-0. Worse, it does not do any sanity checking on the file beforehand. And because the driver is marked as 'necessary to boot', a crash catches you in a boot loop. Just to rub salt in the wound, the Crowdstrike update ignored any staging instructions set up by the administrator, so it got pushed to /every/ machine on the network. Thus, rather than a few computers being affected, every computer that used Crowdstrike crashed.

A manually instigated safe-mode boot will get you out of the boot-crash loop by bypassing the Crowdstrike driver, allowing you to delete the broken update. But this requires physical access, something made difficult by the sheer number of machines affected and the difficulty of reaching some of them, which makes for a slow clean-up.
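To make the "no sanity checking at ring-0" point concrete, here is a toy user-space parser (my own illustration, not CrowdStrike's actual format). It trusts a count field in the header, so a file full of junk sends it reading past the end of its buffer - which in user space is a catchable exception, but in a boot-critical kernel driver is a page fault and a BSOD:

```python
import struct

def naive_parse(blob: bytes):
    """Trusts the header blindly - the kind of parser that crashes on bad input."""
    # First 4 bytes: number of 8-byte records that supposedly follow.
    (count,) = struct.unpack_from("<I", blob, 0)
    records = []
    for i in range(count):
        # No bounds check: a corrupt count walks right past the end of the buffer.
        records.append(struct.unpack_from("<Q", blob, 4 + 8 * i)[0])
    return records

# A well-formed blob parses fine...
good = struct.pack("<I", 2) + struct.pack("<QQ", 10, 20)
print(naive_parse(good))  # [10, 20]

# ...but a blob whose count field is garbage blows up. Here that's a
# struct.error; at ring-0 the equivalent wild read is a fatal page fault.
bad = struct.pack("<I", 0xFFFF) + b"\x00" * 16
try:
    naive_parse(bad)
except struct.error as e:
    print("parser crashed:", e)
```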
 
Upvote
269 (271 / -2)

Dumb Svengali

Ars Scholae Palatinae
646
"While software updates may occasionally cause disturbances, significant incidents like the CrowdStrike event are infrequent," wrote Microsoft VP of Enterprise and OS Security David Weston in a blog post. "We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines. While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services."

I'm not in IT so I can't comment on the technical issues, but as a lowly comms hack - this statement is just yeccccchhh. Just overly workshopped garbage.

"These big events are rare. As many as 8.5 million Windows devices have been impacted. That's a small percent of Windows machines, but CrowdStrike's importance to key businesses has painfully multiplied the impact. We've developed a tool to more easily fix the machines that can't be fixed with 10-15 reboot attempts, and more tools and fixes are in the works. Stay tuned."

See, it's easy to sound like a human and do the same thing. Cut the corporate filler and ass-covering and ship the sentence. It builds trust and isn't so subtly unsettling.
 
Upvote
6 (60 / -54)

flerchin

Ars Scholae Palatinae
927
Subscriptor
I'm not in IT so I can't comment on the technical issues, but as a lowly comms hack - this statement is just yeccccchhh. Just overly workshopped garbage.

"These big events are rare. As many as 8.5 million Windows devices have been impacted. That's a small percent of Windows machines, but CrowdStrike's importance to key businesses has painfully multiplied the impact. We've developed a tool to more easily fix the machines that can't be fixed with 10-15 reboot attempts, and more tools and fixes are in the works. Stay tuned."

See, it's easy to sound like a human and do the same thing. Cut the corporate filler and ass-covering and ship the sentence. It builds trust and isn't so subtly unsettling.
Maybe I've been in corporate too long, these read approximately the same to me.
 
Upvote
120 (124 / -4)
https://www.techradar.com/pro/secur...k-down-windows-following-crowdstrike-incident
I tend to agree with Microsoft here.

The government has an important anti-trust role to play, but having a role to play and playing it well are not the same thing.
I disagree. It's difficult because CrowdStrike is a security provider, and they need to be perfectly rigorous in their own checks (do they not send their code to an in-house test machine first...?). But at the same time, Windows having complete control of kernel-level software would not necessarily stop attacks at that level. In fact, I'd argue you would have both 1.) fewer eyes on the kernel code and thus less ability to catch attacks and 2.) you'd be leaving 100% of the reporting responsibility in Microsoft's hands, we've seen how well that goes recently...
 
Last edited:
Upvote
52 (65 / -13)

Andrewcw

Ars Legatus Legionis
18,978
Subscriptor
The Cloud/VM portion of these 8.5M machines now gets to find out whether their ransomware rollback/mitigation plan actually works.
Nice of MS to help fix the fuck up that Crowdstrike made of things. I'm glad they are trying to help their customers instead of just blaming the offending party.
They had to. Nothing says "ooh, look, buy Apple" like every screen the public sees showing a BSOD. Worse yet, you'll have some corporate-type idiot want to switch without realizing what the implications could be.
 
Upvote
76 (83 / -7)

ranthog

Ars Legatus Legionis
15,240
Upvote
31 (50 / -19)

wxfisch

Ars Scholae Palatinae
949
Subscriptor++
Hopefully this is also a wake up call to the effected companies to make sure their disaster recovery plans actually work.

This won't be the last time something like this happens.
I mean, if the recovery plan for something like this is to call in all the admins and have them go desk to desk fixing PCs, that is still going to take time. That doesn't mean your recovery plans are bad, just that some recoveries are more painful than others. The costs of this disruption are still likely a lot lower than the cost to many companies of, say, storing backup images of every user desktop that they could recover to in a rare case like this.
 
Upvote
73 (73 / 0)

H2O Rip

Ars Tribunus Militum
2,128
Subscriptor++
David Plummer, former Microsoft programmer, has an interesting top-level review of the incident for those who want a bit more information.

The crux of it is that Crowdstrike uses a kernel-level driver that parses the definition file - which is essentially a script file - at ring-0. Worse, it does not do any sanity checking on the file beforehand. And because the driver is marked as 'necessary to boot', a crash catches you in a boot loop. Just to rub salt in the wound, the Crowdstrike update ignored any staging instructions set up by the administrator, so it got pushed to /every/ machine on the network. Thus, rather than a few computers being affected, every computer that used Crowdstrike crashed.

A manually instigated safe-mode boot will get you out of the boot-crash loop by bypassing the Crowdstrike driver, allowing you to delete the broken update. But this requires physical access, something made difficult by the sheer number of machines affected and the difficulty of reaching some of them, which makes for a slow clean-up.
Dave's video was a really nice overview, I watched it last night.

I am curious what % of the impacted systems use bitlocker vs those that don't. Every work system I've had for ages has required bitlocker, and that did add an extra step in fixing this situation on my system.

No real good answers here; msft is in an unenviable position overall given the nature of the issue. I am curious how they will change requirements for anything running in kernel mode afterward, because any really good avoidance here has risks involved too. My gut says some kind of backoff approach to disable the offending software package, but I am sure that could be intentionally abused too.
 
Upvote
34 (34 / 0)
Credit where credit is due, Microsoft is kind-of coming in clutch here. I was at Whole Foods yesterday and saw a tech manually reflashing a Self-Checkout PoS and I was thinking that automating this by using a flash-drive would probably be the best in-the-middle approach.

The only "better" way would be if they could just automate the entire bitlocker section as well, considering that they control the secureboot signing keys.

If MS were capable of automating the bitlocker section just because they control the secure boot keys, that would be big news in a bad, bad way. There isn't supposed to be a vendor backdoor in bitlocker; and the (platform-specific) TPM implementation is supposed to refuse to unseal the usual bitlocker key if it detects tampering or nonstandard boot conditions.

Now, what they could have done, but appear not to have, is automate the process of pulling a bunch of bitlocker recovery keys onto the recovery medium (from either AD or AAD, if you are using MS backup mechanisms; or just a CSV to cover the remaining cases) and having the correct one applied, if you have it, based on the volume ID.

Not sure if that was just significantly more work to wrap up, or if they were worried about an EZ-export-keys tool encouraging bad security practices (it's not like someone with the correct privileges couldn't recurse all the keys out fairly quickly with AD powershell; but you don't see vendor tools encouraging you to do that and then copy them to a flash drive).

That said, anyone handing out recovery keys this liberally should be preparing to rotate them once things are back online and amenable to central management; but I suspect only some outfits will actually do so, so USB drives full of hundreds or thousands of recovery keys in random techs' bags of stuff might become an issue once the immediate fixing is done.
 
Upvote
64 (66 / -2)

jhodge

Ars Tribunus Angusticlavius
8,661
Subscriptor++
IMO, CrowdStrike need to clearly and transparently explain how this happened.

  • exactly what is their pre-release test protocol?
  • was that protocol followed in this case?
  • if so, how could they have missed an issue of this magnitude?
  • if not, why is it possible to bypass testing?

...and most importantly:

  • how will they be modifying their test protocol to ensure that this cannot happen again?

I need to hear this because I can't understand how any sort of modern continuous integration and testing software development process could have shipped this to customers. It didn't only trigger in rare conditions - it killed absolutely vanilla Windows installations and did so consistently. Do they really run an environment where they ship code (ok, definition updates in this case) to customers without even a limited internal deployment to a test farm first?

Now that the acute issue is on its way to resolution, we all deserve some answers.
 
Upvote
235 (237 / -2)
That is one heck of a claim, given that Microsoft could have simply created the types of APIs necessary for this type of server, which is vendor independent.
Perhaps it wouldn't be the case anymore, but I can totally see how doing that in the past could have been a performance nightmare. AV/antimalware is already excellent at bringing a system to its knees; now imagine that with the added overhead of interprocess messaging between kernel-space and user-space drivers.
 
Upvote
9 (19 / -10)

mmiller7

Ars Legatus Legionis
12,349
I think the problem there is you still need the bitlocker key to access the encrypted drive contents, otherwise you're going to be doing a nuke and pave reinstall
Every procedure I've seen requires the recovery key though, not just the "regular" user key. That was one of the limitations we hit: IT took my laptop to fix, but they didn't yet have access to the bitlocker recovery key system and couldn't use my own PIN to unlock it (they even had me try typing it in the command window to attempt unlocking; it didn't work). I could put my PIN in to reach the BSOD... but not do any kind of repair.

I would love to know the technical reason why there couldn't be a fix-it tool that took my normal boot-up bitlocker PIN to unlock the hard disk and fixed the bad file automatically... ESPECIALLY if it's a tool built by Microsoft that has all the right magic SecureBoot and other signing keys
 
Upvote
23 (23 / 0)

SGJ

Ars Praetorian
519
Subscriptor++
According to The Register, Crowdstrike's failure to adequately sanitise input extends to the Linux version of the Falcon sensor as well as Windows. I think the failure also implies that they aren't utilising fuzzing to test their software.

Given that the Crowdstrike CEO was CTO at McAfee in 2010 when they were responsible for a similar incident I suggest watching George Kurtz's future career trajectory and avoiding software from any company that employs him!
 
Upvote
149 (150 / -1)

nzeid

Ars Praetorian
575
Subscriptor
The "easy" fix documented by both CrowdStrike (whose direct fault this is) and Microsoft (which has taken a lot of the blame for it in mainstream reporting, partly because of an unrelated July 18 Azure outage that had hit shortly before)

Glad this is finally getting said out loud - I joked with friends last week that Microsoft has a rock-solid defamation case against news publications. Nothing that wasn't running Crowdstrike went down. So why do the headlines say "Microsoft bug"???
 
Upvote
76 (80 / -4)
According to The Register Crowdstrike's failure to adequately sanitise input extends to the Linux version of the Falcon sensor as well as Windows. I think the failure also implies that they aren't utilising fuzzing to test their software.

Given that the Crowdstrike CEO was CTO at McAfee in 2010 when they were responsible for a similar incident I suggest watching George Kurtz's future career trajectory and avoiding software from any company that employs him!
It's kind of incredible that they blew up tons of Linux systems a few months ago and nobody really registered it. It also shows that this isn't some Microsoft/Windows-exclusive problem (I'm so sick of the "lol winowz bad" snark); it's just a risk of doing anything this close to the kernel, which is sometimes necessary
 
Upvote
80 (85 / -5)

motytrah

Ars Tribunus Militum
2,942
Subscriptor++
Nice of MS to help fix the fuck up that Crowdstrike made of things. I'm glad they are trying to help their customers instead of just blaming the offending party.
I think they know that if they get into that game, it will end with a lot of customers moving things to other operating systems. There's nothing inherently complex about signage and kiosks that means it HAS to be Windows.
 
Upvote
33 (34 / -1)

Rosyna

Ars Tribunus Angusticlavius
6,966
https://www.techradar.com/pro/secur...k-down-windows-following-crowdstrike-incident
I tend to agree with Microsoft here.

The government has an important anti-trust role to play, but having a role to play and playing it well are not the same thing.
This is extraordinarily misleading, as it would just mean that, to remain compliant in the EU, Defender would have to dogfood the same out-of-kernel APIs that other EDR vendors would get.

But Microsoft won't do it because it doesn't think there should be a security boundary between admin and kernel, a security boundary that's required when implementing a replacement API.
 
Last edited:
Upvote
7 (20 / -13)
IMO, CrowdStrike need to clearly and transparently explain how this happened.

  • exactly what is their pre-release test protocol?
  • was that protocol followed in this case?
  • if so, how could they have missed an issue of this magnitude?
  • if not, why is it possible to bypass testing?

...and most importantly:

  • how will they be modifying their test protocol to ensure that this cannot happen again?

I need to hear this because I can't understand how any sort of modern continuous integration and testing software development process could have shipped this to customers. It didn't only trigger in rare conditions - it killed absolutely vanilla Windows installations and did so consistently. Do they really run an environment where they ship code (ok, definition updates in this case) to customers without even a limited internal deployment to a test farm first?

Now that the acute issue is on its way to resolution, we all deserve some answers.
I hope there's industry demand for 3rd party validation from kernel driver experts.
 
Upvote
8 (13 / -5)
Do they really run an environment where they ship code (ok, definition updates in this case) to customers without even a limited internal deployment to a test farm first?
It appears they pushed a zeroed-out or otherwise corrupted definitions file. Perhaps a placeholder file.

If the definitions file pushed to customers was sent out in error (or full of errors), no amount of pre-push testing may have prevented this. Of course, they could push in stages, with the first stage going to heavily monitored systems. I imagine they will start doing this now.

The real question is why their parser didn't reject the improper file. Not only did the parser not require securely signed code, it appears to have had no validation whatsoever... This for a parser that is running in ring zero of the kernel.

Further, some have suggested that the definitions file may have been that in name only, in that it may include code that the parser executes, again, in ring zero of the kernel. Executable code that is pushed out to customers multiple times each day, often to prevent zero-day exploits.

The internal testing of such frequent updates must be... challenging. Given the lack of parser validation and the frequency of these updates, a catastrophe like this was likely inevitable.

Crowdstrike customers should be demanding detailed answers to each of these questions. There is now quite a lot of competition in this market.
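For what it's worth, the kind of pre-parse validation being described here can be sketched in a few lines of user-space Python (the magic bytes and layout below are invented for illustration; CrowdStrike's real channel-file format isn't public). The idea is simply that nothing reaches the parser until magic, length, and checksum all check out:

```python
import hashlib
import struct

MAGIC = b"DEFS"  # invented magic value, purely for this sketch

def build_definitions(payload: bytes) -> bytes:
    """Wrap a payload with magic, length, and a SHA-256 digest."""
    return MAGIC + struct.pack("<I", len(payload)) + hashlib.sha256(payload).digest() + payload

def validate_definitions(blob: bytes) -> bytes:
    """Reject anything malformed *before* the parser ever sees it."""
    if len(blob) < 4 + 4 + 32 or blob[:4] != MAGIC:
        raise ValueError("bad magic: not a definitions file")
    (length,) = struct.unpack_from("<I", blob, 4)
    digest, payload = blob[8:40], blob[40:]
    if len(payload) != length:
        raise ValueError("truncated or padded payload")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: corrupted in transit or on disk")
    return payload

ok = build_definitions(b"rule: block-evil-thing")
assert validate_definitions(ok) == b"rule: block-evil-thing"

# An all-NULL file - like the corrupted channel files seen in this incident -
# fails the very first check instead of ever reaching the ring-0 parser.
try:
    validate_definitions(b"\x00" * 1024)
except ValueError as e:
    print("rejected:", e)
```

A hash only catches accidental corruption; the signing requirement mentioned above would swap the digest for an asymmetric signature check, so tampered files fail closed too.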
 
Upvote
116 (117 / -1)

Fearknot

Ars Scholae Palatinae
1,335
https://www.techradar.com/pro/secur...k-down-windows-following-crowdstrike-incident
I tend to agree with Microsoft here.

The government has an important anti-trust role to play, but having a role to play and playing it well are not the same thing.
That's a very weak argument. If Microsoft was able to lock down the kernel, then the CrowdStrike functionality could only be offered by Microsoft. What makes you think that Microsoft engineers could never make a similar mistake as happened in this case?

Furthermore, the EU rule doesn't prevent Microsoft from making their own similar product. I don't know whether they have such a product, but obviously a number of very large companies prefers to rely on CrowdStrike.
 
Upvote
8 (23 / -15)

SGJ

Ars Praetorian
519
Subscriptor++
It's kind of incredible that they blew up tons of Linux systems a few months ago and nobody really registered it. It also shows that this isn't some Microsoft/Windows-exclusive problem (I'm so sick of the "lol winowz bad" snark); it's just a risk of doing anything this close to the kernel, which is sometimes necessary
I understand why some software has to operate close to the metal, but Windows and Linux operate with only two protection rings when x64 CPUs support four. Perhaps there's an argument for device drivers (and the Falcon agent was written as a device driver) to sit in a ring above the kernel but below user space. Increasing amounts of software are being run in kernel space (the WireGuard VPN, e.g.), partly because the transition from user space to kernel space takes too long, so reducing that transition time would also need to be looked at. This is very definitely not a quick fix, but we must look at ways of improving software reliability and security.
 
Upvote
16 (23 / -7)

Rosyna

Ars Tribunus Angusticlavius
6,966
It appears they pushed a zeroed-out or otherwise corrupted definitions file. Perhaps a placeholder file.

If the definitions file pushed to customers was sent out in error (or full of errors), no amount of pre-push testing may have prevented this. Of course, they could push in stages, with the first stage going to heavily monitored systems. Imagine they will start doing this now.

The real question is why their parser didn't reject the improper file. Not only did the parser not require securely signed code, it appears to have had no validation whatsoever... This for a parser that is running in ring zero of the kernel...

Further, some have suggested that the definitions file may have been that in name only, in that it may include code that the parser executes, again, in ring zero of the kernel. Executable code that is pushed out to customers multiple times each day.

Crowdstrike customers should be demanding detailed answers to each of these questions. There is now quite a lot of competition in this market.
The definitions file that ended up laid down on disk was corrupted in some manner. You can get multiple copies of the same bad 291*.*32.sys file and see that each has a different byte sequence.

In some, it’s full of NULL bytes. In others, it’s just junk. We won’t know more until the root cause analysis is done, but I’m betting the definitions file they tested pre-deploy wasn’t corrupted.
 
Upvote
35 (35 / 0)
It appears they pushed a zeroed-out or otherwise corrupted definitions file. Perhaps a placeholder file.

If the definitions file pushed to customers was sent out in error (or full of errors), no amount of pre-push testing may have prevented this. Of course, they could push in stages, with the first stage going to heavily monitored systems. Imagine they will start doing this now.

The real question is why their parser didn't reject the improper file? Not only did the parser not require securely signed code, it appears to have had no validation whatsoever... This for a parser that is running in ring zero of the kernel.

Further, some have suggested that the definitions file may have been that in name only. In that it may include code that the parser executes, again, in ring zero of the kernal. Executable code that is pushed out to customers multiple times each day, often to prevent zero day exploits.

The internal testing of such frequent updates must be .. challenging. Given the lack of parser validation and the frequency of these updates, a catastrophe like this was likely inevitable.

Crowdstrike customers should be demanding detailed answers to each of these questions. There is now quite a lot of competition in this market.
They have telemetry from devices. We know exactly how many are affected. Rollout should cease if there's a problem.
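That telemetry-driven rollout idea can be sketched in a few lines of Python (the names, wave sizes, and threshold are my own invention, not anyone's real deployment pipeline): push to waves of increasing size, watch the crash rate reported back, and halt the moment a wave looks bad.

```python
def staged_rollout(fleet_size, wave_fractions, get_crash_rate, crash_threshold=0.02):
    """Push an update in waves, halting as soon as telemetry shows trouble.

    get_crash_rate(deployed) -> observed crash fraction among machines updated so far.
    """
    deployed = 0
    rate = 0.0
    for frac in wave_fractions:
        deployed = max(deployed, int(fleet_size * frac))
        rate = get_crash_rate(deployed)
        if rate > crash_threshold:
            # Telemetry says machines are crashing: stop the rollout right here.
            return {"halted": True, "deployed": deployed, "crash_rate": rate}
    return {"halted": False, "deployed": deployed, "crash_rate": rate}

# An update that BSODs every machine it touches is stopped at the first
# canary wave: ~0.1% of the fleet is hit instead of all 8.5M machines.
result = staged_rollout(8_500_000, (0.001, 0.01, 0.1, 1.0), lambda n: 1.0)
print(result)  # halted with 8,500 machines deployed, not 8.5 million
```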
 
Upvote
24 (25 / -1)
According to The Register Crowdstrike's failure to adequately sanitise input extends to the Linux version of the Falcon sensor as well as Windows. I think the failure also implies that they aren't utilising fuzzing to test their software.

Given that the Crowdstrike CEO was CTO at McAfee in 2010 when they were responsible for a similar incident I suggest watching George Kurtz's future career trajectory and avoiding software from any company that employs him!
Yeah, this particular incident didn't impact Linux systems, but they did have a similar event with the Linux software a month or two ago. Which is fucking wild.
 
Upvote
60 (60 / 0)
This is extraordinarily misleading as it just would mean Defender would have to dogfood the same out-of-kernel APIs that other EDR vendors would get to remain compliant in the EU.

But Microsoft won’t do it because it doesn’t think there should be a security boundary between admin and kernel. A security boundary that’s required when implementing a replacement API.
Do they require that of Apple with Gatekeeper and the other built-in Mac AV vs. competitors? I know it's been a long-standing complaint how gimped/limited Mac AV is compared to Windows, precisely because of this
 
Upvote
11 (12 / -1)