Microsoft says 8.5M systems hit by CrowdStrike BSOD, releases USB recovery tool

heyitry

Seniorius Lurkius
38
It's worth pointing out that you can use any old Windows install bootable drive/disk, and press Shift+F10 (F8 on custom WinPE bootable drives) to get a command prompt window. That doesn't automatically delete the file, of course, but if (like some of our systems) newer hardware isn't supported by the default WinPE/WinRE images you have, it's possible to load the needed drivers and then delete the file yourself, just as an example.
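For anyone going this route, the inside of that command prompt looks roughly like the following (the drive letters and driver path are placeholders for your own setup; the channel-file name is the one from CrowdStrike's own guidance):

```
rem Load a storage or NIC driver the stock WinPE image is missing (example path)
drvload X:\drivers\my_raid_controller.inf

rem If the volume is BitLocker-protected, unlock it with the recovery password first
manage-bde -unlock C: -RecoveryPassword <48-digit recovery key>

rem Remove the bad CrowdStrike channel file, then reboot normally
del C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys
```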
 
Upvote
101 (101 / 0)

preeefix

Wise, Aged Ars Veteran
181
Subscriptor++
Credit where credit is due, Microsoft is kind-of coming in clutch here. I was at Whole Foods yesterday and saw a tech manually reflashing a Self-Checkout PoS and I was thinking that automating this by using a flash-drive would probably be the best in-the-middle approach.

The only "better" way would be if they could just automate the entire bitlocker section as well, considering that they control the secureboot signing keys.
 
Upvote
128 (133 / -5)

stormcrash

Ars Legatus Legionis
10,808
It's worth pointing out that you can use any old Windows install bootable drive/disk, and press Shift+F10 (F8 on custom WinPE bootable drives) to get a command prompt window. That doesn't automatically delete the file, of course, but if (like some of our systems) newer hardware isn't supported by the default WinPE/WinRE images you have, it's possible to load the needed drivers and then delete the file yourself, just as an example.
I think the problem there is you still need the bitlocker key to access the encrypted drive contents, otherwise you're going to be doing a nuke and pave reinstall
 
Upvote
107 (108 / -1)
David Plummer, former Microsoft programmer, has an interesting top-level review of the incident for those who want a bit more information.

The crux of it is that Crowdstrike uses a kernel-level driver that parses the definition file - which is essentially a script file - at ring-0. Worse, it does not do any sanity checking on the file beforehand. And because the driver is marked as 'necessary to boot', a crash catches you in a boot loop. Just to rub salt in the wound, the Crowdstrike update ignored any staging instructions set up by the administrator, so it got pushed to /every/ machine on the network. Thus, rather than a few computers being affected, every computer that used Crowdstrike crashed.

A manually instigated safe-mode boot will get you out of the boot-crash loop by bypassing the Crowdstrike driver, allowing you to delete the broken update. But this requires physical access, something made difficult by the sheer number of machines affected and the difficulty of reaching some of them, which makes for a slow clean-up.
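To make the "no sanity checking at ring-0" point concrete, here is a toy user-space parser (my own illustration, not CrowdStrike's actual format). It trusts a count field in the header, so a file full of junk sends it reading past the end of its buffer - which in user space is a catchable exception, but in a boot-critical kernel driver is a page fault and a BSOD:

```python
import struct

def naive_parse(blob: bytes):
    """Trusts the header blindly - the kind of parser that crashes on bad input."""
    # First 4 bytes: number of 8-byte records that supposedly follow.
    (count,) = struct.unpack_from("<I", blob, 0)
    records = []
    for i in range(count):
        # No bounds check: a corrupt count walks right past the end of the buffer.
        records.append(struct.unpack_from("<Q", blob, 4 + 8 * i)[0])
    return records

# A well-formed blob parses fine...
good = struct.pack("<I", 2) + struct.pack("<QQ", 10, 20)
print(naive_parse(good))  # [10, 20]

# ...but a blob whose count field is garbage blows up. Here that's a
# struct.error; at ring-0 the equivalent wild read is a fatal page fault.
bad = struct.pack("<I", 0xFFFF) + b"\x00" * 16
try:
    naive_parse(bad)
except struct.error as e:
    print("parser crashed:", e)
```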
 
Upvote
269 (271 / -2)

Dumb Svengali

Ars Scholae Palatinae
646
"While software updates may occasionally cause disturbances, significant incidents like the CrowdStrike event are infrequent," wrote Microsoft VP of Enterprise and OS Security David Weston in a blog post. "We currently estimate that CrowdStrike’s update affected 8.5 million Windows devices, or less than one percent of all Windows machines. While the percentage was small, the broad economic and societal impacts reflect the use of CrowdStrike by enterprises that run many critical services."

I'm not in IT so I can't comment on the technical issues, but as a lowly comms hack - this statement is just yeccccchhh. Just overly workshopped garbage.

"These big events are rare. As many as 8.5 million Windows devices have been impacted. That's a small percent of Windows machines, but CrowdStrike's importance to key businesses has painfully multiplied the impact. We've developed a tool to more easily fix the machines that can't be fixed with 10-15 reboot attempts, and more tools and fixes are in the works. Stay tuned."

See, it's easy to sound like a human and do the same thing. Cut the corporate filler and ass-covering and ship the sentence. It builds trust and isn't so subtly unsettling.
 
Upvote
6 (60 / -54)

flerchin

Ars Scholae Palatinae
927
Subscriptor
I'm not in IT so I can't comment on the technical issues, but as a lowly comms hack - this statement is just yeccccchhh. Just overly workshopped garbage.

"These big events are rare. As many as 8.5 million Windows devices have been impacted. That's a small percent of Windows machines, but CrowdStrike's importance to key businesses has painfully multiplied the impact. We've developed a tool to more easily fix the machines that can't be fixed with 10-15 reboot attempts, and more tools and fixes are in the works. Stay tuned."

See, it's easy to sound like a human and do the same thing. Cut the corporate filler and ass-covering and ship the sentence. It builds trust and isn't so subtly unsettling.
Maybe I've been in corporate too long, these read approximately the same to me.
 
Upvote
120 (124 / -4)
https://www.techradar.com/pro/secur...k-down-windows-following-crowdstrike-incident
I tend to agree with Microsoft here.

The government has an important anti-trust role to play, but having a role to play and playing it well are not the same thing.
I disagree. It's difficult because CrowdStrike is a security provider, and they need to be perfectly rigorous in their own checks (do they not send their code to an in-house test machine first...?). But at the same time, Windows having complete control of kernel-level software would not necessarily stop attacks at that level. In fact, I'd argue you would have both 1.) fewer eyes on the kernel code and thus less ability to catch attacks and 2.) you'd be leaving 100% of the reporting responsibility in Microsoft's hands, we've seen how well that goes recently...
 
Last edited:
Upvote
52 (65 / -13)

Andrewcw

Ars Legatus Legionis
18,978
Subscriptor
The Cloud/VM portion of these 8.5M machines now gets to find out whether their ransomware rollback/mitigation plan actually works.
Nice of MS to help fix the fuck up that Crowdstrike made of things. I'm glad they are trying to help their customers instead of just blaming the offending party.
They had to. Nothing says "ooh, look, buy Apple" like every screen the public sees showing a BSOD. Worse yet, you'll have some corporate-type idiot want to switch without realizing what the implications could be.
 
Upvote
76 (83 / -7)

ranthog

Ars Legatus Legionis
15,240
Upvote
31 (50 / -19)

wxfisch

Ars Scholae Palatinae
949
Subscriptor++
Hopefully this is also a wake up call to the effected companies to make sure their disaster recovery plans actually work.

This won't be the last time something like this happens.
I mean, if the recovery plan for something like this is to call in all the admins and have them go desk to desk fixing PCs, that is still going to take time. That doesn't mean your recovery plans are bad, just that some recoveries are more painful than others. The costs of this disruption are still likely a lot lower than the cost to many companies of, say, storing backup images of every user desktop that they could recover to in a rare case like this.
 
Upvote
73 (73 / 0)

H2O Rip

Ars Tribunus Militum
2,128
Subscriptor++
David Plummer, former Microsoft programmer, has an interesting top-level review of the incident for those who want a bit more information.

The crux of it is that Crowdstrike uses a kernel-level driver that parses the definition file - which is essentially a script file - at ring-0. Worse, it does not do any sanity checking on the file beforehand. And because the driver is marked as 'necessary to boot', a crash catches you in a boot loop. Just to rub salt in the wound, the Crowdstrike update ignored any staging instructions set up by the administrator, so it got pushed to /every/ machine on the network. Thus, rather than a few computers being affected, every computer that used Crowdstrike crashed.

A manually instigated safe-mode boot will get you out of the boot-crash loop by bypassing the Crowdstrike driver, allowing you to delete the broken update. But this requires physical access, something made difficult by the sheer number of machines affected and the difficulty of reaching some of them, which makes for a slow clean-up.
Dave's video was a really nice overview, I watched it last night.

I am curious what % of the impacted systems use bitlocker vs those that don't. Every work system I've had for ages has required bitlocker, and that did add an extra step in fixing this situation on my system.

No real good answers here; msft is in an unenviable position overall given the nature of the issue. I am curious how they will change requirements for anything running in kernel mode afterward, because any really good avoidance here has risks involved too. My gut says some kind of backoff approach to disable the offending software package, but I am sure that could be intentionally abused too.
 
Upvote
34 (34 / 0)
Credit where credit is due, Microsoft is kind-of coming in clutch here. I was at Whole Foods yesterday and saw a tech manually reflashing a Self-Checkout PoS and I was thinking that automating this by using a flash-drive would probably be the best in-the-middle approach.

The only "better" way would be if they could just automate the entire bitlocker section as well, considering that they control the secureboot signing keys.

If MS were capable of automating the bitlocker section just because they control the secure boot keys, that would be big news in a bad, bad way. There isn't supposed to be a vendor backdoor in bitlocker; and the (platform-specific) TPM implementation is supposed to refuse to unseal the usual bitlocker key if it detects tampering or nonstandard boot conditions.

Now, what they could have done, but appear not to have, is automate the process of pulling a bunch of bitlocker recovery keys onto the recovery medium (from either AD or AAD, if you are using MS backup mechanisms; or just a CSV to cover the remaining cases) and having the correct one applied, if you have it, based on the volume ID.

Not sure if that was just significantly more work to wrap up, or if they were worried about an EZ-export-keys tool encouraging bad security practices (it's not like someone with the correct privileges couldn't recurse all the keys out fairly quickly with AD powershell; but you don't see vendor tools encouraging you to do that and then copy them to a flash drive).

That said, anyone handing out recovery keys this liberally should be preparing to rotate them once things are back online and amenable to central management; but I suspect only some outfits will actually do so, so USB drives full of hundreds or thousands of recovery keys in random techs' bags of stuff might become an issue once the immediate fixing is done.
 
Upvote
64 (66 / -2)

jhodge

Ars Tribunus Angusticlavius
8,661
Subscriptor++
IMO, CrowdStrike need to clearly and transparently explain how this happened.

  • exactly what is their pre-release test protocol?
  • was that protocol followed in this case?
  • if so, how could they have missed an issue of this magnitude?
  • if not, why is it possible to bypass testing?

...and most importantly:

  • how will they be modifying their test protocol to ensure that this cannot happen again?

I need to hear this because I can't understand how any sort of modern continuous integration and testing software development process could have shipped this to customers. It didn't only trigger in rare conditions - it killed absolutely vanilla Windows installations and did so consistently. Do they really run an environment where they ship code (ok, definition updates in this case) to customers without even a limited internal deployment to a test farm first?

Now that the acute issue is on its way to resolution, we all deserve some answers.
 
Upvote
235 (237 / -2)
That is one heck of a claim, given that Microsoft could have simply created the types of APIs necessary for this type of server, which is vendor independent.
Perhaps it wouldn't be the case anymore, but I can totally see how doing that in the past could have been a performance nightmare. AV/antimalware is already excellent at bringing a system to its knees; now imagine that with the added overhead of interprocess messaging between kernel-space and user-space drivers.
 
Upvote
9 (19 / -10)

mmiller7

Ars Legatus Legionis
12,349
I think the problem there is you still need the bitlocker key to access the encrypted drive contents, otherwise you're going to be doing a nuke and pave reinstall
Every procedure I've seen requires the recovery key though, not just the "regular" user key. That was one of the limitations we hit: IT took my laptop to fix, but they didn't yet have access to the bitlocker recovery key system and couldn't use my own PIN to unlock it (they even had me try typing it in the command window to attempt unlocking; it didn't work). I could put my PIN in to reach the BSOD... but not do any kind of repair.

I would love to know the technical reason why there couldn't be a fix-it tool that took my normal boot-up bitlocker PIN to unlock the hard disk and fixed the bad file automatically... ESPECIALLY if it's a tool built by Microsoft that has all the right magic SecureBoot and other signing keys
 
Upvote
23 (23 / 0)

SGJ

Ars Praetorian
519
Subscriptor++
According to The Register, Crowdstrike's failure to adequately sanitise input extends to the Linux version of the Falcon sensor as well as Windows. I think the failure also implies that they aren't utilising fuzzing to test their software.

Given that the Crowdstrike CEO was CTO at McAfee in 2010 when they were responsible for a similar incident I suggest watching George Kurtz's future career trajectory and avoiding software from any company that employs him!
 
Upvote
149 (150 / -1)

nzeid

Ars Praetorian
575
Subscriptor
The "easy" fix documented by both CrowdStrike (whose direct fault this is) and Microsoft (which has taken a lot of the blame for it in mainstream reporting, partly because of an unrelated July 18 Azure outage that had hit shortly before)

Glad this is finally getting said out loud - I joked with friends last week that Microsoft has a rock-solid defamation case against news publications. Nothing that wasn't running Crowdstrike went down. So why do the headlines say "Microsoft bug"???
 
Upvote
76 (80 / -4)
According to The Register Crowdstrike's failure to adequately sanitise input extends to the Linux version of the Falcon sensor as well as Windows. I think the failure also implies that they aren't utilising fuzzing to test their software.

Given that the Crowdstrike CEO was CTO at McAfee in 2010 when they were responsible for a similar incident I suggest watching George Kurtz's future career trajectory and avoiding software from any company that employs him!
It's kind of incredible that they blew up tons of Linux systems a few months ago and nobody really registered it. It also shows that this isn't some Microsoft/Windows-exclusive problem (I'm so sick of the "lol winowz bad" snark); it's just a risk of doing anything this close to the kernel, which is sometimes necessary
 
Upvote
80 (85 / -5)

motytrah

Ars Tribunus Militum
2,942
Subscriptor++
Nice of MS to help fix the fuck up that Crowdstrike made of things. I'm glad they are trying to help their customers instead of just blaming the offending party.
I think they know that if they get into that game, it will end with a lot of customers moving things to other operating systems. There's nothing inherently complex about signage and kiosks that means it HAS to be Windows.
 
Upvote
33 (34 / -1)

Rosyna

Ars Tribunus Angusticlavius
6,966
https://www.techradar.com/pro/secur...k-down-windows-following-crowdstrike-incident
I tend to agree with Microsoft here.

The government has an important anti-trust role to play, but having a role to play and playing it well are not the same thing.
This is extraordinarily misleading, as it would just mean that, to remain compliant in the EU, Defender would have to dogfood the same out-of-kernel APIs that other EDR vendors would get.

But Microsoft won't do it because it doesn't think there should be a security boundary between admin and kernel, a security boundary that's required when implementing a replacement API.
 
Last edited:
Upvote
7 (20 / -13)
IMO, CrowdStrike need to clearly and transparently explain how this happened.

  • exactly what is their pre-release test protocol?
  • was that protocol followed in this case?
  • if so, how could they have missed an issue of this magnitude?
  • if not, why is it possible to bypass testing?

...and most importantly:

  • how will they be modifying their test protocol to ensure that this cannot happen again?

I need to hear this because I can't understand how any sort of modern continuous integration and testing software development process could have shipped this to customers. It didn't only trigger in rare conditions - it killed absolutely vanilla Windows installations and did so consistently. Do they really run an environment where they ship code (ok, definition updates in this case) to customers without even a limited internal deployment to a test farm first?

Now that the acute issue is on its way to resolution, we all deserve some answers.
I hope there's industry demand for 3rd party validation from kernel driver experts.
 
Upvote
8 (13 / -5)
Do they really run an environment where they ship code (ok, definition updates in this case) to customers without even a limited internal deployment to a test farm first?
It appears they pushed a zeroed-out or otherwise corrupted definitions file. Perhaps a placeholder file.

If the definitions file pushed to customers was sent out in error (or full of errors), no amount of pre-push testing may have prevented this. Of course, they could push in stages, with the first stage going to heavily monitored systems. I imagine they will start doing this now.

The real question is why their parser didn't reject the improper file. Not only did the parser not require securely signed code, it appears to have had no validation whatsoever... This for a parser that is running in ring zero of the kernel.

Further, some have suggested that the definitions file may have been that in name only, in that it may include code that the parser executes, again, in ring zero of the kernel. Executable code that is pushed out to customers multiple times each day, often to prevent zero-day exploits.

The internal testing of such frequent updates must be... challenging. Given the lack of parser validation and the frequency of these updates, a catastrophe like this was likely inevitable.

Crowdstrike customers should be demanding detailed answers to each of these questions. There is now quite a lot of competition in this market.
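For what it's worth, the kind of pre-parse validation being described here can be sketched in a few lines of user-space Python (the magic bytes and layout below are invented for illustration; CrowdStrike's real channel-file format isn't public). The idea is simply that nothing reaches the parser until magic, length, and checksum all check out:

```python
import hashlib
import struct

MAGIC = b"DEFS"  # invented magic value, purely for this sketch

def build_definitions(payload: bytes) -> bytes:
    """Wrap a payload with magic, length, and a SHA-256 digest."""
    return MAGIC + struct.pack("<I", len(payload)) + hashlib.sha256(payload).digest() + payload

def validate_definitions(blob: bytes) -> bytes:
    """Reject anything malformed *before* the parser ever sees it."""
    if len(blob) < 4 + 4 + 32 or blob[:4] != MAGIC:
        raise ValueError("bad magic: not a definitions file")
    (length,) = struct.unpack_from("<I", blob, 4)
    digest, payload = blob[8:40], blob[40:]
    if len(payload) != length:
        raise ValueError("truncated or padded payload")
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: corrupted in transit or on disk")
    return payload

ok = build_definitions(b"rule: block-evil-thing")
assert validate_definitions(ok) == b"rule: block-evil-thing"

# An all-NULL file - like the corrupted channel files seen in this incident -
# fails the very first check instead of ever reaching the ring-0 parser.
try:
    validate_definitions(b"\x00" * 1024)
except ValueError as e:
    print("rejected:", e)
```

A hash only catches accidental corruption; the signing requirement mentioned above would swap the digest for an asymmetric signature check, so tampered files fail closed too.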
 
Upvote
116 (117 / -1)

Fearknot

Ars Scholae Palatinae
1,335
https://www.techradar.com/pro/secur...k-down-windows-following-crowdstrike-incident
I tend to agree with Microsoft here.

The government has an important anti-trust role to play, but having a role to play and playing it well are not the same thing.
That's a very weak argument. If Microsoft was able to lock down the kernel, then the CrowdStrike functionality could only be offered by Microsoft. What makes you think that Microsoft engineers could never make a similar mistake as happened in this case?

Furthermore, the EU rule doesn't prevent Microsoft from making their own similar product. I don't know whether they have such a product, but obviously a number of very large companies prefers to rely on CrowdStrike.
 
Upvote
8 (23 / -15)

SGJ

Ars Praetorian
519
Subscriptor++
It's kind of incredible that they blew up tons of Linux systems a few months ago and nobody really registered it. It also shows that this isn't some Microsoft/Windows-exclusive problem (I'm so sick of the "lol winowz bad" snark); it's just a risk of doing anything this close to the kernel, which is sometimes necessary
I understand why some software has to operate close to the metal, but Windows and Linux operate with only two protection rings when x64 CPUs support four. Perhaps there's an argument for device drivers (and the Falcon agent was written as a device driver) to sit in a ring above the kernel but below user space. Increasing amounts of software are being run in kernel space (the WireGuard VPN, e.g.), partly because the transition from user space to kernel space takes too long, so reducing that transition time would also need to be looked at. This is very definitely not a quick fix, but we must look at ways of improving software reliability and security.
 
Upvote
16 (23 / -7)

Rosyna

Ars Tribunus Angusticlavius
6,966
It appears they pushed a zeroed-out or otherwise corrupted definitions file. Perhaps a placeholder file.

If the definitions file pushed to customers was sent out in error (or full of errors), no amount of pre-push testing may have prevented this. Of course, they could push in stages, with the first stage going to heavily monitored systems. Imagine they will start doing this now.

The real question is why their parser didn't reject the improper file. Not only did the parser not require securely signed code, it appears to have had no validation whatsoever... This for a parser that is running in ring zero of the kernel...

Further, some have suggested that the definitions file may have been that in name only, in that it may include code that the parser executes, again, in ring zero of the kernel. Executable code that is pushed out to customers multiple times each day.

Crowdstrike customers should be demanding detailed answers to each of these questions. There is now quite a lot of competition in this market.
The definitions file that ended up laid down on disk was corrupted in some manner. You can get multiple copies of the same bad 291*.*32.sys file and see that each has a different byte sequence.

In some, it’s full of NULL bytes. In others, it’s just junk. We won’t know more until the root cause analysis is done, but I’m betting the definitions file they tested pre-deploy wasn’t corrupted.
 
Upvote
35 (35 / 0)
It appears they pushed a zeroed-out or otherwise corrupted definitions file. Perhaps a placeholder file.

If the definitions file pushed to customers was sent out in error (or full of errors), no amount of pre-push testing may have prevented this. Of course, they could push in stages, with the first stage going to heavily monitored systems. Imagine they will start doing this now.

The real question is why their parser didn't reject the improper file? Not only did the parser not require securely signed code, it appears to have had no validation whatsoever... This for a parser that is running in ring zero of the kernel.

Further, some have suggested that the definitions file may have been that in name only. In that it may include code that the parser executes, again, in ring zero of the kernal. Executable code that is pushed out to customers multiple times each day, often to prevent zero day exploits.

The internal testing of such frequent updates must be .. challenging. Given the lack of parser validation and the frequency of these updates, a catastrophe like this was likely inevitable.

Crowdstrike customers should be demanding detailed answers to each of these questions. There is now quite a lot of competition in this market.
They have telemetry from devices. We know exactly how many are affected. Rollout should cease if there's a problem.
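That telemetry-driven rollout idea can be sketched in a few lines of Python (the names, wave sizes, and threshold are my own invention, not anyone's real deployment pipeline): push to waves of increasing size, watch the crash rate reported back, and halt the moment a wave looks bad.

```python
def staged_rollout(fleet_size, wave_fractions, get_crash_rate, crash_threshold=0.02):
    """Push an update in waves, halting as soon as telemetry shows trouble.

    get_crash_rate(deployed) -> observed crash fraction among machines updated so far.
    """
    deployed = 0
    rate = 0.0
    for frac in wave_fractions:
        deployed = max(deployed, int(fleet_size * frac))
        rate = get_crash_rate(deployed)
        if rate > crash_threshold:
            # Telemetry says machines are crashing: stop the rollout right here.
            return {"halted": True, "deployed": deployed, "crash_rate": rate}
    return {"halted": False, "deployed": deployed, "crash_rate": rate}

# An update that BSODs every machine it touches is stopped at the first
# canary wave: ~0.1% of the fleet is hit instead of all 8.5M machines.
result = staged_rollout(8_500_000, (0.001, 0.01, 0.1, 1.0), lambda n: 1.0)
print(result)  # halted with 8,500 machines deployed, not 8.5 million
```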
 
Upvote
24 (25 / -1)
According to The Register Crowdstrike's failure to adequately sanitise input extends to the Linux version of the Falcon sensor as well as Windows. I think the failure also implies that they aren't utilising fuzzing to test their software.

Given that the Crowdstrike CEO was CTO at McAfee in 2010 when they were responsible for a similar incident I suggest watching George Kurtz's future career trajectory and avoiding software from any company that employs him!
Yeah, this particular incident didn't impact Linux systems, but they did have a similar event with the Linux software a month or two ago. Which is fucking wild.
 
Upvote
60 (60 / 0)
This is extraordinarily misleading as it just would mean Defender would have to dogfood the same out-of-kernel APIs that other EDR vendors would get to remain compliant in the EU.

But Microsoft won’t do it because it doesn’t think there should be a security boundary between admin and kernel. A security boundary that’s required when implementing a replacement API.
Do they require that of Apple with Gatekeeper and the other built-in Mac AV vs. competitors? I know it's been a long-standing complaint how gimped/limited Mac AV is compared to Windows, precisely because of this
 
Upvote
11 (12 / -1)