A single point of failure triggered the Amazon outage affecting millions

Kiddluck

Wise, Aged Ars Veteran
125
Subscriptor
Was waiting for someone to do a more technical write-up. I wanted to scream on my LinkedIn: “DON’T PUT ALL YOUR SERVICES/REDUNDANCIES IN ONE REGION!” But I didn’t want to sound stupid/glib because the advice seems so elementary. (And the outage was so global that I thought the US-EAST-1 issues had spread to other regions.)

But yea…

Multi-region redundant architecture please, thanks…
 
Upvote
109 (116 / -7)

nzeid

Ars Praetorian
593
Subscriptor
Thank you, Dan. This is the first article I've read that explained the situation in detail.

I've had many discussions in several circles of coworkers and former coworkers about "multi-tenant" or "multi-region" architecture. The one thing that stands out to me is that companies that went the hard way and manage their rack servers, hypervisors, or even VMs directly have a much shorter and cheaper path to data-center redundancy than companies that went all in on AWS. For reasons that aren't entirely clear to me, companies that have made literally billions on a cookie-cutter AWS deploy with little redundancy can't seem to afford, or don't want to pay for, the additional safety. From what I've heard, using multiple IaaS providers is really fucking expensive.
 
Upvote
154 (159 / -5)

jhodge

Ars Tribunus Angusticlavius
8,726
Subscriptor++
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

Unfortunately, economics works directly against this. Unless the cost of the occasional failure is very high, it can frequently make more sense to avoid the complexity & cost of redundant systems and all the extra design/maintenance/testing/admin they require. Eliminating single points of failure is HARD and you pretty much always miss something that doesn't show up in controlled testing.
 
Upvote
184 (188 / -4)

ernestCanuck

Smack-Fu Master, in training
11
Subscriptor
I haven't seen any explanation anywhere of part of the root cause (in addition to the race condition):

As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.”

What caused the "unusually high delays" to the first enactor?
 
Upvote
174 (176 / -2)
Technically it seems less like a race condition and more like poor checking or missing process barriers when replacing the DNS tables. I guess it could be threading, but 'delays' sounds a lot more like "we are having to wait on API calls/etc." than "we have a multi-threaded process running a lot of asynchronous tasks" (which would be a race condition).
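One common guard against that kind of overwrite - offered purely as an illustration, not a claim about how AWS's enactor actually works - is to make applying a plan conditional on it being newer than whatever is already in place:

```python
import threading

class DnsTable:
    """Toy DNS record set guarded by a version check, so a slow worker
    holding an old plan can't clobber a newer one. Names are invented."""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._records = {}

    def apply_plan(self, plan_version: int, records: dict) -> bool:
        with self._lock:
            if plan_version <= self._version:   # stale plan: refuse to apply
                return False
            self._version = plan_version
            self._records = dict(records)
            return True

table = DnsTable()
table.apply_plan(2, {"db.example.internal": "10.0.0.5"})                 # applied
print(table.apply_plan(1, {"db.example.internal": "10.0.0.9"}))          # False: stale, rejected
```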
 
Upvote
17 (29 / -12)

Fatesrider

Ars Legatus Legionis
25,280
Subscriptor
“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”
* snickers *

He said "regulatory oversight".

In THIS reality.

Welp, prepare for more of this shit in the future!
 
Upvote
138 (143 / -5)

caliburn

Ars Praetorian
437
Subscriptor++
Of the ~15 hours it took to resolve this, I wonder how long it took to identify the root cause?

I'm curious because I wonder if previous Amazon RIFs might have inadvertently eliminated people who would have deep knowledge of the system and could potentially have identified the root cause faster? Something something institutional knowledge...
 
Upvote
190 (194 / -4)

el_oscuro

Ars Praefectus
3,178
Subscriptor++
Upvote
88 (89 / -1)

BoredSysAdmin

Ars Scholae Palatinae
617
Its-not-DNS.-There-is-no-way-its-DNS.-It-was-DNS.jpeg
 
Upvote
280 (280 / 0)

jnv11

Ars Scholae Palatinae
685
Automating DNS changes for load balancing seems like a bad idea in the first place...
Content delivery networks have done some version of automating DNS changes for load balancing for decades. They see the source IP addresses of incoming DNS requests from users’ ISPs’ DNS servers, look up where those source addresses are likely located, and then return responses pointing to an active CDN server close to the resolver's apparent location. Should a CDN content server fail or become heavily loaded, the CDN's DNS server will reply with another content server's IP address.
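The mechanics are roughly this (a minimal sketch; the server inventory, health flags, and region lookup below are invented for illustration, not any particular CDN's logic):

```python
# DNS-based load balancing in miniature: pick a nearby, healthy,
# lightly loaded server for each requesting resolver.
SERVERS = {
    "us-east": [{"ip": "192.0.2.10", "healthy": True, "load": 0.4},
                {"ip": "192.0.2.11", "healthy": False, "load": 0.0}],
    "eu-west": [{"ip": "198.51.100.20", "healthy": True, "load": 0.7}],
}

def region_for(resolver_ip: str) -> str:
    """Crude stand-in for a GeoIP lookup on the resolver's address."""
    return "us-east" if resolver_ip.startswith("192.") else "eu-west"

def answer(resolver_ip: str) -> str:
    """Return the least-loaded healthy server near the requesting resolver."""
    candidates = [s for s in SERVERS[region_for(resolver_ip)] if s["healthy"]]
    if not candidates:  # fail over to any healthy server anywhere
        candidates = [s for pool in SERVERS.values() for s in pool if s["healthy"]]
    return min(candidates, key=lambda s: s["load"])["ip"]

print(answer("192.0.2.53"))   # -> 192.0.2.10
```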
 
Upvote
87 (87 / 0)
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

Unfortunately, economics works directly against this. Unless the cost of the occasional failure is very high, it can frequently make more sense to avoid the complexity & cost of redundant systems and all the extra design/maintenance/testing/admin they require. Eliminating single points of failure is HARD and you pretty much always miss something that doesn't show up in controlled testing.
You have to have real business buy-in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods.
 
Upvote
49 (49 / 0)
Here I thought that the web/Internet was intended for free-flowing, decentralized information, so that if one place went down it wouldn't impact the rest of the system.

But the more the entire web depends on a few massive services, the more outages like these we'll get.

It's one of the reasons I despise Reddit: so many subjects are dependent on that single platform, where they dictate everything. I used to love going to forums, but those are very few now.
 
Upvote
82 (85 / -3)

anachronon

Ars Centurion
242
Subscriptor++
I've always wondered about the danger of systems that spawn processes/threads on demand. Is this not a situation that is considered - the "runaway" process spawn, with each new process colliding with the ones already running? Especially when the new process spawns because an existing process is having an issue? Then two have an issue, so a third begins, ad infinitum.
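The usual guard against that cascade is a hard cap on how many workers may exist at once, with backoff instead of further spawning - a toy sketch, with every name invented for illustration:

```python
import threading, time

MAX_WORKERS = 4                              # hard cap on concurrent workers
_slots = threading.BoundedSemaphore(MAX_WORKERS)

def handle(task):
    # ... the real work would go here ...
    time.sleep(0.1)

def spawn(task):
    """Spawn a worker only if a slot is free; otherwise back off instead of piling on."""
    if not _slots.acquire(blocking=False):
        print("at capacity, not spawning another worker")
        return None
    def run():
        try:
            handle(task)
        finally:
            _slots.release()                 # free the slot even if handle() blows up
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```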
 
Upvote
10 (12 / -2)

Hap

Ars Legatus Legionis
12,177
Subscriptor++
Even in my little home network, I thought I had eliminated every single point of failure. Dual ISPs to a router capable of auto failover, with a second router in shadow mode that could automatically take over if the first router failed. Perfect.

Nope. The fiber cable out of the first router to the core switch failed. The shadow router did not take over because the failure was downstream.

Then again - I have a core switch as well, with no failover there, although it's pretty easy to manually bypass.

Single points of failure are sometimes hard to identify.
 
Upvote
60 (65 / -5)
I haven't seen any explanation anywhere of part of the root cause (in addition to the race condition):

As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.”

What caused the "unusually high delays" to the first enactor?

Rookie employee plugged their vape pen into the USB port on the server to charge it, and the OS bugged out a little trying to mount it.
 
Upvote
113 (122 / -9)

Marakai

Ars Scholae Palatinae
895
Subscriptor++
"Sooooo.....how was your first code commit on your first day at Amazon...?"
Heh, from firsthand experience I can tell you that you won't be committing code on your first day.

Oh wait! No, you're right, it's Always Day One, after all. </gags in flashback>
 
Upvote
35 (36 / -1)

Bluck Mutter

Smack-Fu Master, in training
77
I've always wondered about the danger of systems that spawn processes/threads on demand. Is this not a situation that is considered - the "runaway" process spawn, with each new process colliding with the ones already running? Especially when the new process spawns because an existing process is having an issue? Then two have an issue, so a third begins, ad infinitum.

I spent a large part of my 45-year career (now retired) developing mission-critical, system-level "stuff" that spawned multiple threads, many of them interdependent.

It's just the same as in a monolithic, single-threaded process: you do X in the code (say, open a file, open a connection, do a file write, run a query, etc.), and you check that X completed OK before starting Y.

If X fails (and retrying makes sense given where you are in the overall process), you retry the process/sub-process, and if it fails again (or after some sensible number of retries) you either abend and send out an alert, or pause, send out an alert, and keep trying - say every 30 seconds, not 30 times a second.

If you do this for every event in your code, then you cover all bases/Murphy's Law and you don't get something like a race condition. Sure, not all apps (or, more specifically, sections of code) need this level of checking, but for the mission-critical stuff (especially where it has the potential to be a single point of failure) the effort pays off.
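Something like this, sketched in Python (the helper name and retry budget are invented for illustration, not anything from AWS's systems):

```python
import time

class StepFailed(Exception):
    """Raised when a step still fails after its retry budget is spent."""

def run_step(name, action, retries=3, backoff_seconds=30, alert=print):
    """Do X, check that X completed OK, retry a sensible number of times,
    and alert rather than hammering the failing thing 30 times a second."""
    for attempt in range(1, retries + 1):
        try:
            return action()                    # X completed OK; caller may start Y
        except Exception as exc:
            alert(f"{name}: attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                time.sleep(backoff_seconds)    # pause, then try again
    alert(f"{name}: abending after {retries} attempts")
    raise StepFailed(name)

# Usage: each step is checked before the next one starts.
# config = run_step("read config", lambda: open("/etc/myapp.conf").read())
```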

For Amazon not to realize that such a situation could happen (with an old-fashioned "desk check" and asking "what about ....." with multiple eyeballs) is bad.

Bluck
 
Last edited:
Upvote
58 (62 / -4)
Even in my little home network, I thought I had eliminated every single point of failure. Dual ISPs to a router capable of auto failover, with a second router in shadow mode that could automatically take over if the first router failed. Perfect.

Nope. The fiber cable out of the first router to the core switch failed. The shadow router did not take over because the failure was downstream.

Then again - I have a core switch as well, with no failover there, although it's pretty easy to manually bypass.

Single points of failure are sometimes hard to identify.
Hard to identify is certainly not the same as failing to validate that your CoB plan works.
 
Upvote
0 (2 / -2)

stefan_lec

Ars Scholae Palatinae
999
Subscriptor
Ouch, I do not envy the poor people who had to debug this. Race conditions are extremely nasty bugs to find by themselves, then add in that it's happening over a giant distributed network being heavily used by third parties, and oh yeah we need a solution in like an hour because half the frickin' web is on fire?

Yeah, there's not enough money in the world to get me to sign up for that job.
 
Upvote
92 (92 / 0)

jhodge

Ars Tribunus Angusticlavius
8,726
Subscriptor++
You have to have real business buy-in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods.
I had an HA setup at one point where we thought we'd identified and eliminated all the SPOFs for critical services, and we tested thoroughly. Then we had a real power outage, and everything worked perfectly - UPSs carried the load until the generator spun up, the ATS worked smoothly, all the various network bits (including the upstream carrier stuff) failed over, notices went out, etc. The decision was made not to activate the DR site because the redundancy built into the primary handled it.

Then the whole thing crashed and burned, because the fuel pump that moved diesel from the storage tank in the basement to the ready tank in the penthouse where the generator lived wasn't tied in to the generator. When the ready tank ran dry, down it all went.

I learned a lot about HA design that day.
 
Upvote
192 (192 / 0)
Thank you, Dan. This is the first article I've read that explained the situation in detail.

I've had many discussions in several circles of coworkers and former coworkers about "multi-tenant" or "multi-region" architecture. The one thing that stands out to me is that companies that went the hard way and manage their rack servers, hypervisors, or even VMs directly have a much shorter and cheaper path to data-center redundancy than companies that went all in on AWS. For reasons that aren't entirely clear to me, companies that have made literally billions on a cookie-cutter AWS deploy with little redundancy can't seem to afford, or don't want to pay for, the additional safety. From what I've heard, using multiple IaaS providers is really fucking expensive.

My understanding is that you run up against two classes of problem if you want to go multi-provider. The most visible, but probably ultimately secondary, one is that there's a certain amount of thumb on the scale when it comes to pricing: some of it probably justifiable at least in part (e.g., chatter between services not behind a NAT gateway in the same AWS region is free, but traffic between AWS services in two AWS regions is a per-GB charge; the details may or may not be fudged, but the story checks out that traversing an on-site switch is cheaper than hammering the WAN), and some of it seemingly deliberate behavior-shaping (ingress is free! It's a Christmas miracle! Egress is not free, and probably not because Amazon is slumming it on ADSL!).

The bigger issue (especially if you have some amount of pricing leverage or a convincing 'we literally can't cloud this unless it's multi-cloud' story) is that everyone's abstractions are a bit different, and everyone's higher-level abstractions are more different still; so if you want multiple vendors you're essentially looking at building an abstraction layer that targets sufficiently equivalent services that all of them offer, which means both extra work and potentially not being able to use some of the more abstracted services that one vendor or another does differently.

Worst case, essentially anyone playing at 'cloud' can be treated as a VPS outfit with an API you can use instead of calling your rep, but the further you go from just renting VMs, the more fiddly the differences can be.
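A toy sketch of what that abstraction layer tends to look like in practice - every name below is invented for the example, not a real multi-cloud SDK - is a lowest-common-denominator interface with one adapter per provider:

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Only operations every provider offers make it into the interface."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BlobStore):
    """Stand-in backend; real adapters would wrap each vendor's SDK."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

def backup_everywhere(stores, key, data):
    """Write the same object through every configured provider."""
    for store in stores:
        store.put(key, data)

stores = [InMemoryStore(), InMemoryStore()]
backup_everywhere(stores, "report.csv", b"col1,col2\n1,2\n")
print(stores[1].get("report.csv"))
```

The cost is exactly the one described above: the shared interface can only expose what every backend supports, so the fancier vendor-specific services stay off the table.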
 
Upvote
76 (76 / 0)
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

Unfortunately, economics works directly against this. Unless the cost of the occasional failure is very high, it can frequently make more sense to avoid the complexity & cost of redundant systems and all the extra design/maintenance/testing/admin they require. Eliminating single points of failure is HARD and you pretty much always miss something that doesn't show up in controlled testing.
Our IT dept was effectively frozen on tracking any issues for the day:
Project tracking in a cloud that DNS-routed to AWS us-east-1
Ticketing system (same)
On-call paging system (same)
If we hadn't just moved off Slack earlier in the year, we'd also have had no way to communicate with each other other than email, since we're all mostly remote.
We support a nine-hospital healthcare system in a small region.
This is why I've always been cautious about moving to the cloud.
The cost of redundancy is high.
There's a push to move all of our 300 SQL databases for research and other apps to the cloud - things that do analytics, reporting, monitor babies in the ICU, etc.
I have just enough input to advise that that would be a terrible idea, and this reinforced it for now. Also, I told them it wouldn't save us any money, because of how long we keep some of these records (26 years if it's data related to the medical record). We're already in the hundreds of TBs of storage. Let's just keep both data centers we can manage here.

Moving back from the cloud is much more costly than moving into it. That's on purpose.
 
Upvote
108 (108 / 0)

Fred Duck

Ars Tribunus Angusticlavius
7,332
reads about two Enactors enacting on the same data at the same time

Ah, yes, reminds me of the fun we've had group collaborating on shared documents. Cheers, cloud, you make everything better.

Too many companies decided they couldn't be bothered maintaining their own servers and paid for it to be Somebody Else's Problem. Some have commented in earlier articles that, well, AWS has good uptime overall, but who really wants to be at the mercy of their vendor's competence?

Even worse, when much of the Internet was inaccessible, some people had to go outside into the sunshine because there was nothing else to do! We can't have that.
 
Upvote
20 (28 / -8)
The real problem is that many companies end up relying on us-east-1 because it's the most economical place to host if you're trying to serve the US and Europe. Every single one of these companies should have been ready to scale in Ohio, but you'd be shocked how many companies aren't able to, because their microservices architecture is a bunch of bespoke bullshit glued together with popsicle sticks and can't rapidly scale up from a dormant or less-busy region.

Most people running schlocky SaaS don't care if they go down during a major AWS event in us-east. They're not that important. That many banks' online services failed is a bit... shocking. Spending too much money on stock buybacks there, C-suite?
 
Upvote
37 (37 / 0)
My staff and I were in our weekly standup Wednesday morning just after the root cause was announced. The boss hadn't seen it yet and asked if there were any updates.

Me, deadpan: "DNS."

Everyone else laughed because, well.

Me: not laughing

Everyone: "...wait, seriously?"

I have a print of "Days Since It Was DNS: 0" on my cube wall for a reason, kids.
 
Upvote
91 (91 / 0)