A single point of failure triggered the Amazon outage affecting millions

Kiddluck

Wise, Aged Ars Veteran
125
Subscriptor
Was waiting for someone to do a more technical write-up. I wanted to scream on my LinkedIn: “DON’T PUT ALL YOUR SERVICES/REDUNDANCIES IN ONE REGION!” But I didn’t want to sound stupid/glib because the advice seems so elementary. (And the outage was so global that I thought the US-EAST-1 issues had spread to other regions.)

But yea…

Multi-region redundant architecture please, thanks…
 
Upvote
109 (116 / -7)

nzeid

Ars Praetorian
593
Subscriptor
Thank you, Dan. This is the first article I've read that explained the situation in detail.

I've had many discussions in several circles of coworkers and former coworkers about "multi-tenant" or "multi-region" architecture. The one thing that stands out to me is that companies that went the hard way and manage their rack servers, hypervisors, or even VMs directly have a much shorter and cheaper path to data-center redundancy than companies that went all in on AWS. For reasons that aren't entirely clear to me, companies that have made literally billions on a cookie-cutter AWS deploy with little redundancy can't seem to afford, or don't want to pay for, the additional safety. From what I've heard, using multiple IaaS providers is really fucking expensive.
 
Upvote
154 (159 / -5)

jhodge

Ars Tribunus Angusticlavius
8,726
Subscriptor++
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

Unfortunately, economics works directly against this. Unless the cost of the occasional failure is very high, it can frequently make more sense to avoid the complexity & cost of redundant systems and all the extra design/maintenance/testing/admin they require. Eliminating single points of failure is HARD and you pretty much always miss something that doesn't show up in controlled testing.
 
Upvote
184 (188 / -4)

ernestCanuck

Smack-Fu Master, in training
11
Subscriptor
I haven't seen any explanation anywhere of part of the root cause (in addition to the race condition):

As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.”

What caused the "unusually high delays" to the first enactor?
 
Upvote
174 (176 / -2)
Technically it seems less like a race condition and more like poor checking or missing process barriers when replacing the DNS tables. I guess it could be threading, but 'delays' sounds a lot more like "we are having to wait on API calls/etc." than "we have a multi-threaded process running a lot of asynchronous tasks" (which would be a race condition).
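One common guard against that kind of overwrite - offered purely as an illustration, not a claim about how AWS's enactor actually works - is to make applying a plan conditional on it being newer than whatever is already in place:

```python
import threading

class DnsTable:
    """Toy DNS record set guarded by a version check, so a slow worker
    holding an old plan can't clobber a newer one. Names are invented."""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._records = {}

    def apply_plan(self, plan_version: int, records: dict) -> bool:
        with self._lock:
            if plan_version <= self._version:   # stale plan: refuse to apply
                return False
            self._version = plan_version
            self._records = dict(records)
            return True

table = DnsTable()
table.apply_plan(2, {"db.example.internal": "10.0.0.5"})                 # applied
print(table.apply_plan(1, {"db.example.internal": "10.0.0.9"}))          # False: stale, rejected
```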
 
Upvote
17 (29 / -12)

Fatesrider

Ars Legatus Legionis
25,280
Subscriptor
“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”
* snickers *

He said "regulatory oversight".

In THIS reality.

Welp, prepare for more of this shit in the future!
 
Upvote
138 (143 / -5)

caliburn

Ars Praetorian
437
Subscriptor++
Of the ~15 hours it took to resolve this, I wonder how long it took to identify the root cause?

I'm curious because I wonder if previous Amazon RIFs might have inadvertently eliminated people who would have deep knowledge of the system and could potentially have identified the root cause faster? Something something institutional knowledge...
 
Upvote
190 (194 / -4)

el_oscuro

Ars Praefectus
3,178
Subscriptor++
Upvote
88 (89 / -1)

BoredSysAdmin

Ars Scholae Palatinae
617
Its-not-DNS.-There-is-no-way-its-DNS.-It-was-DNS.jpeg
 
Upvote
280 (280 / 0)

jnv11

Ars Scholae Palatinae
685
Automating DNS changes for load balancing seems like a bad idea in the first place...
Content delivery networks have done some version of automating DNS changes for load balancing for decades. They see the source IP addresses of incoming DNS requests from users’ ISPs’ DNS servers, look up where those source addresses are likely located, and then return responses pointing to an active CDN server close to the resolver's apparent location. Should a CDN content server fail or become heavily loaded, the CDN's DNS server will reply with another content server's IP address.
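The mechanics are roughly this (a minimal sketch; the server inventory, health flags, and region lookup below are invented for illustration, not any particular CDN's logic):

```python
# DNS-based load balancing in miniature: pick a nearby, healthy,
# lightly loaded server for each requesting resolver.
SERVERS = {
    "us-east": [{"ip": "192.0.2.10", "healthy": True, "load": 0.4},
                {"ip": "192.0.2.11", "healthy": False, "load": 0.0}],
    "eu-west": [{"ip": "198.51.100.20", "healthy": True, "load": 0.7}],
}

def region_for(resolver_ip: str) -> str:
    """Crude stand-in for a GeoIP lookup on the resolver's address."""
    return "us-east" if resolver_ip.startswith("192.") else "eu-west"

def answer(resolver_ip: str) -> str:
    """Return the least-loaded healthy server near the requesting resolver."""
    candidates = [s for s in SERVERS[region_for(resolver_ip)] if s["healthy"]]
    if not candidates:  # fail over to any healthy server anywhere
        candidates = [s for pool in SERVERS.values() for s in pool if s["healthy"]]
    return min(candidates, key=lambda s: s["load"])["ip"]

print(answer("192.0.2.53"))   # -> 192.0.2.10
```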
 
Upvote
87 (87 / 0)
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

Unfortunately, economics works directly against this. Unless the cost of the occasional failure is very high, it can frequently make more sense to avoid the complexity & cost of redundant systems and all the extra design/maintenance/testing/admin they require. Eliminating single points of failure is HARD and you pretty much always miss something that doesn't show up in controlled testing.
You have to have real business buy-in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods.
 
Upvote
49 (49 / 0)
Here I thought that the web/Internet was intended for free-flowing, decentralized information, so that if one place went down it wouldn't impact the rest of the system.

But the more the entire web depends on a few massive services, the more outages like these we'll get.

It's one of the reasons I despise Reddit: so many subjects are dependent on that single platform, where they dictate everything. I used to love going to forums, but those are very few now.
 
Upvote
82 (85 / -3)

anachronon

Ars Centurion
242
Subscriptor++
I've always wondered about the danger of systems that spawn processes/threads on demand. Is this not a situation that is considered - the "runaway" process spawn, with each new process colliding with the ones already running? Especially when the new process spawns because an existing process is having an issue? Then two have an issue, so a third begins, ad infinitum.
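The usual guard against that cascade is a hard cap on how many workers may exist at once, with backoff instead of further spawning - a toy sketch, with every name invented for illustration:

```python
import threading, time

MAX_WORKERS = 4                              # hard cap on concurrent workers
_slots = threading.BoundedSemaphore(MAX_WORKERS)

def handle(task):
    # ... the real work would go here ...
    time.sleep(0.1)

def spawn(task):
    """Spawn a worker only if a slot is free; otherwise back off instead of piling on."""
    if not _slots.acquire(blocking=False):
        print("at capacity, not spawning another worker")
        return None
    def run():
        try:
            handle(task)
        finally:
            _slots.release()                 # free the slot even if handle() blows up
    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker
```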
 
Upvote
10 (12 / -2)

Hap

Ars Legatus Legionis
12,177
Subscriptor++
Even in my little home network, I thought I had eliminated every single point of failure. Dual ISPs to a router capable of auto failover, with a second router in shadow mode that could automatically take over if the first router failed. Perfect.

Nope. The fiber cable out of the first router to the core switch failed. The shadow router did not take over because the failure was downstream.

Then again - I have a core switch as well, with no failover there, although it's pretty easy to manually bypass.

Single points of failure are sometimes hard to identify.
 
Upvote
60 (65 / -5)
I haven't seen any explanation anywhere of part of the root cause (in addition to the race condition):

As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.”

What caused the "unusually high delays" to the first enactor?

Rookie employee plugged their vape pen into the USB port on the server to charge it, and the OS bugged out a little trying to mount it.
 
Upvote
113 (122 / -9)

Marakai

Ars Scholae Palatinae
895
Subscriptor++
"Sooooo.....how was your first code commit on your first day at Amazon...?"
Heh, from firsthand experience I can tell you that you won't be committing code on your first day.

Oh wait! No, you're right, it's Always Day One, after all. </gags in flashback>
 
Upvote
35 (36 / -1)

Bluck Mutter

Smack-Fu Master, in training
77
I've always wondered about the danger of systems that spawn processes/threads on demand. Is this not a situation that is considered - the "runaway" process spawn, with each new process colliding with the ones already running? Especially when the new process spawns because an existing process is having an issue? Then two have an issue, so a third begins, ad infinitum.

I spent a large part of my 45-year career (now retired) developing mission-critical, system-level "stuff" that spawned multiple threads, many of them interdependent.

It's just the same as in a monolithic, single-threaded process: you do X in the code (say, open a file, open a connection, do a file write, run a query, etc.), and you check that X completed OK before starting Y.

If X fails (and retrying makes sense given where you are in the overall process), you retry the process/sub-process, and if it fails again (or after some sensible number of retries) you either abend and send out an alert, or pause, send out an alert, and keep trying - say every 30 seconds, not 30 times a second.

If you do this for every event in your code, then you cover all bases/Murphy's Law and you don't get something like a race condition. Sure, not all apps (or, more specifically, sections of code) need this level of checking, but for the mission-critical stuff (especially where it has the potential to be a single point of failure) the effort pays off.
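Something like this, sketched in Python (the helper name and retry budget are invented for illustration, not anything from AWS's systems):

```python
import time

class StepFailed(Exception):
    """Raised when a step still fails after its retry budget is spent."""

def run_step(name, action, retries=3, backoff_seconds=30, alert=print):
    """Do X, check that X completed OK, retry a sensible number of times,
    and alert rather than hammering the failing thing 30 times a second."""
    for attempt in range(1, retries + 1):
        try:
            return action()                    # X completed OK; caller may start Y
        except Exception as exc:
            alert(f"{name}: attempt {attempt}/{retries} failed: {exc}")
            if attempt < retries:
                time.sleep(backoff_seconds)    # pause, then try again
    alert(f"{name}: abending after {retries} attempts")
    raise StepFailed(name)

# Usage: each step is checked before the next one starts.
# config = run_step("read config", lambda: open("/etc/myapp.conf").read())
```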

For Amazon not to realize that such a situation could happen (with an old-fashioned "desk check" and asking "what about ....." with multiple eyeballs) is bad.

Bluck
 
Last edited:
Upvote
58 (62 / -4)
Even in my little home network, I thought I had eliminated every single point of failure. Dual ISPs to a router capable of auto failover, with a second router in shadow mode that could automatically take over if the first router failed. Perfect.

Nope. The fiber cable out of the first router to the core switch failed. The shadow router did not take over because the failure was downstream.

Then again - I have a core switch as well, with no failover there, although it's pretty easy to manually bypass.

Single points of failure are sometimes hard to identify.
Hard to identify is certainly not the same as failing to validate that your CoB plan works.
 
Upvote
0 (2 / -2)

stefan_lec

Ars Scholae Palatinae
999
Subscriptor
Ouch, I do not envy the poor people who had to debug this. Race conditions are extremely nasty bugs to find by themselves, then add in that it's happening over a giant distributed network being heavily used by third parties, and oh yeah we need a solution in like an hour because half the frickin' web is on fire?

Yeah, there's not enough money in the world to get me to sign up for that job.
 
Upvote
92 (92 / 0)

jhodge

Ars Tribunus Angusticlavius
8,726
Subscriptor++
You have to have real business buy-in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods.
I had an HA setup at one point where we thought we'd identified and eliminated all the SPOFs for critical services, and we tested thoroughly. Then we had a real power outage, and everything worked perfectly - UPSs carried the load until the generator spun up, the ATS worked smoothly, all the various network bits (including the upstream carrier stuff) failed over, notices went out, etc. The decision was made not to activate the DR site because the redundancy built into the primary handled it.

Then the whole thing crashed and burned, because the fuel pump that moved diesel from the storage tank in the basement to the ready tank in the penthouse where the generator lived wasn't tied in to the generator. When the ready tank ran dry, down it all went.

I learned a lot about HA design that day.
 
Upvote
192 (192 / 0)
Thank you, Dan. This is the first article I've read that explained the situation in detail.

I've had many discussions in several circles of coworkers and former coworkers about "multi-tenant" or "multi-region" architecture. The one thing that stands out to me is that companies that went the hard way and manage their rack servers, hypervisors, or even VMs directly have a much shorter and cheaper path to data-center redundancy than companies that went all in on AWS. For reasons that aren't entirely clear to me, companies that have made literally billions on a cookie-cutter AWS deploy with little redundancy can't seem to afford, or don't want to pay for, the additional safety. From what I've heard, using multiple IaaS providers is really fucking expensive.

My understanding is that you run up against two classes of problem if you want to go multi-provider. The most visible, but probably ultimately secondary, one is that there's a certain amount of thumb on the scale when it comes to pricing: some of it probably justifiable at least in part (e.g., chatter between services not behind a NAT gateway in the same AWS region is free, but traffic between AWS services in two AWS regions is a per-GB charge; the details may or may not be fudged, but the story checks out that traversing an on-site switch is cheaper than hammering the WAN), and some of it seemingly deliberate behavior-shaping (ingress is free! It's a Christmas miracle! Egress is not free, and probably not because Amazon is slumming it on ADSL!).

The bigger issue (especially if you have some amount of pricing leverage or a convincing 'we literally can't cloud this unless it's multi-cloud' story) is that everyone's abstractions are a bit different, and everyone's higher-level abstractions are more different still; so if you want multiple vendors you're essentially looking at building an abstraction layer that targets sufficiently equivalent services that all of them offer, which means both extra work and potentially not being able to use some of the more abstracted services that one vendor or another does differently.

Worst case, essentially anyone playing at 'cloud' can be treated as a VPS outfit with an API you can use instead of calling your rep, but the further you go from just renting VMs, the more fiddly the differences can be.
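A toy sketch of what that abstraction layer tends to look like in practice - every name below is invented for the example, not a real multi-cloud SDK - is a lowest-common-denominator interface with one adapter per provider:

```python
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Only operations every provider offers make it into the interface."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BlobStore):
    """Stand-in backend; real adapters would wrap each vendor's SDK."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

def backup_everywhere(stores, key, data):
    """Write the same object through every configured provider."""
    for store in stores:
        store.put(key, data)

stores = [InMemoryStore(), InMemoryStore()]
backup_everywhere(stores, "report.csv", b"col1,col2\n1,2\n")
print(stores[1].get("report.csv"))
```

The cost is exactly the one described above: the shared interface can only expose what every backend supports, so the fancier vendor-specific services stay off the table.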
 
Upvote
76 (76 / 0)
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

Unfortunately, economics works directly against this. Unless the cost of the occasional failure is very high, it can frequently make more sense to avoid the complexity & cost of redundant systems and all the extra design/maintenance/testing/admin they require. Eliminating single points of failure is HARD and you pretty much always miss something that doesn't show up in controlled testing.
Our IT dept was effectively frozen on tracking any issues for the day:
Project tracking in a cloud that DNS-routed to AWS us-east-1
Ticketing system (same)
On-call paging system (same)
If we hadn't just moved off Slack earlier in the year, we'd also have had no way to communicate with each other other than email, since we're all mostly remote.
We support a nine-hospital healthcare system in a small region.
This is why I've always been cautious about moving to the cloud.
The cost of redundancy is high.
There's a push to move all of our 300 SQL databases for research and other apps to the cloud - things that do analytics, reporting, monitor babies in the ICU, etc.
I have just enough input to advise that that would be a terrible idea, and this reinforced it for now. Also, I told them it wouldn't save us any money, because of how long we keep some of these records (26 years if it's data related to the medical record). We're already in the hundreds of TBs of storage. Let's just keep both data centers we can manage here.

Moving back from the cloud is much more costly than moving into it. That's on purpose.
 
Upvote
108 (108 / 0)

Fred Duck

Ars Tribunus Angusticlavius
7,332
reads about two Enactors enacting on the same data at the same time

Ah, yes, reminds me of the fun we've had group collaborating on shared documents. Cheers, cloud, you make everything better.

Too many companies decided they couldn't be bothered maintaining their own servers and paid for it to be Somebody Else's Problem. Some have commented in earlier articles that, well, AWS has good uptime overall, but who really wants to be at the mercy of their vendor's competence?

Even worse, when much of the Internet was inaccessible, some people had to go outside into the sunshine because there was nothing else to do! We can't have that.
 
Upvote
20 (28 / -8)
The real problem is that many companies end up relying on us-east-1 because it's the most economical place to host if you're trying to serve the US and Europe. Every single one of these companies should have been ready to scale in Ohio, but you'd be shocked how many companies aren't able to, because their microservices architecture is a bunch of bespoke bullshit glued together with popsicle sticks and can't rapidly scale up from a dormant or less-busy region.

Most people running schlocky SaaS don't care if they go down during a major AWS event in us-east. They're not that important. That many banks' online services failed is a bit... shocking. Spending too much money on stock buybacks there, C-suite?
 
Upvote
37 (37 / 0)
My staff and I were in our weekly standup Wednesday morning just after the root cause was announced. The boss hadn't seen it yet and asked if there were any updates.

Me, deadpan: "DNS."

Everyone else laughed because, well.

Me: not laughing

Everyone: "...wait, seriously?"

I have a print of "Days Since It Was DNS: 0" on my cube wall for a reason, kids.
 
Upvote
91 (91 / 0)