A single point of failure triggered the Amazon outage affecting millions

Sadre

Ars Scholae Palatinae
1,013
Subscriptor
During the late '90s and early 2000s, "I think a router went down in x" was sometimes a nightly convo between frustrated gamers, one of whom would be on the phone with person y, who had disappeared along with a chunk of other people during a dragon-killing expedition.

The conspiracies never occurred to us, because, well, dragon-killing expedition.
 
Upvote
3 (3 / 0)

foroneus

Smack-Fu Master, in training
3
I wonder what the estimated losses would be if this had been a cyberattack; surely it would register as CMC Category 3-4 on critical-infrastructure severity scales, right? And I wonder what the estimated economic impact throughout the global supply chain would be.

This fiasco proved that "multi-region" architecture sometimes still means one region acting like Trump. Thank you, US-EAST-1, proven global-economy disruptor.
 
Upvote
4 (4 / 0)
Oh sure, I mean... HUH?

I have encountered the limit of my networking knowledge. Most of this was Greek to me.

I thought the entire fundamental structure of the Internet was intended to avoid single points of failure?
At least it was before they really quit enforcing antitrust law. Now we're in an oligopoly where two or three large corporations own the infrastructure of everything.
 
Upvote
0 (3 / -3)

Architect_of_Insanity

Ars Tribunus Militum
2,160
Subscriptor++
It's-not-DNS.-There's-no-way-it's-DNS.-It-was-DNS.jpeg
I keep this one bookmarked for such occasions: https://dnshaiku.com
 
Upvote
4 (4 / 0)
So, a "successful" test --i.e. it found the problem!
So many issues with diesel backups. The problem at my dad's company was that people kept siphoning the diesel from the tank so it was always empty when they needed it. It took them years of this to finally buy a fence to put around it.
 
Upvote
13 (14 / -1)

EllPeaTea

Ars Praefectus
12,043
Subscriptor++
Not the way I learned it. From the Wikipedia article on race conditions (highlighting mine):


Again: a time-of-check to time-of-use bug in multiprocessing is the canonical example of the error. I'm not familiar with 'split brain', but that might be a special case of a race condition.
Split brain is when you have a clustered system and you lose network connectivity such that you end up with two independent clusters.
E.g., the cluster elects a master node. At some point, some of the cluster nodes lose connectivity with the current master, and they then decide to elect a new one. So you now have two master nodes, both trying to run the system.
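To make that concrete, the usual guard against split brain is to require a quorum (a strict majority of the configured cluster) before any partition will run a master. A toy sketch in Python, with made-up node names, not any particular product's algorithm:

```python
# Toy illustration of quorum-based master election, one common defence
# against split brain. Not any particular product's algorithm.

CLUSTER_SIZE = 5                    # total nodes the cluster was configured with
QUORUM = CLUSTER_SIZE // 2 + 1      # strict majority: 3 of 5

def may_act_as_master(reachable_nodes):
    # A partition may only keep/elect a master if it can still see a strict
    # majority of the originally configured cluster.
    return len(reachable_nodes) >= QUORUM

# The network splits the cluster into a 3-node and a 2-node partition.
partition_a = {"node1", "node2", "node3"}
partition_b = {"node4", "node5"}

print(may_act_as_master(partition_a))   # True  -> this side runs the master
print(may_act_as_master(partition_b))   # False -> this side refuses to elect one
```

Since no two partitions of a 5-node cluster can both hold 3 nodes, only one side can ever claim mastership.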
 
Upvote
20 (20 / 0)

k h

Ars Centurion
389
Subscriptor
Reading the AWS post mortem document, I don't see a single point of failure. This was a distributed failure.

The system was built with a high degree of redundancy for fault tolerance, and parallelism for performance. The redundant self-healing autonomous actors interacted in a combination that the designers had not foreseen. They got in each other's way, tripped over one another, and collided in a distributed trainwreck.

As usual in systems highly engineered for reliability, the collapse came from the conjunction of multiple things going wrong. Multiple design assumptions turned out to be false, all at the same time.

Fault-tolerance mechanisms generally are designed to recover from a single-point failure. Rarely are they designed or tested to cope with multiple concurrent failures. When redundant nodes independently detect faults, and their fault recovery modes all activate and begin to interact, they are likely to drive the system into a regime that is inadequately tested (and often, untestable) resulting in behavior that takes the designers by surprise.
 
Upvote
23 (23 / 0)
Split brain is when you have a clustered system and you lose network connectivity such that you end up with two independent clusters.
E.g., the cluster elects a master node. At some point, some of the cluster nodes lose connectivity with the current master, and they then decide to elect a new one. So you now have two master nodes, both trying to run the system.
Aha, thank you. So that's something else entirely.
 
Upvote
4 (4 / 0)

TenacityOverAptitude

Ars Centurion
209
Subscriptor++
They were assuming multiple versions could run at once. That's why they first tested to make sure that the new plan was newer than what was on the system, and then pushed the update. And that's where the race condition was, because the time-of-check to time-of-update could be any arbitrary amount of time, which wasn't accounted for in the design. If they wanted to use that method, the update needed to be an atomic operation; replace the old plan if the new one is newer, succeed or fail in a single operation.

Atomic operations (called transactions) exist in at least PostgreSQL. I imagine most advanced databases have them, but my DB knowledge is extremely superficial.

I worked implementing relational databases for most of my career, starting in 1982. Every one of them had atomic transactions. After a while, most allowed you to say that you didn't need full ACID behavior: atomic, consistent, isolated, and durable transactions. That generally increased concurrency, and also application complexity, as “weird” things could be observed.

The “relaxed consistency” models used in cloud databases give me the willies. I'm amazed it works as well as it does. Lost or conflicting updates are to be expected.
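For what it's worth, the "replace the old plan only if the new one is newer" idea described above is exactly the kind of thing a single atomic statement handles in a relational database. A minimal sketch (table and column names invented; this is an illustration, not AWS's actual mechanism):

```python
# Sketch: a version-guarded update, so the "is this newer?" check and the write
# happen in one atomic statement. Table/column names are hypothetical; this is
# an illustration of closing the time-of-check/time-of-use gap, not AWS's code.
import psycopg2

def apply_plan_if_newer(conn, region, new_version, new_plan):
    with conn:                          # commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(
                """
                UPDATE dns_plans
                   SET version = %s, plan = %s
                 WHERE region = %s AND version < %s
                """,
                (new_version, new_plan, region, new_version),
            )
            return cur.rowcount == 1    # 0 means an equal-or-newer plan was already there

conn = psycopg2.connect("dbname=example")
if not apply_plan_if_newer(conn, "us-east-1", 42, '{"endpoints": []}'):
    print("stale plan, skipping")
```

The whole check-and-replace rides on one statement, so there is no window for a second enactor to sneak in between the check and the write.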
 
Upvote
13 (13 / 0)

agentofN0thing

Smack-Fu Master, in training
2
Why are we blaming customers here? The customers in us-east-1 were the most impacted, yes, but the fault here is totally on Amazon. Why were customers in Germany, the UK, and basically the whole world impacted by a failure in a US region?

This basically showed, once again, that while AWS reps are selling you multi-region, multi-AZ, and a bunch of other stuff for redundancy, their internal core services are still pinned to the us-east-1 region, and if that region goes down, it doesn't matter where you are or what your redundancy setup is: you are going down.
This largely comes down to a few things:
  1. US-EAST-1 has certain control planes available to it that are not available in other regions, thanks to its status as the official "govcloud." This makes services hosted in EAST-1 easier to maintain and control.
  2. Geographic proximity to the highest-density population in North America, along with reasonable proximity to Europe via sub-sea cabling, makes it advantageous for latency-sensitive applications serving the highest concentration of end users.
  3. Because the datacenters in East are more established, there are fewer upfront infrastructure deployment costs to recoup, which, combined with a more favorable regulatory framework for those datacenters, means lower costs to host applications and services there.
For companies looking to start their journey into the cloud, these factors make US-EAST-1 an attractive first step, and that convenience can become a trap if they are not also planning for a distributed approach that guides their future architectural decisions.
 
Upvote
2 (3 / -1)
Atomic operations (called transactions) exist in at least PostgreSQL. I imagine most advanced databases have them, but my DB knowledge is extremely superficial.
Transactions have been a thing for DB2, Oracle, and Microsoft SQL Server for as long as I can remember (and my professional career dates back to the late '90s). I remember that there were other DBMS options back then, but I can't be bothered looking them up.

The problem here is that DNS doesn't have a formal specification for atomic updates. There are various ways to update DNS zone records, but doing them at AWS scale is not an off-the-shelf problem. So AWS has its own specialised method for this… and now they've discovered one of the problems with their method.

Let us now repeat the sysadmin mantra: “All software sucks. All hardware sucks.”
 
Upvote
8 (8 / 0)
Transactions have been a thing for DB2, Oracle, and Microsoft SQL Server for as long as I can remember (and my professional career dates back to the late '90s). I remember that there were other DBMS options back then, but I can't be bothered looking them up.

The problem here is that DNS doesn't have a formal specification for atomic updates. There are various ways to update DNS zone records, but doing them at AWS scale is not an off-the-shelf problem. So AWS has its own specialised method for this… and now they've discovered one of the problems with their method.

Let us now repeat the sysadmin mantra: “All software sucks. All hardware sucks.”
The DNS protocol here is not really relevant to the question of transactions. The issue was with whatever database was being used to store the network plans from which the DNS answers are derived.
I don't know what AWS uses for that internally, but I wouldn't be surprised if it's stored in one of their own DynamoDB tables. And DynamoDB doesn't have the same level of transaction support as a full relational database (despite what an AI summary tried to tell me when I did a quick search).

A full relational/SQL database allows you to start a transaction, then perform a set of read and write operations, and then complete the transaction. If the transaction succeeds then all operations inside succeed and no conflicts with data from other clients of the database have been allowed to occur.

DynamoDB and other more relaxed databases drop the fancy multi-statement transaction control and shift the burden of managing data consistency from the database to the application. The benefit they get is that they can be faster and (much) more scalable.
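For contrast, the closest idiom in DynamoDB is a conditional write on a single item: the version guard is enforced atomically for that one item, but there is no multi-statement transaction in the relational sense (setting aside the newer, much more limited TransactWriteItems API). A hedged sketch with boto3; the table name, key schema, and attribute names are all made up:

```python
# Sketch: a single-item conditional write in DynamoDB. The version check is
# enforced atomically for that one item, but there's no multi-statement
# read/modify/write transaction in the relational sense.
# Table name, key schema, and attribute names are all made up.
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("dns_plans")   # hypothetical table

def apply_plan_if_newer(region, new_version, new_plan):
    try:
        table.put_item(
            Item={"region": region, "version": new_version, "plan": new_plan},
            # Succeed only if no plan exists yet, or the stored one is older.
            ConditionExpression="attribute_not_exists(#v) OR #v < :new",
            ExpressionAttributeNames={"#v": "version"},
            ExpressionAttributeValues={":new": new_version},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False   # an equal-or-newer plan is already stored
        raise
```

Anything spanning multiple items or multiple tables is on the application to coordinate, which is exactly the burden-shifting described above.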
 
Upvote
10 (10 / 0)
The issue was with whatever database was being used to store the network plans from which the DNS answers are derived.
Kinda sorta.

It's a combination of the database (which probably isn't an SQL database) storing the network plans; how those plans are distributed to workers for application; and how the application of the plans is done. Doing all of this successfully at AWS scale requires atomicity at the point of applying the plans: making sure that your information is newer than what's already applied and that nothing else is interfering with it; and, when you delete the old information, making sure that your newer information hasn't been overridden by older information.

How they achieve that is entirely up to AWS. But it's very clear that, at least to date, they didn't fully achieve it.

Robustness in this sort of stuff is hard. It gets even harder when you're dealing with the sort of scale that AWS has.

It genuinely would not surprise me to learn, at some point in the distant future, that a fix applied to resolve one issue ends up creating another (or regressing, so that an older issue that had previously been fixed reappears). Especially since the sort of institutional memory that would avoid such a scenario is being tossed out every time AWS decides to "right-size" its workforce.
 
Upvote
9 (9 / 0)

kaleberg

Ars Scholae Palatinae
1,266
Subscriptor
If I read the incident report correctly, one of the problems here is that the DNS updating system itself relies on the DNS it's updating to be working correctly. So if, as happened here, it blows up DNS, it then can't fix DNS because it can't find what it's supposed to update.

I suspect that's going to be one of the things they'll need to rethink. I've been trying to think of a solution I really like, and haven't come up with anything really good. One thought is a two-tier system, where the master servers just track DNS servers that should get updated, probably out of a different domain. But Amazon doesn't want to do things manually, so they'll probably need an update script for the master servers, and they might end up right back in the same problem again.

Maybe they'll need to make a 'bootstrap' process that can automatically roll out the changes needed to hard start the whole DNS server complex from zero. But then they might lose track of all the old virtual servers and the old virtual services, which is another mess.....

I haven't spent a lot of time with this or anything, but at least to a two-minute think, this problem looks brutally hard to solve in a truly bulletproof way, or at least a very-fast-recovery-from-disaster way.
They're using DNS to rewire the configuration as part of load balancing. Shouldn't that be two layers? DNS should be about which piece of hardware is where in IP space. There should be another logical layer dealing with the roles required in the distributed system. I'm amazed anyone is doing this.

This is OS/360 logical/physical stuff.
 
Upvote
1 (4 / -3)

SeanJW

Ars Legatus Legionis
11,947
Subscriptor++
Technically seems less like a race condition and more like poor checking or process barriers on replacing the DNS tables. I guess it could be threading, but 'delays' seems a lot more like "we are having to wait on API calls/etc" rather than "we have a multi-threaded process running a lot of asynchronous tasks" (which would be a race condition).

It's absolutely a race. Between the first enactor's timestamp check and its actual action (a TOCTOU race), the second enactor snuck in, did its thing, and all the conditions the first enactor had relied upon were voided.
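For anyone who hasn't been bitten by one, here's a toy TOCTOU race in Python; the sleep just widens the check-to-use window so the stale writer reliably wins. This is entirely made up for illustration, not AWS's code:

```python
# Toy TOCTOU race: two "enactors" each check the current version, then apply
# their plan after a delay. The stale enactor can overwrite the newer plan
# because its earlier check is no longer valid by the time it writes.
import threading
import time

state = {"version": 0, "plan": "initial"}

def enactor(new_version, new_plan, delay):
    if new_version > state["version"]:   # time of check
        time.sleep(delay)                # ...arbitrary delay before acting...
        state["version"] = new_version   # time of use: the check is now stale
        state["plan"] = new_plan

slow_old = threading.Thread(target=enactor, args=(1, "old plan", 0.2))
fast_new = threading.Thread(target=enactor, args=(2, "new plan", 0.0))
slow_old.start(); fast_new.start()
slow_old.join(); fast_new.join()

print(state)   # likely {'version': 1, 'plan': 'old plan'} -- the stale plan won
```

The slow enactor's check was true when it ran, but by the time it acted, the world had moved on; that is the race, regardless of whether the delay came from threads, API calls, or anything else.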
 
Upvote
7 (7 / 0)
Here I thought that the web/internet was intended for free-flowing information without being centralized, so that if one place went down, it wouldn't impact the rest of the system.

But the more the entire web becomes dependent on a few massive services, the more outages like these we'll get.

One of the reasons I despise Reddit: having so many subjects dependent on a single platform where they dictate everything. I used to love going to forums, but those are very few now.
And really, in terms of function Reddit is like Usenet ported to the web. But the nature of the web very strongly favors centralization rather than Usenet's federated approach.
 
Upvote
2 (2 / 0)

SeanJW

Ars Legatus Legionis
11,947
Subscriptor++
Content delivery networks have done some version of automating DNS changes for load balancing for decades. They see the source IP addresses of incoming DNS requests from users’ ISP’s DNS servers, look up where such DNS requests’ source IP addresses are likely located, and then send responses to the requests that point the requests to an active CDN server that is close to the DNS server’s apparent location. Should a CDN content server fail or get highly loaded, the CDN’s DNS server will then reply with another CDN content server’s IP address.

They're even more complicated than that. They'll do the DNS change, but DNS is cached, and they can't prevent that (TTLs are regularly ignored). So they'll also push a route advertisement change so that anything using the old IPs goes to the new destination as well.
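Roughly, the DNS-answer side of that boils down to something like this toy selection logic (the geo-IP mapping, server list, and health flags are all invented for illustration):

```python
# Toy version of a CDN's DNS answer selection: map the resolver's apparent
# location to the nearest content server that is currently healthy.
# The geo-IP mapping, server list, and health flags are all invented.
RESOLVER_REGION = {"203.0.113.0/24": "eu", "198.51.100.0/24": "us-east"}
SERVERS = {
    "eu":      [{"ip": "192.0.2.10", "healthy": True}],
    "us-east": [{"ip": "192.0.2.20", "healthy": False},   # failed or overloaded
                {"ip": "192.0.2.21", "healthy": True}],
}

def dns_answer(resolver_prefix):
    region = RESOLVER_REGION.get(resolver_prefix, "us-east")
    for server in SERVERS[region]:
        if server["healthy"]:
            return server["ip"]
    # Fall back to any healthy server anywhere.
    return next(s["ip"] for pool in SERVERS.values() for s in pool if s["healthy"])

print(dns_answer("203.0.113.0/24"))    # 192.0.2.10 (nearest)
print(dns_answer("198.51.100.0/24"))   # 192.0.2.21 (nearest healthy one)
```

The real thing layers in load data, TTL games, and the BGP tricks mentioned above, but the core decision is this: nearest healthy server wins.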
 
Upvote
3 (3 / 0)

SeanJW

Ars Legatus Legionis
11,947
Subscriptor++
They were assuming multiple versions could run at once. That's why they first tested to make sure that the new plan was newer than what was on the system, and then pushed the update. And that's where the race condition was, because the time-of-check to time-of-update could be any arbitrary amount of time, which wasn't accounted for in the design. If they wanted to use that method, the update needed to be an atomic operation; replace the old plan if the new one is newer, succeed or fail in a single operation.

Atomic operations (called transactions) exist in at least PostgreSQL. I imagine most advanced databases have them, but my DB knowledge is extremely superficial.

Distributed databases generally don't - they're eventually consistent, and you design around it, because global locking is (1) hard and (2) a massive latency-inducing issue.

Edit: That's not to say global locking (and ACID, for that matter) isn't available. But it's the last choice you reach for, for very good reasons.
 
Last edited:
Upvote
9 (9 / 0)

qibhom

Wise, Aged Ars Veteran
139
Subscriptor
You have to have real business buy-in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods.

When I worked for the Federal Courts, we were required by Congress to have failovers and a valid DR scenario. Our vendor was shocked when I insisted on literally pulling the fiber cable out of the server.

The Court libraries were warned in advance, we picked a slow time, and everyone was all for this. We learned a ton about failure and recovery, and so did the vendor. Later, the vendor thanked me.

Now, it isn't the end of the world if the Federal Court library system goes down, but everyone from the vendor to front line staff was prepared. As the sysadmin, I felt comfortable knowing exactly what had to happen if a catastrophic failure took place. So did my Courts library staff.

Was it a pain? Certainly. But the confidence it gave all of us in our DR procedures was priceless.
 
Upvote
19 (19 / 0)
Fixing this issue means spending even more time and money validating software, which is not something the stockholders want to hear about. Of course, what if all the users sue for losses, which will cost more money? It will be all about the balance sheet.
No. You can sue AWS for losses, just like you can sue anyone for anything.

But, you're not going to win. Also, AWS is free to drop you as a customer if you sue them. They may not though, because they know you'll lose, and AWS is always happy to take anyone's money.

You might get credit back, but AWS's SLAs are less concrete than they appear in their marketing material.

That said, AWS is fairly good at comping folks when they genuinely mess up, regardless of SLA terms. But they have to mess up in a noteworthy way.

And, in the end, your other options are not great:

  • MS cloud is a joke, unless you really don't care about security, at all. Or actual elasticity at scale.
  • Google has some nice (often superior) products (e.g. Spanner, which has zero equivalent offerings elsewhere). But when anything breaks - good luck - their support is pretty much a bunch of "works for me, you must have done something wrong" tech bros. They have some issues scaling too. I've had them say they may not have enough BigTable nodes unless you reserve ahead of time...
  • Do it yourself. Unless you have a technically trivial system (e.g. if a replicated relational DB and a k8s cluster suffice, it is trivial)... good luck doing it yourself.
 
Upvote
1 (1 / 0)

dzid

Ars Centurion
3,373
Subscriptor
I guess we now know what data center a foreign actor will strike first if it comes to a large scale politically motivated cyber attack.
I'd think a serious peer or near-peer cyberattack would be a bit more nuanced, at least at first. Perhaps sow disinformation in a variety of places, deliberately contradictory and intended to confuse and/or generate anxiety. Then sever or disrupt pre-targeted channels of communication, and so on. Poison or otherwise interfere with GPS. I figure it would be ugly, and not simply aim for bringing down data centers.

Ed: spelling
 
Upvote
0 (0 / 0)
We are 100% on AWS and we survived this event just fine, without a single hiccup. It's not even that hard: create a replicated database in another region and have a simple DNS-based failover ready.

We also don't run anything in us-east-1, on purpose.
And cross-region replication is consistent and performant for which service, exactly?

It's easy for trivial systems. Which most are, but that's because they're trivial.

Getting it right for something that needs consistency and (scalable) performance is a bit harder.
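To be fair, the DNS half of the quoted setup really is the easy part. A hedged sketch of what "simple DNS-based failover" can look like with Route 53 failover records via boto3; the zone ID, health check ID, hostname, and IPs are all placeholders, and this says nothing about the hard replication part:

```python
# Sketch: Route 53 failover routing. Route 53 answers with the PRIMARY record
# while its health check passes and switches to SECONDARY when it fails.
# Zone ID, health check ID, hostname, and IPs below are placeholders.
import boto3

route53 = boto3.client("route53")

def failover_record(identifier, role, ip, health_check_id=None):
    rrset = {
        "Name": "api.example.com.",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,                  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}

route53.change_resource_record_sets(
    HostedZoneId="Z0000000PLACEHOLDER",
    ChangeBatch={"Changes": [
        failover_record("primary", "PRIMARY", "192.0.2.10", "hc-placeholder-id"),
        failover_record("secondary", "SECONDARY", "198.51.100.10"),
    ]},
)
```

The catch is that DNS failover says nothing about whether the database you failed over to is actually consistent with the one you left behind, which is the point being made above.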
 
Upvote
2 (2 / 0)
They were assuming multiple versions could run at once. That's why they first tested to make sure that the new plan was newer than what was on the system, and then pushed the update. And that's where the race condition was, because the time-of-check to time-of-update could be any arbitrary amount of time, which wasn't accounted for in the design. If they wanted to use that method, the update needed to be an atomic operation; replace the old plan if the new one is newer, succeed or fail in a single operation.

Atomic operations (called transactions) exist in at least PostgreSQL. I imagine most advanced databases have them, but my DB knowledge is extremely superficial.
Postgres does not scale to DynamoDB scale.
 
Upvote
4 (4 / 0)
I don't know - I am in healthcare, and we deal with single points of failure that fail all of the time, and we just deal with it (and yes, sometimes people die - or worse). I am not saying that it is good, but there is only so much that we can do - even with endless resources, which we clearly don't have - to mitigate against everything....

Heck, we only have 1 sun keeping the solar system warm; what is our back-up plan for when that back-up generator gets unplugged?
Yeah, ppl are mad, but you're not wrong - it's like the famous "transaction 101" problem: a withdrawal from an ATM.

Turns out banks do none of that transaction business for ATM withdrawals - they just cap and rate-limit withdrawals, and settle them in a very eventually consistent way.

When you get into actual life-safety (or finance) systems, the solutions are highly regulated, and often not what distributed system folks teach.
 
Upvote
2 (2 / 0)
One company I worked for, we had to do 1000 hour approval tests on certain electronic components. No worries, UPS and big Diesel generator which, in fact, you could see from one side of the lab block.

We were told there was going to be a live system test, power would go off, UPS come on, then Diesel, then power would be restored. So, we're ready for that.
Power goes off, the local UPS comes online, then a few seconds later the main UPS takes over. The diesel starts up. After a few minutes it produces white smoke. Then there is a tremendous bang, the diesel stops, the UPS resumes - and runs down before the clown in charge gets to restore mains power.

It does help to remember, when the diesel is overhauled, to put in the new oil after draining the old oil.
Oof. I had a friend who rebuilt an old 1930s car. Clearly had the knowledge and the skills - he even managed to finagle ABS into it.

Spent a couple years on it.

Whiffed at the end by forgetting the oil.

Was too depressed to go on after that.

Checklists are good. Having someone check your work, no matter how good you are - even better.
 
Upvote
10 (10 / 0)
I'd agree that the race would be closer to the root cause of the outage. However, a check to prevent more than one updater from being active simultaneously should have prevented further propagation of the damage. That is, unless multiple updater instances were an intentional part of the design, with some sort of mechanism to share state between them?

Admittedly, there is a great deal about this system of which I am woefully ignorant. 🤷‍♂️
Yes, for this system, at AWS scale, you want multiple updaters. Seamless failover, performance, etc.
 
Upvote
2 (2 / 0)
Why are we blaming customers here? The customers in us-east-1 were the most impacted, yes, but the fault here is totally on Amazon. Why were customers in Germany, the UK, and basically the whole world impacted by a failure in a US region?

This basically showed, once again, that while AWS reps are selling you multi-region, multi-AZ, and a bunch of other stuff for redundancy, their internal core services are still pinned to the us-east-1 region, and if that region goes down, it doesn't matter where you are or what your redundancy setup is: you are going down.
No.

Customers - of products running on AWS - saw outages only if those products were us-east-1 only.

We have multiple regions.

Yes, us-east-1 had outages. Other regions were fine.

Our failover was fine.
Did we have systems that were us-east-1 only because they weren't that critical? Yes.
Did they break? Yes.
Was recovery annoying? Yes.

Did our main systems go down? No.
And a lot of our stuff is not even really properly multi-region.

This did not take down their "core systems" because they are a giant provider that operates global, redundant, distributed systems. There are no singular "core systems".

AWS has many flaws. Being unable to tolerate regional failures is not one of them.

If AWS was as you described, Amazon would have gone down. It didn't.
 
Upvote
7 (7 / 0)
The cloud is fragile; to make matters worse, you really need to use multiple vendors to be safe. There was the pension provider in Australia who had their account accidentally deleted by their cloud provider, with total loss of their data on that provider. Oops.

Due to differences in how services are provided, it would be extremely difficult to have fault tolerance across multiple vendors.

Using cloud providers creates a lot of risks that are poorly managed by most companies.

Everything works well until it doesn’t.
Software systems are fragile.

One good thing about AWS is their support doesn't have the access to delete all your files.

One bad thing about AWS is their support doesn't have access. So I've been reduced to screenshotting CloudWatch metrics on a managed service because even our TAM can't see them.

Google, which deleted aforementioned pension fund, does things their Googly way. And did have access.

Also, note, in that case, the data was recovered.

And before the cloud, I heard plenty of stories about a stray space in an rm -rf / path/, catting /dev/random to your hard disk by accident, or whoopsies with a WHERE clause on a DELETE statement.

You need your infra maintained by a knowledgeable, competent, and humble team. How you get that team may vary. And test your backups. Always test your backups.
 
Upvote
7 (7 / 0)

Erbium168

Ars Centurion
2,841
Subscriptor
Oof. I had a friend who rebuilt an old 1930s car. Clearly had the knowledge and the skills - he even managed to finagle ABS into it.

Spent a couple years on it.

Whiffed at the end by forgetting the oil.

Was too depressed to go on after that.

Checklists are good. Having someone check your work, no matter how good you are - even better.
Best advice I got was to write "Check the oil" across the speedometer in wax pen.
 
Upvote
7 (7 / 0)

koolraap

Ars Tribunus Militum
2,236
No. You can sue AWS for losses, just like you can sue anyone for anything.

But, you're not going to win. Also, AWS is free to drop you as a customer if you sue them. They may not though, because they know you'll lose, and AWS is always happy to take anyone's money.

You might get credit back, but AWS's SLAs are less concrete than they appear in their marketing material.

That said, AWS is fairly good at comping folks when they genuinely mess up, regardless of SLA terms. But they have to mess up in a noteworthy way.

And, in the end, your other options are not great:

  • MS cloud is a joke, unless you really don't care about security, at all. Or actual elasticity at scale.
  • Google has some nice (often superior) products (e.g. Spanner, which has zero equivalent offerings elsewhere). But when anything breaks - good luck - their support is pretty much a bunch of "works for me, you must have done something wrong" tech bros. They have some issues scaling too. I've had them say they may not have enough BigTable nodes unless you reserve ahead of time...
  • Do it yourself. Unless you have a technically trivial system (e.g. if a replicated relational DB and a k8s cluster suffice, it is trivial)... good luck doing it yourself.
Dusfud, how long have you been working? 10 years, tops?
 
Upvote
2 (2 / 0)

just.Joe

Smack-Fu Master, in training
76
So many issues with diesel backups. The problem at my dad's company was that people kept siphoning the diesel from the tank, so it was always empty when they needed it. It took years of this before they finally put a fence around it.
That's not a problem with diesel backup, that's a problem with people stealing.
Diesel backup is as reliable as backup can be. Maintenance? Oh yes, it has to be done. Testing regularly? 100% necessary. But it's cheap and it works.
The two cases being shared here show 1. a design problem with the pump not being connected (not a diesel generator problem) and 2. a giant fuckup with maintenance.
None of this is "a diesel generator problem" as you make it sound. As far as backup emergency electric systems go, the diesel backup generator is the best we have at the moment, has been for 100 years, and will be for another 100 :)
 
Last edited:
Upvote
0 (1 / -1)