A DNS manager in a single region of Amazon's sprawling network touched off a 16-hour debacle.
At least it was before they really quit enforcement of antitrust law. Now we're in an oligopoly where two or three large corporations own the infrastructure of everything.

Oh sure, I mean... HUH?
I have encountered the limit of my networking knowledge. Most of this was Greek to me.
I thought the entire fundamental structure of the Internet was intended to avoid single points of failure?
So many issues with diesel backups. The problem at my dad's company was that people kept siphoning the diesel from the tank, so it was always empty when they needed it. It took them years of this to finally buy a fence to put around it.

So, a "successful" test -- i.e., it found the problem!
"Sooooo.....how was your first code commit on your first day at Amazon...?"
Split brain is when you have a clustered system, and you lose network connectivity such that you then get 2 independent clusters.

Not the way I learned it. From the Wikipedia article on race conditions (highlighting mine):
Again: a time-of-check to time-of-use bug in multiprocessing is the canonical example of the error. I'm not familiar with 'split brain', but that might be a special case of a race condition.
Aha, thank you. So that's something else entirely.

Split brain is when you have a clustered system, and you lose network connectivity such that you then get 2 independent clusters.
E.g. the cluster elects a master node. At some point, some of the cluster nodes lose connectivity with the current master, and they then decide to elect a new one. So you now have 2 master nodes both trying to run the system.
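As a toy illustration of that failure mode (not anything resembling AWS's actual election code), the difference between a naive "can't see the master, elect a new one" rule and a majority-quorum rule looks roughly like this; the five-node cluster and the partition are invented for the example:

```python
# Toy model: 5 nodes, a network partition splits them into {A, B} and {C, D, E}.
# Rule 1 (naive): any group that cannot reach the current master elects its own.
# Rule 2 (quorum): a group may only elect a master if it contains a strict
# majority of the full cluster, so at most one side can ever hold a master.

CLUSTER = {"A", "B", "C", "D", "E"}

def naive_elect(partition, current_master):
    # Elect a new master whenever the old one is unreachable; both sides do this.
    return current_master if current_master in partition else sorted(partition)[0]

def quorum_elect(partition, current_master):
    # Only a majority-sized partition is allowed to (re)elect; the minority stalls.
    if len(partition) * 2 <= len(CLUSTER):
        return None  # no quorum: refuse to act rather than risk two masters
    return current_master if current_master in partition else sorted(partition)[0]

side1, side2 = {"A", "B"}, {"C", "D", "E"}
print(naive_elect(side1, "A"), naive_elect(side2, "A"))    # A C    -> two masters (split brain)
print(quorum_elect(side1, "A"), quorum_elect(side2, "A"))  # None C -> only one master survives
```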
Many who consider themselves concurrency experts have made, and will continue to make, this blunder.

So, an undergrad level concurrency blunder.
They were assuming multiple versions could run at once. That's why they first tested to make sure that the new plan was newer than what was on the system, and then pushed the update. And that's where the race condition was, because the gap between the time-of-check and the time-of-update could be any arbitrary amount of time, which wasn't accounted for in the design. If they wanted to use that method, the update needed to be an atomic operation: replace the old plan if the new one is newer, and succeed or fail in a single operation.
Atomic operations (called transactions) exist in at least PostgreSQL. I imagine most advanced databases have them, but my DB knowledge is extremely superficial.
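To make the two comments above concrete, here is a minimal sketch of the check-then-update race versus a single conditional update, using SQLite purely for illustration; the table, versions, and plan payloads are invented and this is not AWS's schema or code:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE plan (id INTEGER PRIMARY KEY CHECK (id = 1), version INTEGER, body TEXT)")
db.execute("INSERT INTO plan VALUES (1, 100, 'plan-100')")

def racy_apply(version, body):
    # Time-of-check ... (another updater can slip in between) ... time-of-update.
    (current,) = db.execute("SELECT version FROM plan WHERE id = 1").fetchone()
    if version > current:
        db.execute("UPDATE plan SET version = ?, body = ? WHERE id = 1", (version, body))
        db.commit()

def atomic_apply(version, body):
    # One statement: the comparison and the write happen together, so a stale
    # plan can never overwrite a newer one no matter how updaters interleave.
    db.execute(
        "UPDATE plan SET version = ?, body = ? WHERE id = 1 AND version < ?",
        (version, body, version),
    )
    db.commit()

atomic_apply(101, "plan-101")   # applied: 101 > 100
atomic_apply(99, "plan-99")     # ignored: stale
print(db.execute("SELECT version, body FROM plan").fetchone())  # (101, 'plan-101')
```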
This largely comes down to a few things:

Why are we blaming customers here? The customers in us-east-1 were the most impacted, yes, but the fault here is totally on Amazon. Why were customers in Germany, the UK, and basically the whole world impacted by a failure in a US region?
This basically showed, once again, that while AWS reps are selling you multi-region, multi-AZ and a bunch of stuff for redundancy, their internal core services are still pinned to the us-east-1 region, and if that region goes down, it doesn't matter where you are or what your redundancy setup is, you are going down.
Transactions have been a thing for DB2, Oracle, and Microsoft SQL Server for as long as I can remember (and my professional career dates back to the late 90s). I remember that there were other DBMS options back then, but I can't be bothered looking them up.

Atomic operations (called transactions) exist in at least PostgreSQL. I imagine most advanced databases have them, but my DB knowledge is extremely superficial.
The DNS protocol here is not really relevant to the question of transactions. The issue was with whatever database was being used to store the network plans from which the DNS answers are derived.

Transactions have been a thing for DB2, Oracle, and Microsoft SQL Server for as long as I can remember (and my professional career dates back to the late 90s). I remember that there were other DBMS options back then, but I can't be bothered looking them up.
The problem here is that DNS doesn't have a formal specification for atomic updates. There are various ways to update DNS zone records, but doing them at AWS scale is not an off-the-shelf problem. So AWS has its own specialised method for this… and now they've discovered one of the problems with their method.
Let us now repeat the sysadmin mantra: “All software sucks. All hardware sucks.”
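For what a conditional zone update can look like at ordinary, non-AWS scale, RFC 2136 dynamic updates let a client attach prerequisites that the server checks before applying the change. A sketch with the dnspython library, where the zone, records, and server address are placeholders (a real deployment would also want TSIG authentication):

```python
import dns.update
import dns.query
import dns.rcode

# Zone name, record data, and server IP below are placeholders, not anything of AWS's.
update = dns.update.Update("example.com")

# Prerequisite: only apply if the old A record is still exactly what we expect,
# so a concurrent change makes this update fail instead of being silently overwritten.
update.present("www", "A", "192.0.2.10")
update.replace("www", 300, "A", "192.0.2.20")

response = dns.query.tcp(update, "198.51.100.1", timeout=5)
print(dns.rcode.to_text(response.rcode()))  # NOERROR on success, NXRRSET if the prerequisite failed
```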
Kinda sorta.

The issue was with whatever database was being used to store the network plans from which the DNS answers are derived.
They're using DNS to rewire the configuration as part of load balancing. Shouldn't that be two layers? DNS should be about which piece of hardware is where in IP space. There should be another logical layer dealing with the roles required in the distributed system. I'm amazed anyone is doing this.

If I read the incident report correctly, one of the problems here is that the DNS updating system itself relies on the DNS it's updating to be working correctly. So if, as happened here, it blows up DNS, it then can't fix DNS because it can't find what it's supposed to update.
I suspect that's going to be one of the things they'll need to rethink. I've been trying to think of a solution I really like, and haven't come up with anything really good. One thought is a two-tier system, where the master servers just track DNS servers that should get updated, probably out of a different domain. But Amazon doesn't want to do things manually, so they'll probably need an update script for the master servers, and they might end up right back in the same problem again.
Maybe they'll need to make a 'bootstrap' process that can automatically roll out the changes needed to hard start the whole DNS server complex from zero. But then they might lose track of all the old virtual servers and the old virtual services, which is another mess.....
I haven't spent a lot of time with this or anything, but at least to a two-minute think, this problem looks brutally hard to solve in a truly bulletproof way, or at least a very-fast-recovery-from-disaster way.
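To make the circular-dependency worry concrete, one common way out (sketched here as pure speculation, not as anything AWS actually does) is for the updater to carry a static seed list it can fall back on when the zone it manages can no longer be resolved:

```python
import socket

# Hypothetical static seed list, shipped with the updater so it can still find
# its DNS servers even when the zone it manages is unresolvable.
SEED_ADDRESSES = {
    "dns-fleet.internal.example": ["10.0.1.10", "10.0.2.10"],
}

def resolve_targets(hostname):
    """Prefer live DNS, but fall back to the baked-in seeds if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, 53, proto=socket.IPPROTO_UDP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror:
        return SEED_ADDRESSES.get(hostname, [])

print(resolve_targets("dns-fleet.internal.example"))  # falls back to ['10.0.1.10', '10.0.2.10']
```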
Technically seems less like a race condition and more like poor checking or process barriers on replacing the DNS tables. I guess it could be threading, but 'delays' seems a lot more like "we are having to wait on API calls/etc" rather than "we have a multi-threaded process running a lot of asynchronous tasks" (which would be a race condition).
And really, in terms of function Reddit is like Usenet ported to the web. But the nature of the web very strongly favors centralization rather than Usenet's federated approach.

Here I thought that the web/internet was intended for free-flowing information without being centralized, so that if one place went down, it wouldn't impact the rest of the system.
But the more the entire web depends on a few massive services, the more we'll get outages like these.
One of the reasons I despise Reddit: so many subjects are dependent on this single platform, where they dictate everything. I used to love going to forums, but those are very few now.
Content delivery networks have done some version of automating DNS changes for load balancing for decades. They see the source IP addresses of incoming DNS requests from users' ISPs' resolvers, look up where those addresses are likely located, and then answer with an active CDN server close to the resolver's apparent location. Should a CDN content server fail or get heavily loaded, the CDN's DNS server will then reply with another CDN content server's IP address.
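A stripped-down sketch of that selection logic, where the node names, coordinates, and load threshold are invented for the example: map the resolver's apparent location to the nearest CDN node that is both healthy and not overloaded.

```python
import math

# Invented example data: candidate CDN nodes with rough coordinates and current load.
NODES = [
    {"name": "iad", "lat": 38.9, "lon": -77.4, "healthy": True,  "load": 0.92},
    {"name": "ord", "lat": 41.9, "lon": -87.6, "healthy": True,  "load": 0.40},
    {"name": "fra", "lat": 50.1, "lon": 8.7,   "healthy": False, "load": 0.10},
]

def distance(lat1, lon1, lat2, lon2):
    # Great-circle distance in km (haversine); good enough for "which node is closer".
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def pick_node(resolver_lat, resolver_lon, max_load=0.85):
    # Skip unhealthy or hot nodes, then take the geographically closest survivor.
    candidates = [n for n in NODES if n["healthy"] and n["load"] < max_load]
    return min(candidates, key=lambda n: distance(resolver_lat, resolver_lon, n["lat"], n["lon"]))

# A resolver that appears to be near Washington, DC: iad is closest but hot, so ord wins.
print(pick_node(38.9, -77.0)["name"])  # ord
```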
You have to have real business buy in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods.
No. You can sue AWS for losses, just like you can sue anyone for anything.

Fixing this issue means spending even more time and money validating software, which is not something the stockholders want to hear about. Of course, what if all the users sue for losses, which will cost more money? It will be all about the balance sheet.
I'd think a serious peer or near-peer cyberattack would be a bit more nuanced, at least at first. Perhaps sow disinformation in a variety of places, deliberately contradictory and intended to confuse and/or generate anxiety. Then sever or disrupt pre-targeted channels of communication, and so on. Poison or otherwise interfere with GPS. I figure it would be ugly, and not simply aim for bringing down data centers.

I guess we now know which data center a foreign actor will strike first if it comes to a large-scale politically motivated cyberattack.
And cross-region replication is consistent and performant for which service, exactly?

We are 100% on AWS and we survived this event just fine, without a single hiccup. It's not even that hard: create a replicated database in another region and have a simple DNS-based failover ready.
We also don't run anything in us-east-1, on purpose.
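For reference, the "simple DNS-based failover" described above maps to something like a Route 53 failover record pair tied to a health check. A hedged sketch with boto3, where the hosted zone ID, health check ID, and endpoint names are placeholders:

```python
import boto3

route53 = boto3.client("route53")

# Placeholders: the hosted zone, health check, and endpoints are illustrative only.
HOSTED_ZONE_ID = "Z0000000EXAMPLE"
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"

def upsert_failover_pair():
    """Create or refresh a PRIMARY/SECONDARY failover pair for app.example.com."""
    changes = []
    for role, target, extra in [
        ("PRIMARY", "app-eu-west-1.example.com", {"HealthCheckId": PRIMARY_HEALTH_CHECK_ID}),
        ("SECONDARY", "app-us-west-2.example.com", {}),
    ]:
        record = {
            "Name": "app.example.com",
            "Type": "CNAME",
            "SetIdentifier": f"app-{role.lower()}",
            "Failover": role,  # Route 53 answers with SECONDARY only when the PRIMARY health check fails
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
            **extra,
        }
        changes.append({"Action": "UPSERT", "ResourceRecordSet": record})
    return route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID, ChangeBatch={"Changes": changes}
    )
```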
Postgres does not scale to DynamoDB scale.

Atomic operations (called transactions) exist in at least PostgreSQL. I imagine most advanced databases have them, but my DB knowledge is extremely superficial.
Yeah, ppl mad, but you're not wrong - it's like the famous "transaction 101" problem - a withdrawal from an ATM.

I don't know - I am in healthcare and we deal with single points of failure that fail all of the time, and we just deal with it (and yes, sometimes people die - or worse). I am not saying that it is good, but there is only so much that we can do - even with endless resources, which we clearly don't have - to mitigate against everything...
Heck, we only have one sun keeping the solar system warm; what is our backup plan for when that backup generator gets unplugged?
Oof. I had a friend who rebuilt an old 1930s car. Clearly had the knowledge and the skills - he even managed to finagle an ABS into it.

One company I worked for, we had to do 1000-hour approval tests on certain electronic components. No worries: UPS and a big Diesel generator which, in fact, you could see from one side of the lab block.
We were told there was going to be a live system test, power would go off, UPS come on, then Diesel, then power would be restored. So, we're ready for that.
Power goes off, the local UPS comes online, then a few seconds later the main UPS takes over. The Diesel starts up. After a few minutes it produces white smoke. Then there is a tremendous bang, the Diesel stops, the UPS resumes - and runs down before the clown in charge gets to restore mains power.
It does help to remember, when the Diesel is overhauled, to put in the new oil after draining the old oil.
Yes, for this system, at AWS scale, you want multiple updaters: seamless failover, performance, etc.

I'd agree that the race would be closer to the root cause of the outage. However, a check to prevent more than one updater from being active simultaneously should have prevented further propagation of the damage. That is, unless multiple updater instances were an intentional part of the design, with some sort of mechanism to share state between them?
Admittedly, there is a great deal about this system of which I am woefully ignorant.
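One generic way to get that "only one active updater" guarantee is a short-lived lease taken with a conditional write. A sketch against a hypothetical DynamoDB table named updater-leases; this is a common pattern, not AWS's internal mechanism:

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
LEASE_TABLE = "updater-leases"  # hypothetical table with a string partition key "lease_id"

def try_acquire_lease(owner, ttl_seconds=30):
    """Return True if we became the single active updater, False if someone else holds the lease."""
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName=LEASE_TABLE,
            Item={
                "lease_id": {"S": "dns-plan-updater"},
                "owner": {"S": owner},
                "expires_at": {"N": str(now + ttl_seconds)},
            },
            # The write succeeds only if no unexpired lease exists, so two updaters
            # can never both believe they are the active one.
            ConditionExpression="attribute_not_exists(lease_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False
        raise
```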
No.

Why are we blaming customers here? The customers in us-east-1 were the most impacted, yes, but the fault here is totally on Amazon. Why were customers in Germany, the UK, and basically the whole world impacted by a failure in a US region?
This basically showed, once again, that while AWS reps are selling you multi-region, multi-AZ and a bunch of stuff for redundancy, their internal core services are still pinned to the us-east-1 region, and if that region goes down, it doesn't matter where you are or what your redundancy setup is, you are going down.
Software systems are fragile.

The cloud is fragile; to make matters worse, you really need to use multiple vendors to be safe. There was the pension provider in Australia who had their account accidentally deleted by their cloud provider, with total loss of their data on that provider. Oops.
Due to differences in how services are provided, it would be extremely difficult to have fault tolerance across multiple vendors.
Using cloud providers creates a lot of risks that are poorly managed by most companies.
Everything works well until it doesn’t.
Best advice I got was to write "Check the oil" across the speedometer in wax pen.

Oof. I had a friend who rebuilt an old 1930s car. Clearly had the knowledge and the skills - he even managed to finagle an ABS into it.
Spent a couple years on it.
Whiffed at the end by forgetting the oil.
Was too depressed to go on after that.
Checklists are good. Having someone check your work, no matter how good you are - even better.
Dusfud, how long have you been working? 10 years tops?

No. You can sue AWS for losses, just like you can sue anyone for anything.
But, you're not going to win. Also, AWS is free to drop you as a customer if you sue them. They may not though, because they know you'll lose, and AWS is always happy to take anyone's money.
You might get credits back, but AWS's SLAs are less concrete than they appear in the marketing material.
That said, AWS is fairly good at comping folks when they genuinely mess up, regardless of SLA terms. But they have to mess up in a noteworthy way.
And, in the end, your other options are not great:
- MS cloud is a joke, unless you really don't care about security, at all. Or actual elasticity at scale.
- Google has some nice (often superior) products (e.g. Spanner, which has zero equivalent offerings elsewhere). But when anything breaks - good luck - their support is pretty much a bunch of "works for me, you must have done something wrong" tech bros. They have some issues scaling too. I've had them say they may not have enough BigTable nodes unless you reserve ahead of time.
- Do it yourself. Unless you have a technically trivial system (e.g., one where a replicated relational DB and a k8s cluster suffice), good luck doing it yourself.
That's not a problem with diesel backup, that's a problem with people stealing.

So many issues with diesel backups. The problem at my dad's company was that people kept siphoning the diesel from the tank, so it was always empty when they needed it. It took them years of this to finally buy a fence to put around it.