A DNS manager in a single region of Amazon's sprawling network touched off a 16-hour debacle.
As the enactor operated, it “experienced unusually high delays needing to retry its update on several of the DNS endpoints.”
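Per the public postmortem coverage, those delays mattered because a slow Enactor could apply a DNS plan that was stale by the time it landed, overwriting a newer one. Below is a minimal sketch of the kind of version guard that contains such a race; the names (`DnsState`, `apply_plan`) are hypothetical, and this illustrates a generic check-at-apply-time compare-and-swap, not AWS's actual code:

```python
import threading

class DnsState:
    """Holds the currently applied DNS plan and its version (illustrative)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.version = 0
        self.records = {}

    def apply_plan(self, plan_version: int, records: dict) -> bool:
        # Re-check the version *inside* the critical section, at apply time.
        # Checking earlier and applying later is exactly the stale-write race:
        # the check can be outdated by the time the write happens.
        with self._lock:
            if plan_version <= self.version:
                return False  # stale plan from a delayed enactor; refuse it
            self.version = plan_version
            self.records = records
            return True
```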
“…not zero failure but contained failure”

Is a circuit breaker a reasonable analogy?
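For context on the analogy: a software circuit breaker trips after repeated failures and fails fast instead of hammering a sick dependency, which is one way to get "contained failure." A minimal sketch of the pattern; the `CircuitBreaker` class and its thresholds are illustrative, not from any particular library:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Contained failure: refuse immediately rather than pile on retries.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let the next call try
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```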
“The way forward,” Ookla said, “is not zero failure but contained failure, achieved through multi-region designs, dependency diversity, and disciplined incident readiness, with regulatory oversight that moves toward treating the cloud as systemic components of national and economic resilience.”

*snickers*
"No worries, mate! We're in Oracle Cloud!"

As long as you are up to date with your quarterly patches, your mandatory patches, patch molecules, WebLogic patches, OPatch patch installer patches, and the 27 different Java patches, you should be fine.
"Automating DNS changes for load balancing seems like a bad idea in the first place..."

Content delivery networks have done some version of automating DNS changes for load balancing for decades. They see the source IP addresses of incoming DNS requests from users' ISPs' DNS servers, look up where those source addresses are likely located, and then answer with the address of an active CDN server close to the resolver's apparent location. Should a CDN content server fail or get heavily loaded, the CDN's DNS server will reply with another CDN content server's IP address.
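A rough sketch of the mechanism described above. The tables (`GEO_BY_PREFIX`, `NODES`) are invented stand-ins for the real geo-IP databases and live health/load feeds a CDN would use:

```python
# Hypothetical sketch of DNS-based CDN load balancing, as described above.
GEO_BY_PREFIX = {
    "203.0.113.": "eu-west",    # resolver IP prefixes -> approximate region
    "198.51.100.": "us-east",
}

NODES = {
    "eu-west": [("192.0.2.10", True), ("192.0.2.11", False)],  # (IP, healthy?)
    "us-east": [("192.0.2.20", True)],
}

def answer(resolver_ip: str) -> str:
    """Pick a healthy CDN node near the requesting resolver."""
    region = next((r for p, r in GEO_BY_PREFIX.items()
                   if resolver_ip.startswith(p)), "us-east")
    healthy = [ip for ip, ok in NODES[region] if ok]
    # If the nearby node fails or is overloaded, fall back to any healthy node:
    # the "reply with another CDN content server's IP address" behavior.
    if not healthy:
        healthy = [ip for nodes in NODES.values() for ip, ok in nodes if ok]
    return healthy[0]  # sketch assumes at least one healthy node exists
```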
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

You have to have real business buy-in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods.
Unfortunately, economics works directly against this. Unless the cost of the occasional failure is very high, it can frequently make more sense to avoid the complexity & cost of redundant systems and all the extra design/maintenance/testing/admin they require. Eliminating single points of failure is HARD and you pretty much always miss something that doesn't show up in controlled testing.
I haven't seen any explanation anywhere of part of the root cause (in addition to the race condition):
What caused the "unusually high delays" to the first enactor?
"Sooooo.....how was your first code commit on your first day at Amazon...?"

Heh, from firsthand experience I can tell you that you won't be committing code on your first day.
I've always wondered about the danger of systems that spawn processes/threads on demand. Is this not a situation that is considered - the "runaway" process spawn, with each new process colliding with the ones already operating? Especially when the new process spawns because the existing process is having an issue? Then two have an issue, so a third begins, ad infinitum.
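One common containment for the cascade described above is a hard cap on concurrent workers, with refusal (or backoff) when the cap is hit. A minimal sketch; `MAX_WORKERS` and `spawn_worker` are invented for illustration:

```python
import threading

MAX_WORKERS = 4  # hard cap: never spawn past this, however unhealthy things look
_workers_sem = threading.BoundedSemaphore(MAX_WORKERS)

def spawn_worker(target) -> bool:
    # Refuse to spawn when the cap is reached. Without this check, "worker is
    # slow -> spawn another -> both contend -> spawn a third" can run away
    # exactly as described above.
    if not _workers_sem.acquire(blocking=False):
        return False

    def run():
        try:
            target()
        finally:
            _workers_sem.release()  # free the slot even if the worker crashes

    threading.Thread(target=run, daemon=True).start()
    return True
```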
Even in my little home network, I thought I had eliminated single points of failure. Dual ISPs to a router capable of auto failover, with a second router in shadow mode that could auto take over if the first router failed. Perfect.

Nope. Fiber cable out of the first router to the core switch failed. Shadow router did not take over because the failure was downstream.

Then again - I have a core switch as well, no failover there, although it's pretty easy to manually bypass.

Single points of failure are sometimes hard to identify.

Hard to identify is certainly not the same as failing to validate that your CoB plan works.
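The story above illustrates why failover logic should probe end-to-end reachability through the primary path rather than just the neighbor device's liveness. A small sketch of such a probe; the target host and port are arbitrary placeholders:

```python
import socket

def path_is_healthy(host: str = "1.1.1.1", port: int = 53, timeout: float = 2.0) -> bool:
    """Probe end-to-end reachability through the primary path.

    Connecting out to a well-known endpoint exercises the router, the cable
    to the core switch, and the ISP link together. A check that only pings
    the primary router would miss a failed link downstream of it, like the
    router-to-switch fiber described above.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```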
"You have to have real business buy-in to effectively test different DR scenarios. It can be hard for us to test during our scheduled downtimes because there is so much stuff that has to be done during those periods."

I had an HA setup at one point where we thought we'd identified and eliminated all the SPOFs for critical services and tested thoroughly. Then we had a real power outage, and everything worked perfectly - UPSs carried the load until the generator spun up, the ATS worked smoothly, all the various network bits (including the upstream carrier stuff) failed over, notices went out, etc. The decision was made not to activate the DR site because the redundancy built into the primary handled it.
Thank you, Dan. This is the first article I've read that explained the situation in detail.
I've had many discussions in several circles of coworkers and former coworkers about "multi-tenant" or "multi-region" architecture. The one thing that stands out to me is that companies that went the hard way and manage their rack servers, hypervisors, or even VMs directly have a much shorter and cheaper path to data center redundancy than companies that went all in on AWS. For reasons that aren't entirely clear to me, companies that have made literally billions on a cookie-cutter AWS deploy with little redundancy can't seem to afford, or don't want to pay for, the additional safety. From what I've heard, using multiple IaaS providers is really fucking expensive.
"The event serves as a cautionary tale for all cloud services: More important than preventing race conditions and similar bugs is eliminating single points of failure in network design."

Our IT dept was effectively frozen on tracking any issues for the day.