Cloudflare broke much of the Internet with a corrupted bot management file

This is causing us huge headaches. We run a platform hosted on AWS, and this incident took down all our clients' access. Now we have clients with multimillion-dollar-per-year contracts complaining about our platform's resilience.


They are right. Your platform was too reliant on a single point of failure, with no redundancy.
 
Upvote
25 (26 / -1)

Sam Hobbs

Smack-Fu Master, in training
27
Not sure why their list of future fixes doesn't include the ever-popular "test all changes in a staging environment first".
That is easy to say, but it is not always possible to anticipate what to test for. In this case, if they had anticipated the possibility of the file being too big, they could have written the program to allow for a bigger file.
should always be reported quickly and explicitly somewhere that a lot of different eyeballs can see
That is what I'm saying: something needed to be done to report the problem in a manner such that they would have known about it sooner.
identify all edge cases
Testing for all possibilities might not be practical.
 
Upvote
12 (12 / 0)

balthazarr

Ars Tribunus Angusticlavius
6,904
Subscriptor++
The internet was fine. None of my systems were affected in the slightest. Clients of one company had a problem. That was it.
YOU were barely affected, so that means others weren't?

I get your point - the Internet is more than just the web and, even then, only a subset was affected - except that to 99% of the population, it's not. Everything they do is "the web" - unless it's on an app, which half the time is just a wrapper for a web page anyway.

My point is a handful of companies control vast swaths of everything that is essential to the vast majority of people, and it wasn't intended to be that way.
 
Upvote
-7 (5 / -12)

yopmaster

Wise, Aged Ars Veteran
185
It may be easy for some to criticize after the fact, but they are victims of hard-to-foresee interactions between software and infrastructure.
This is the kind of scenario I have feared since I started to work with the common machine-learning stacks (SQL/Python, maybe some Java). The languages and frameworks check the types of variables, but don't do much about what is inside the dataframes: it is so easy for engineers to forget some checks here and there. Not to mention that "data teams" often don't share the same test culture as "regular" engineering teams.
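By "checks" I mean something like the following rough sketch - written in Rust only because that is the language of the Cloudflare component that failed; the idea is the same for a pandas dataframe. The struct, function names, and the 200-row cap are all made up for illustration: the types compile regardless of what the data holds, so the content checks have to be added (and remembered) by hand.

#[derive(Debug)]
struct FeatureRow {
    name: String,
    weight: f64,
}

const EXPECTED_MAX_ROWS: usize = 200;

// Type-checks no matter what the rows actually contain...
fn load_unchecked(rows: Vec<FeatureRow>) -> Vec<FeatureRow> {
    rows
}

// ...so content-level validation has to be written explicitly, and is easy to forget.
fn load_checked(rows: Vec<FeatureRow>) -> Result<Vec<FeatureRow>, String> {
    if rows.len() > EXPECTED_MAX_ROWS {
        return Err(format!("unexpected row count: {}", rows.len()));
    }
    if rows.iter().any(|r| r.name.is_empty() || !r.weight.is_finite()) {
        return Err("empty name or non-finite weight in feature data".into());
    }
    Ok(rows)
}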
In fact I'm surprised such disasters don't happen more often.
 
Upvote
5 (5 / 0)

Tam-Lin

Ars Scholae Palatinae
845
Subscriptor++
This is causing us huge headaches. We run a platform hosted on AWS, and this incident took down all our clients' access. Now we have clients with multimillion-dollar-per-year contracts complaining about our platform's resilience.
And they’re right to do so. Your architectural and vendor choices meant your platform went down.

Now, maybe they aren’t willing to pay for a more resilient platform, but ultimately, you’re selling them services. That you built your service on something that isn’t 100% available is on you. You need to do a better job of expectations management.

And this is a fundamental issue with any distributed system. You have dependencies you don't think about, or sometimes don't even know about. But ultimately, you're only as available as the weakest link in your dependency graph.
 
Last edited:
Upvote
22 (22 / 0)

Tam-Lin

Ars Scholae Palatinae
845
Subscriptor++
That is easy to say, but it is not always possible to anticipate what to test for. In this case, if they had anticipated the possibility of the file being too big, they could have written the program to allow for a bigger file.
I’m sorry, but in this case, that’s not true. Limits/boundary testing is function test 101. If you tell me something has a limit, I’m going to test just below the limit, at the limit, and just above the limit. And possibly way over the limit if there’s time/resources.
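Sketched out, that looks something like this (in Rust; the limit, the function, and the test names are all invented for illustration, not anything from Cloudflare's codebase):

const FEATURE_LIMIT: usize = 200;

fn accept_feature_file(rows: usize) -> Result<(), String> {
    if rows > FEATURE_LIMIT {
        return Err(format!("{rows} rows exceeds the {FEATURE_LIMIT}-row limit"));
    }
    Ok(())
}

#[cfg(test)]
mod limit_tests {
    use super::*;

    // Just below, at, and just above the limit, plus one way-over case.
    #[test]
    fn just_below_limit_is_accepted() {
        assert!(accept_feature_file(FEATURE_LIMIT - 1).is_ok());
    }

    #[test]
    fn at_limit_is_accepted() {
        assert!(accept_feature_file(FEATURE_LIMIT).is_ok());
    }

    #[test]
    fn just_above_limit_is_rejected() {
        assert!(accept_feature_file(FEATURE_LIMIT + 1).is_err());
    }

    #[test]
    fn way_over_limit_is_rejected() {
        assert!(accept_feature_file(FEATURE_LIMIT * 10).is_err());
    }
}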
 
Upvote
-6 (11 / -17)

SubWoofer2

Ars Tribunus Militum
2,664
Third-party stuff would come into the systems via a channel with its own change process and checks and balances, and then the internal processes would apply, I expect, so it should get more review.
Which leads to the question of whether, for business-critical designs, a comparable level of due diligence might be worth applying internally as well.

Does anyone at Cloudflare wear a black shirt?
 
Upvote
0 (0 / 0)

Fred Duck

Ars Tribunus Angusticlavius
7,336
This is why The Cloud™ frightens me. If a malicious actor, say, deletes all synced data, do Cloud™ providers just delete everything everywhere?

If I was in charge of a system wherein changes on one node propagate throughout the network, I'd design the system to automatically back up the database in the event there's a drastic change. The system should, of course, notify operators and allow for restoration in future.

Have we learnt nothing?

(This is not a Me Problem, as my Cloud™ data is also backed up locally, so if someone destroyed my Cloud™ data, I'd simply use my Time Machine to fetch a copy from the past. But not everyone is as detail-oriented as I am and backs up.)
 
Upvote
0 (4 / -4)

mygeek911

Ars Scholae Palatinae
948
Subscriptor++
I’m sorry, but in this case, that’s not true. Limits/boundary testing is function test 101. If you tell me something has a limit, I’m going to test just below the limit, at the limit, and just above the limit. And possibly way over the limit if there’s time/resources.
Taco Bell learned this lesson the hard way.

https://www.bbc.com/news/articles/ckgyk2p55g8o
 
Upvote
2 (3 / -1)

OrvGull

Ars Legatus Legionis
11,881
Remember when they claimed the Internet could survive nuclear warfare, being designed with that in mind?

Pepperidge Farm remembers. https://www.rand.org/pubs/articles/2018/paul-baran-and-the-origins-of-the-internet.html

Now, we've allowed it to devolve into a situation where one file can take down massive swaths of it.
There were (and are) hardened networks of which that was true, but it was never true of the public internet.

The profit motive always argues against redundancy, because redundancy means you have capacity going to waste any time there isn't a problem.
 
Upvote
12 (12 / 0)

SeanJW

Ars Legatus Legionis
11,947
Subscriptor++
This is why The Cloud™ frightens me. If a malicious actor, say, deletes all synced data, do Cloud™ providers just delete everything everywhere?

If I was in charge of a system wherein changes on one node propagate throughout the network, I'd design the system to automatically back up the database in the event there's a drastic change. The system should, of course, notify operators and allow for restoration in future.

Have we learnt nothing?

(This is not a Me Problem, as my Cloud™ data is also backed up locally, so if someone destroyed my Cloud™ data, I'd simply use my Time Machine to fetch a copy from the past. But not everyone is as detail-oriented as I am and backs up.)

Cloud providers have their own backup, but if you delete your account, well, you made the choice. If it's significant enough the cloud provider may assist in recovery, but they're under no obligation to. Manage your backups carefully - and access control as well.
 
Upvote
2 (2 / 0)

Tam-Lin

Ars Scholae Palatinae
845
Subscriptor++
This is why The Cloud™ frightens me. If a malicious actor, say, deletes all synced data, do Cloud™ providers just delete everything everywhere?

If I was in charge of a system wherein changes on one node propagate throughout the network, I'd design the system to automatically back up the database in the event there's a drastic change. The system should, of course, notify operators and allow for restoration in future.

Have we learnt nothing?

(This is not a Me Problem, as my Cloud™ data is also backed up locally, so if someone destroyed my Cloud™ data, I'd simply use my Time Machine to fetch a copy from the past. But not everyone is as detail-oriented as I am and backs up.)
That’s why businesses that care about their data (so, not Facebook, but your bank) send archive tapes to be stored in underground bunkers on a regular basis.

But sending out a notification and expecting operators to do something/respond isn’t reasonable. Things happen too fast. Yeah, if you freeze things / don’t apply a change, you might prevent an outage due to data loss/changes, but in the time the operators are making a decision/consulting/whatever, work is either backing up in various places or just being dropped, which is also an outage. Either way, you aren’t meeting your SLAs.
 
Upvote
1 (2 / -1)

Tam-Lin

Ars Scholae Palatinae
845
Subscriptor++
There were (and are) hardened networks that was true of, but it was never true of the public internet.

The profit motive always argues against redundancy, because redundancy means you have capacity going to waste any time there isn't a problem.
Part of the problem is that we as humans hate what we perceive as price gouging. Say you're a store that decides to stockpile goods so that you can handle some sort of supply disruption, meaning you're going to lock up capital and decrease returns when there isn't a crisis. If you then try to raise prices when such a disruption happens, you'll be a pariah and may be fined or put in jail, depending on the state. I'd argue that if it were legal to charge more when there's limited supply of something, it might then make economic sense to have some excess capacity lying around, because you could make up the profits you miss during quiet times with the money you make during crisis times, but our monkey brains object.
 
Upvote
-10 (3 / -13)

eldakka

Ars Tribunus Militum
1,747
Subscriptor
Not sure why their list of future fixes doesn't include the ever-popular "test all changes in a staging environment first".
They very well may have. However, if the dataset in the stage database is smaller than prod, then a doubling of the file size in stage may have kept it under the 200 limit, so it may have passed fine.

Replicating data across environments can be difficult, as you often just can't put prod data into stage (for privacy/security reasons), so you often use 'simulated' datasets in non-prod environments.

We had a similar-in-concept issue at my organisation about a decade ago. It was an Oracle database upgrade. It went fine until they did it in prod, and all hell broke loose - database queries were timing out, the CPU usage of the database cluster went through the roof, even an emergency live-doubling of the CPUs didn't fix the performance issues ...

What happened was that the new version of the database software re-created query execution plans - that is, the optimizations for various queries were reset to new values by the upgrade. Since this didn't cause any issue through at least 4 different non-prod environments, everyone ticked it off for prod release.

The broken query optimizations ran fine on staging tables with a million rows. But when the same thing happened in prod, with tables that have billions of rows, mayhem ensued.
 
Upvote
26 (26 / 0)

alansh42

Ars Praefectus
3,642
Subscriptor++
Part of the problem is that we as humans hate what we perceive as price gouging. Say you're a store that decides to stockpile goods so that you can handle some sort of supply disruption, meaning you're going to lock up capital and decrease returns when there isn't a crisis. If you then try to raise prices when such a disruption happens, you'll be a pariah and may be fined or put in jail, depending on the state. I'd argue that if it were legal to charge more when there's limited supply of something, it might then make economic sense to have some excess capacity lying around, because you could make up the profits you miss during quiet times with the money you make during crisis times, but our monkey brains object.
This doesn't work. If events requiring the stockpile are common, everybody will keep inventory and you won't have an advantage. If they're not, you will most likely end up holding inventory for years and an elevated price during a shortage won't make up for the costs. The only winners are the ones that had the random luck of having inventory that they could sell far over their costs.

Allowing gouging won't increase supply. The simplistic libertarian view is that goods can always appear when needed, if only the government would get out of the way. It's the spherical frictionless consumer on an infinite plane version of economics.
 
Last edited:
Upvote
10 (13 / -3)
This sounds a bit similar to how CrowdStrike BSOD'd thousands of Windows machines, except Cloudflare did it to their own machines. Ooops. One bad file really shouldn't be able to take down entire systems, but maybe that's just me.
The root of most problems in the world is just one word: assumptions. The Cloudflare outage was caused by an assumption that a situation would not happen. They were proven wrong.

The world runs on assumptions. The world is assumptions. Sure, we try to replace those assumptions with proven facts, but chances are that there will always be more assumptions further down in the stack. Even the most careful of setups have assumptions buried somewhere deep down below.
Heck, even mathematics, the high mark of proof and facts ... relies on assumptions. We can't prove the most basic foundation of mathematics (and there is even a proof of that :D ).
 
Upvote
15 (17 / -2)

gavron

Ars Tribunus Militum
1,595
This is causing us huge headaches. We run a platform hosted on AWS, and this incident took down all our clients' access. Now we have clients with multimillion-dollar-per-year contracts complaining about our platform's resilience.
No, you don't.

Any company with "multimillion dollar per year contracts" that uses a middleman instead of contracting with CF is staffed by idiots who have hired idiots. Your headache is that YOU put all YOUR eggs in one basket and failed to deliver redundancy and reliability.

But that didn't stop you from ALLEGEDLY billing multimillion dollars per year. Your headache is from your shame at your failure to plan for exigent circumstances. Go stand in front of a mirror and take an aspirin.

Next time, if you're billing 7-8 digits, act like it and don't go whining about your aches.
 
Upvote
16 (16 / 0)

Mr. Barky

Smack-Fu Master, in training
4
Not sure why their list of future fixes doesn't include the ever-popular "test all changes in a staging environment first".
To test a problem, you have to imagine the failure mode. Production will always turn up something that you didn't explicitly test for. The number of possible "bad" files far exceeds your ability to test them all. A staging environment can never include everything that is in production. I am sure 100% of the tests returned fewer than 200 features.

A review of the code could (and arguably should) have turned up this failure mode to add to test cases - but that is easy to say in hindsight. A programmer might have reviewed the code and thought it was normal and without consequence, as there will "never" be more than 200 features. Or maybe the reviewer was just tired the night of the review (code reviews can be tedious and boring) and failed to even see it could cause problems.
 
Upvote
13 (13 / 0)

markrk

Seniorius Lurkius
24
Hardcoding your configuration causes problems.
There's a reason Microsoft couldn't make a Windows 9.
Mad Klingon said:


Seems someone left out a test similar to

If rowcount >= 200 then
{
print errormessage;
don't propagate file;
use old file until human fixes problem;
}


Actually, the test was the issue. They did an unwrap() instead of reverting. That is, they did a Rust "assert" that did not roll back but just failed, because the state was thought to be unrecoverable.
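Roughly the difference between the two behaviours, sketched in Rust (the types, names, and limit are invented; this is not Cloudflare's actual code):

struct FeatureFile {
    rows: Vec<String>,
}

const MAX_ROWS: usize = 200;

fn validate(file: FeatureFile) -> Result<FeatureFile, String> {
    if file.rows.len() > MAX_ROWS {
        return Err(format!("{} rows exceeds the {MAX_ROWS}-row limit", file.rows.len()));
    }
    Ok(file)
}

// What an unwrap()-style assert gives you: a bad file kills the whole process.
fn load_or_panic(new_file: FeatureFile) -> FeatureFile {
    validate(new_file).unwrap()
}

// What Mad Klingon is describing: reject the bad file and keep serving the last known-good one.
fn load_or_keep_old(new_file: FeatureFile, current: FeatureFile) -> FeatureFile {
    match validate(new_file) {
        Ok(f) => f,
        Err(e) => {
            eprintln!("rejecting new feature file: {e}");
            current
        }
    }
}

The second version assumes that serving the last known-good rules is a safe fallback, which is the judgment call Mad Klingon's pseudocode makes.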

Rust can have bugs that take out large swaths as well. Who would have thunk it, based on what the Rust snobs say.

I'm only half ripping on Rust here, as it does do good things... but I am a bit sick of a certain segment of people who seem to think it has a real AI in it catching errors.
 
Upvote
4 (4 / 0)

cheesecakegood

Smack-Fu Master, in training
84
What I find interesting is not the cause - unexpected interactions happen, bugs happen (or unplanned consequences that aren't strictly a bug), and they did an RCA and handled it appropriately from what I can understand. Not a happy event, but professionally dealt with.

It's how low the numbers involved are. More than 200 is double the amount of features? So they're looking at fewer than 200, probably closer to 100 (otherwise they'd be more twitchy about that limit in normal operation). I putter around with a toy detector just to keep my hand in (and swear at how utterly stupid bot writers are at making it obvious sometimes), and it tracks maybe a couple of dozen features I play with manually - no automation or intelligence in creating those features, it's all me fiddling. And it turns out even with some much more clever people and automation doing the work, they're still only needing a low number of request features to pick up botnets more accurately?

I think that goes to show that bot writers really are utterly lazy if they're still detectable with that few features. That's not to denigrate the work Cloudflare (and other companies) do in detecting them; that they're able to do it with such relatively little work shows they're doing good work.

Edit: And they are clever people, btw - those features have to be calculated fast, in real time, so as not to impact the request performance for good requests, and legitimate clients of all sorts exist that are in many cases just as stupid as botnets. You can't just assume that what would normally be "illegal" requests are actually bad - I've seen clients do things that are outright illegal under a strict reading of the HTTP protocols but work because the backend server knows about it and is expecting it, so a proxy fronting onto it just has to let it through.

Edit 2: My nightmare fuel for this? Online banking. You want absolute peak paranoia. Lots of the client apps are shoddy, farmed out to whatever third party was the cheapest contractor. If you're lucky, it's just a thin wrapper around their web site. And they're coming from all sorts of crappy phones that may have been released in the last decade (or more!) and never had a single security update. Trying to work out "is this an attack, or did everyone just get paid today" is a complete pain.

Edit 3: As an example of weird shit clients do - the Bitwarden password manager uses a web socket for notifications if you're not on a mobile client (which uses normal phone notifications); it's used to notify of changes so all clients keep in sync. No problems. Web sockets are a pain but well understood. Open a request, do an Upgrade etc. What request does Bitwarden open? CONNECT. CONNECT is not typically a client request to a server, it's a client request to a proxy - it's (ironically) used to open a socket, typically used for TLS. It's CONNECT host: port HTTP/x etc. Bitwarden uses CONNECT /path HTTP/x . Good effort. So if something is twitchy about people trying to sneak proxy requests through instead of legitimate requests, Bitwarden will be setting off alarm bells all over the place...
To be fair re: number of features, they are almost guaranteed to be artificially compressed features, assembled from a much larger amount of data about bot behavior and other signals. The number of features only very mildly implies that the underlying behavior is not complex. Also, you have to remember that the space of interactions between features has the curse-of-dimensionality thing going on: the number of two-way relationships grows quadratically with the number of features (80 features already give over 3,000 pairs), and higher-order interactions grow combinatorially.

So your intuition here is likely not correct. That's not what the machine-learning math tells us. Especially weak is asserting that, e.g., 80 features implies that the underlying behavior is not very complex. It's not really a rule of thumb, but to give you an idea, once you get past a couple dozen features this implication starts to disappear. For small, sub-20 feature counts it might be justifiable! But most people forget that higher-order dimensionality is not intuitive. I doubt Cloudflare uses that few.
 
Upvote
-1 (0 / -1)

Demento

Ars Legatus Legionis
15,477
Subscriptor
Not sure why their list of future fixes doesn't include the ever-popular "test all changes in a staging environment first".
(image attachment)
 
Upvote
-3 (1 / -4)

Xyler

Ars Scholae Palatinae
1,400
You know Cloudfront is able to do most of what Cloudflare does right? Is there any special Cloudflare features you rely on specifically? (I only mention it because you're already all in on AWS - it's not an endorsement of Cloudfront over any other CDN type product btw)

Edit: "No cost effective/worthwhile changing" is a perfectly fine answer :) I understand the world is more complicated than "just change!" offers.
Free DDoS and WAF protection. Cloudflare does a lot for people. Even my itty-bitty homelab is protected by Cloudflare's DDoS protection, at no charge to me.
 
Upvote
5 (5 / 0)

Xyler

Ars Scholae Palatinae
1,400
I’m sorry, but in this case, that’s not true. Limits/boundary testing is function test 101. If you tell me something has a limit, I’m going to test just below the limit, at the limit, and just above the limit. And possibly way over the limit if there’s time/resources.
Reminds me of a joke I once heard:

A developer is testing their new beer bar. They order 1 beer, 10 beers, 9999999 beers, -1 beers, 0 beers, everything seems to be working as intended.

First real customer walks into the bar, asks where the bathrooms are. The Bar explodes.
 
Upvote
18 (18 / 0)

niwax

Ars Praefectus
3,344
Subscriptor
The failure mode is something that could happen to anyone. I have myself pushed an update that was supposed to vastly improve error handling by including a detailed JSON response, which promptly took out an outdated system a few thousand builds behind that checked for errors by using "if response.startswith("Error")" instead of the response code. Shit happens.
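The fragile check versus the robust one, roughly (a Rust sketch with made-up names, not the actual code involved):

// Breaks the moment the server starts returning structured JSON errors.
fn is_error_fragile(body: &str) -> bool {
    body.starts_with("Error")
}

// Survives any change to the body format.
fn is_error_robust(status_code: u16) -> bool {
    (400..600).contains(&status_code)
}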

What only people with an overeager infrastructure team have is a globally distributed ClickHouse database for a table with an expected 60 rows that is accessed by a bunch of servers every five minutes. Also, the instability from running an eventually-consistent, globally replicated database. Imagine how boring debugging this would have been if those 60 lines were in cloudflare.com/bots-first-wave.txt and cloudflare.com/bots-full-rollout.txt. Click publish on the first wave. Oops, some backends start failing. Roll back the change instead of going for the full rollout.
 
Upvote
1 (1 / 0)

Tam-Lin

Ars Scholae Palatinae
845
Subscriptor++
Reminds me of a joke I once heard:

A developer is testing their new beer bar. They order 1 beer, 10 beers, 9999999 beers, -1 beers, 0 beers, everything seems to be working as intended.

First real customer walks into the bar, asks where the bathrooms are. The Bar explodes.
The restroom wasn't in the spec. Not my problem.
 
Upvote
12 (12 / 0)

niwax

Ars Praefectus
3,344
Subscriptor
To test a problem, you have to imagine the failure mode. Production will always turn up something that you didn't explicitly test for. The number of possible "bad" files far exceeds your ability to test them all. A staging environment can never include everything that is in production. I am sure 100% of the tests returned fewer than 200 features.

A review of the code could (and arguably should) have turned up this failure mode to add to test cases - but that is easy to say in hindsight. A programmer might have reviewed the code and thought it was normal and without consequence, as there will "never" be more than 200 features. Or maybe the reviewer was just tired the night of the review (code reviews can be tedious and boring) and failed to even see it could cause problems.

The thing is, it could reasonably be assumed that those test cases shouldn't be needed.

The publisher makes sure only ~60 rules can be published.
The database table always only contains ~60 rows.
The consumer contains a panic-and-restart feature if it receives malformed data, including > 100 rows, which is plenty below its memory limit of 200 rows.

You can unit and integration test the inputs and outputs of all of those and it will show up as working as designed. No assumption is ever broken.

Then ClickHouse pushes an update that multiplies the number of rows if an unrelated second table in the database cluster has the same name, and everything goes to shit. Would you have thought to test for that?
 
Last edited:
Upvote
9 (9 / 0)
Not sure why their list of future fixes doesn't include the ever-popular "test all changes in a staging environment first".
I could see a world in which the test environment gave a green light to the change. From the description, there were multiple places where limitations of the testing environment could have prevented this edge case from coming up. By its nature, the test environment for Cloudflare cannot be a 1:1 replica of the scale and complexity of production.

As one developer for City of Heroes I knew once put it, there is no amount of testing that compares to half a million people logging onto your server at the same time. That Cloudflare faces as few breakdowns as it does is fucking amazing.
 
Upvote
8 (8 / 0)

J.King

Ars Praefectus
4,424
Subscriptor
I get your point - the Internet is more than just the web and, even then, only a subset was affected - except that to 99% of the population, it's not. Everything they do is "the web" - unless it's on an app, which half the time is just a wrapper for a web page anyway.
It's more than this: the Internet was never designed to be resilient in the face of such a problem in the first place. Putting aside the whole nuclear-attack myth (it might have been the impetus, but it was never a design goal of what we actually got), a non-responsive hop can be routed around, but once you reach your destination's doorstep, you're forced to rely on any redundancy they have rather than that of the network at large. Cloudflare is an intermediary, but for all intents and purposes they do function as the host for a lot of the public-facing Internet, and if they go down, that's their failure, not the Internet's.

Is it bad that so many eggs are in one basket? Yes, of course. Three years back half the Internet went down across Canada because Rogers Communications suffered a complete outage. This also affected large amounts of cellular phone service, the Interac payment network, and lots of other things you wouldn't expect. In the wake of this there was a push for more redundancy in the technology services sector, but I would be surprised if another such outage isn't just as disruptive, because redundancy is expensive even when it's not very complex. Lots of money, time, and effort can be saved by crossing your fingers and hoping nothing goes wrong rather than actually preparing for the worst.

The problem is less that Cloudflare is too big than that hosts don't have a plan B.
 
Upvote
6 (6 / 0)

hubick

Ars Scholae Palatinae
1,041
Subscriptor
The part that's interesting to me is their not initially realizing the issue was due to changes they were making themselves.

It's hard to believe we do something right where I work - but all our production changes have a change ticket approved by a central group that certifies we have communications and rollback plans, etc., and kinda coordinates that stuff so various groups don't interfere with each other - so, if something were to go wrong, there would definitely be someone who knows what is potentially going on with all our systems at that time that might cause an issue. It's interesting that Cloudflare's database permission changes happened in production and someone wasn't immediately yelling, "THIS COULD BE X, THEY'RE PUSHING CHANGES RIGHT NOW."
 
Upvote
-4 (0 / -4)

SeanJW

Ars Legatus Legionis
11,947
Subscriptor++
It's more than this: the Internet was never designed to be resilient in the face of such a problem in the first place. Putting aside the whole nuclear-attack myth (it might have been the impetus, but it was never a design goal of what we actually got), a non-responsive hop can be routed around, but once you reach your destination's doorstep, you're forced to rely on any redundancy they have rather than that of the network at large. Cloudflare is an intermediary, but for all intents and purposes they do function as the host for a lot of the public-facing Internet, and if they go down, that's their failure, not the Internet's.

Is it bad that so many eggs are in one basket? Yes, of course. Three years back half the Internet went down across Canada because Rogers Communications suffered a complete outage. This also affected large amounts of cellular phone service, the Interac payment network, and lots of other things you wouldn't expect. In the wake of this there was a push for more redundancy in the technology services sector, but I would be surprised if another such outage isn't just as disruptive, because redundancy is expensive even when it's not very complex. Lots of money, time, and effort can be saved by crossing your fingers and hoping nothing goes wrong rather than actually preparing for the worst.

The problem is less that Cloudflare is too big than that hosts don't have a plan B.

They have to evaluate risks versus rewards same as every other situation. Cloudflare (and everyone else) never promise 100% availability. A perfectly valid and justifiable plan B can be "suck it up princess" and for a large number of their customers, that was perfectly fine. A few hours of outage? NBD.
 
Upvote
3 (3 / 0)

SeanJW

Ars Legatus Legionis
11,947
Subscriptor++
The failure mode is something that could happen to anyone. I have myself pushed an update that was supposed to vastly improve error handling by including a detailed JSON response, which promptly took out an outdated system a few thousand builds behind that checked for errors by using "if response.startswith("Error")" instead of the response code. Shit happens.

What only people with an overeager infrastructure team have is a globally distributed ClickHouse database for a table with an expected 60 rows that is accessed by a bunch of servers every five minutes. Also, the instability from running an eventually-consistent, globally replicated database. Imagine how boring debugging this would have been if those 60 lines were in cloudflare.com/bots-first-wave.txt and cloudflare.com/bots-full-rollout.txt. Click publish on the first wave. Oops, some backends start failing. Roll back the change instead of going for the full rollout.

I can guarantee they used a canaried staged rollout of the code. But yay, dynamic systems. What looks good now might be total shit 10 minutes from now.

I remember a code yellow from Google where the code worked perfectly. For a while. Until it started OOMing (something sometimes anticipated and expected, but not at the rate it was happening). And it was still working, just not quite finishing things. So it started falling behind. Which isn't good when it's the garbage collector and your disk usage starts creeping up faster than projections, until one day you find out "shit, we're really out of space and there's 17 PB that needs to be garbage collected really fast...."

Edit: There were some entertaining moments like that.... A data centre literally on fire? Nope, just a heads up, everyone take appropriate actions. A zero entered into a config file where it didn't mean what the committer and the reviewers thought it did? Oh shit, everyone in NetOps all hands on deck... (they thought it meant "unlimited" bandwidth....no, it literally meant 0 bandwidth, and assigning it to a control plane is a very bad idea....)

Edit 2: Did the internet at large notice any of these things? Nope. Because things were caught internally and the public side of things was able to cope fine. But not everything is that lucky.
 
Last edited:
Upvote
0 (1 / -1)

J.King

Ars Praefectus
4,424
Subscriptor
They have to evaluate risks versus rewards same as every other situation. Cloudflare (and everyone else) never promise 100% availability. A perfectly valid and justifiable plan B can be "suck it up princess" and for a large number of their customers, that was perfectly fine. A few hours of outage? NBD.
Sorry, I didn't intend to cast aspersions. For the vast majority of hosts, the cost-to-benefit ratio of guarding against a rare failure like this is indeed hard to justify. Still, the fact remains there's no way around it: you either have redundancy, or you rely on your supplier having perfect uptime. There are few organizations I'd fault for choosing the latter, though.
 
Upvote
0 (1 / -1)