"I worry this is the big botnet flexing," CEO said. But outage was self-inflicted.
This is causing us huge headaches. We run a platform hosted on AWS, and this incident took down all our clients' access. Now we have clients with multimillion-dollar-per-year contracts complaining about our platform's resilience.
Quote: "Not sure why their list of future fixes doesn't include the ever-popular 'test all changes in a staging environment first'."
That is easy to say, but it is not always possible to anticipate what to test for. In this case, if they had anticipated the possibility of the file being too big, they could have written the program to allow for a bigger file.
Quote: "...should always be reported quickly and explicitly somewhere that a lot of different eyeballs can see"
That is what I am saying: something needed to be done to report the problem in a way that would have let them know about it sooner.
Quote: "...identify all edge cases"
Testing for all possibilities might not be practical.
Quote: "The internet was fine. None of my systems were affected in the slightest. Clients of one company had a problem. That was it."
YOU were barely affected, so that means others weren't?
Quote: "This is causing us huge headaches. We run a platform hosted on AWS, and this incident took down all our clients' access. Now we have clients with multimillion-dollar-per-year contracts complaining about our platform's resilience."
And they're right to do so. Your architectural and vendor choices meant your platform went down.
Quote: "That is easy to say but it is not always possible to anticipate what to test for. In this case, if they had anticipated the possibility of the file being too big, they could have written the program to allow for a bigger file."
I'm sorry, but in this case, that's not true. Limits/boundary testing is function test 101. If you tell me something has a limit, I'm going to test just below the limit, at the limit, and just above the limit. And possibly way over the limit if there's time/resources.
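A minimal sketch of that boundary-value habit in practice (the 200-feature cap is the one from the post-mortem; the loader name and error type are made up for illustration):

FEATURE_LIMIT = 200  # the documented cap on bot-management features

def load_feature_file(rows):
    # hypothetical stand-in for the config loader under test
    if len(rows) > FEATURE_LIMIT:
        raise ValueError(f"{len(rows)} features exceeds the {FEATURE_LIMIT} limit")
    return rows

# just below and at the limit: must load
for count in (FEATURE_LIMIT - 1, FEATURE_LIMIT):
    assert len(load_feature_file(list(range(count)))) == count

# just above and way above the limit: must be rejected cleanly, not crash the process
for count in (FEATURE_LIMIT + 1, FEATURE_LIMIT * 10):
    try:
        load_feature_file(list(range(count)))
        raise AssertionError("oversized file should have been rejected")
    except ValueError:
        pass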
Quote: "Third party stuff would come into the systems via a channel with its own change process and checks and balances, and then the internal processes would apply I expect, so it should get more review."
Which leads to the question of whether, for business-critical designs, a comparable level of due diligence might be worth applying.
Quote: "I'm sorry, but in this case, that's not true. Limits/boundary testing is function test 101. If you tell me something has a limit, I'm going to test just below the limit, at the limit, and just above the limit. And possibly way over the limit if there's time/resources."
Taco Bell learned this lesson the hard way.
Quote: "Remember when they claimed the Internet could survive nuclear warfare, being designed with that in mind?"
There were (and are) hardened networks that was true of, but it was never true of the public internet.
Pepperidge Farm remembers. https://www.rand.org/pubs/articles/2018/paul-baran-and-the-origins-of-the-internet.html
Now, we've allowed it to devolve into a situation where one file can take down massive swaths of it.
This is why The Cloud™ frightens me. If a malicious actor, say, deletes all synced data, do Cloud™ providers just delete everything everywhere?
If I were in charge of a system wherein changes on one node propagate throughout the network, I'd design the system to automatically back up the database in the event of a drastic change. The system should, of course, notify operators and allow for restoration in the future.
Have we learnt nothing?
(This is not a Me Problem, as my Cloud™ data is also backed up locally; if someone destroyed my Cloud™ data, I'd simply use my Time Machine to fetch a copy from the past. But not everyone is as detail-oriented as I am and backs up.)
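A rough sketch of that guard, with made-up names and a made-up threshold (this is not any real provider's sync API): snapshot and hold propagation whenever a change looks drastic.

import shutil, time

DRASTIC_DELETE_FRACTION = 0.5  # made-up threshold: flag changes that drop more than half the records

def apply_change(db_path, old_count, new_count, notify):
    # snapshot first and pause propagation if the change looks drastic
    if old_count and (old_count - new_count) / old_count > DRASTIC_DELETE_FRACTION:
        backup = f"{db_path}.{int(time.time())}.bak"
        shutil.copy2(db_path, backup)  # keep a local restore point
        notify(f"drastic change held for review; backup saved to {backup}")
        return False  # don't propagate until an operator approves
    return True  # ordinary change: propagate as usual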
Quote: "This is why The Cloud™ frightens me. If a malicious actor say, deletes all synced data, do Cloud™ providers just delete everything everywhere? [...]"
That's why businesses that care about their data (so, not Facebook, but your bank) send archive tapes to be stored in underground bunkers on a regular basis.
Quote: "There were (and are) hardened networks that was true of, but it was never true of the public internet. The profit motive always argues against redundancy, because redundancy means you have capacity going to waste any time there isn't a problem."
Part of the problem is that we as humans hate what we perceive as price gouging. If you're, say, a store that decides to stockpile goods so that you can handle some sort of supply disruption, you're locking up capital and decreasing returns when there isn't a crisis; and if you then try to raise prices when such a disruption happens, you'll be a pariah and may be fined or put in jail, depending on the state. I'd argue that if it were legal to charge more when there's limited supply of something, it might make economic sense to keep some excess capacity lying around, because you could make up the profits you miss during quiet times with the money you make during crisis times. But our monkey brains object.
Quote: "Not sure why their list of future fixes doesn't include the ever-popular 'test all changes in a staging environment first'."
They very well may have. However, if the dataset in the stage database is smaller than prod, then a doubling of the file size in stage may have kept it under the 200 limit, so it may have passed fine.
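Purely illustrative numbers (only the 200-row cap comes from the write-up; the staging and production counts are invented):

staging_rows, production_rows, cap = 60, 120, 200
print(staging_rows * 2 <= cap)     # True  - staging: 120 rows after the change, still under the cap
print(production_rows * 2 <= cap)  # False - production: 240 rows, over the cap and into the failure path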
Quote: "...suggests it will, but probably not from the same variety of malignancy they treated."
...Duh? How many mistakes have you made at work in the last year?
Quote: "Part of the problem is that we as humans hate what we perceive as price gouging, so if you're, say a store that decides to stockpile goods so that you can handle some sort of supply disruption [...]"
This doesn't work. If events requiring the stockpile are common, everybody will keep inventory and you won't have an advantage. If they're not, you will most likely end up holding inventory for years, and an elevated price during a shortage won't make up for the costs. The only winners are the ones that had the random luck of having inventory that they could sell far over their costs.
Quote: "I never got to use SQL in anger."
Pro tip: never, ever program in anger. Even if the job requires you to use SQL in MS Access.
I always wanted to type COMMIT; and then hit Enter with a dramatic flourish.
Quote: "This sounds a bit similar to how CrowdStrike BSOD'd thousands of Windows machines, except Cloudflare did it to their own machines. Ooops. One bad file really shouldn't be able to take down entire systems, but maybe that's just me."
The root of most problems in the world is just one word: assumptions. The Cloudflare outage was caused by an assumption that a situation would not happen. They were proven wrong.
Quote: "This is causing us huge headaches. We run a platform hosted on AWS, and this incident took down all our clients' access. Now we have clients with multimillion-dollar-per-year contracts complaining about our platform's resilience."
No, you don't.
Quote: "Not sure why their list of future fixes doesn't include the ever-popular 'test all changes in a staging environment first'."
To test a problem, you have to imagine the failure mode. Production will always turn up something that you didn't explicitly test for. The number of possible "bad" files far exceeds your ability to test them all. A staging environment can never include everything that is in production. I am sure 100% of the tests returned fewer than 200 features.
"And to test this we'll just pull out our copy of the Internet and run a copy of our entire infrastructure on it..."Not sure why their list of future fixes doesn't include the ever-popular "test all changes in a staging environment first".
Seems someone left out a test similar to:

def accept_feature_file(new_file, old_file, row_count, limit=200):
    if row_count >= limit:
        print(f"error: feature file has {row_count} rows (limit is {limit})")
        # don't propagate the new file; keep using the old one until a human fixes the problem
        return old_file
    return new_file
Mad Klingon said: Hardcoding your configuration causes problems.
There's a reason Microsoft couldn't make a Windows 9.
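The commonly told (and never officially confirmed) story is that too much legacy third-party code detected Windows 95/98 with a version-string prefix check that a hypothetical "Windows 9" would also have matched. Roughly, and purely as illustration:

def is_win9x(os_name):
    # the prefix check said to lurk in old code; "Windows 9" would match it too
    return os_name.startswith("Windows 9")

print(is_win9x("Windows 98"))  # True
print(is_win9x("Windows 9"))   # also True - hence, allegedly, the jump straight to Windows 10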
Quote: "What I find interesting is not the cause - unexpected interactions happen, bugs (or unplanned consequences that aren't strictly a bug) - and they did an RCA and handled it appropriately from what I can understand. Not a happy event, but professionally dealt with.
It's how low the numbers involved are. More than 200 is double the amount of features? So they're looking at around less than 200, probably closer to 100 (otherwise they'd be more twitchy about that limit in normal operation). I putter around with a toy detector just to keep my hand in (and swear at how utterly stupid bot writers are at making it obvious sometimes), and it tracks maybe a couple of dozen features I play with manually - no automation or intelligence in creating those features, it's all me fiddling. And it turns out that even with some much more clever people and automation doing the work, they still only need a low number of request features to pick up botnets more accurately?
I think that goes to show that bot writers really are utterly lazy if they're still detectable with such a low number. That's not to denigrate the work CloudFlare (and other companies) do in detecting them; the fact that they can do it with relatively little shows they're doing good work.
Edit: And they are clever people, btw - those features have to be calculated fast, in real time, so as not to impact request performance for good requests, and legitimate clients of all sorts exist that are in many cases just as stupid as botnets. You can't just assume that what would normally be "illegal" requests are actually bad - I've seen clients do things that are outright illegal under a strict reading of the HTTP protocols but work because the backend server knows about it and is expecting it, so a proxy fronting onto it just has to let it through.
Edit 2: My nightmare fuel for this? Online banking. You want absolute peak paranoia. Lots of the client apps are shoddy, farmed out to whatever third party was the cheapest contractor. If you're lucky it's just a thin wrapper around their web site. And they're coming from all sorts of crappy phones that may have been released in the last decade (or more!) and never had a single security update. Trying to work out "is this an attack, or did everyone just get paid today?" is a complete pain.
Edit 3: As an example of the weird shit clients do - the Bitwarden password manager uses a web socket for notifications if you're not on a mobile client (which uses normal phone notifications); it's used to notify of changes so all clients keep in sync. No problem. Web sockets are a pain but well understood. Open a request, do an Upgrade, etc. What request does Bitwarden open? CONNECT. CONNECT is not typically a client request to a server, it's a client request to a proxy - it's (ironically) used to open a socket, typically for TLS: CONNECT host:port HTTP/x. Bitwarden uses CONNECT /path HTTP/x. Good effort. So if something is twitchy about people trying to sneak proxy requests through instead of legitimate requests, Bitwarden will be setting off alarm bells all over the place..."
To be fair re: the number of features, they are almost guaranteed to be artificially compressed features, assembled from a much larger amount of data about bot behavior. The number of features only very mildly implies that the underlying behavior is not complex. Also, you have to remember that the space of interactions between features has the curse of dimensionality going on: the number of two-way relationships alone scales with the square of the feature count, and higher-order interactions blow up combinatorially from there.
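To put rough numbers on that interaction point (the ~200-feature cap discussed above is just being used as the illustrative size):

from math import comb
features = 200
print(comb(features, 2))  # 19,900 possible pairwise feature interactions
print(comb(features, 3))  # 1,313,400 three-way interactions - it blows up quickly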
Not sure why their list of future fixes doesn't include the ever-popular "test all changes in a staging environment first".
Quote: "You know Cloudfront is able to do most of what Cloudflare does, right? Are there any special Cloudflare features you rely on specifically? (I only mention it because you're already all in on AWS - it's not an endorsement of Cloudfront over any other CDN-type product, btw.) Edit: 'Not cost effective/worthwhile changing' is a perfectly fine answer; I understand the world is more complicated than 'just change!' offers."
Free DDoS and WAF protection. Cloudflare does a lot for people. Even my itty-bitty homelab is protected by Cloudflare's DDoS protection, at no charge to me.
Quote: "I'm sorry, but in this case, that's not true. Limits/boundary testing is function test 101. If you tell me something has a limit, I'm going to test just below the limit, at the limit, and just above the limit. And possibly way over the limit if there's time/resources."
Reminds me of a joke I once heard:
A developer is testing their new beer bar. They order 1 beer, 10 beers, 9999999 beers, -1 beers, 0 beers; everything seems to be working as intended.
First real customer walks into the bar, asks where the bathrooms are. The bar explodes.

Quote: "Reminds me of a joke I once heard: [...]"
The restroom wasn't in the spec. Not my problem.
A review of the code could (and arguably should) have turned up this failure mode to add to the test cases - but that is easy to say in hindsight. A programmer might have reviewed the code and thought it was normal and without consequence, as there would "never" be more than 200 features. Or maybe the reviewer was just tired the night of the review (code reviews can be tedious and boring) and failed to see that it could cause problems.
Quote: "Not sure why their list of future fixes doesn't include the ever-popular 'test all changes in a staging environment first'."
I could see a world in which the test environment gave a green light to the change. From the description, there were multiple places where limitations of the testing environment could have prevented this edge case from coming up. By its nature, the test environment for Cloudflare cannot be a 1:1 replica of the scale and complexity of production.
Quote: "I get your point - the Internet is more than just the web and, even then, only a subset was affected - except that to 99% of the population, it's not. Everything they do is 'the web' - unless it's on an app, which half the time is just a wrapper for a web page anyway."
It's more than this: the Internet was never designed to be resilient in the face of such a problem in the first place. Putting aside the whole nuclear-attack myth (it might have been the impetus, but it was never a design goal of what we actually got), a non-responsive hop can be routed around, but once you reach your destination's doorstep, you're forced to rely on any redundancy they have rather than that of the network at large. Cloudflare is an intermediary, but for all intents and purposes they do function as the host for a lot of the public-facing Internet, and if they go down, that's their failure, not the Internet's.
Quote: "The restroom wasn't in the spec. Not my problem."
Then the bar should return an error code.
Is it bad that so many eggs are in one basket? Yes, of course. Three years back half the Internet went down across Canada because Rogers Communications suffered a complete outage. This also affected large amounts of cellular phone service, the Interac payment network, and lots of other things you wouldn't expect. In the wake of this there was a push for more redundancy in the technology services sector, but I would be surprised if another such outage isn't just as disruptive, because redundancy is expensive even when it's not very complex. Lots of money, time, and effort can be saved by crossing your fingers and hoping nothing goes wrong rather than actually preparing for the worst.
The problem is less that Cloudflare is too big than that hosts don't have a plan B.
The failure mode is something that could happen to anyone. I have myself pushed an update that was supposed to vastly improve error handling by including a detailed JSON response, and it promptly took out an outdated system a few thousand builds behind that checked for errors with response.startswith("Error") instead of the response code. Shit happens.
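The brittle pattern versus checking the status code (URL and library purely illustrative; any HTTP client shows the same contrast):

import requests

resp = requests.get("https://example.com/api/job")  # placeholder endpoint

# brittle: breaks the moment the error body changes shape (e.g. becomes detailed JSON)
failed_brittle = resp.text.startswith("Error")

# robust: the status code is the contract, independent of the body format
failed_robust = not resp.ok  # True for any 4xx/5xx response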
What only people with an overeager infrastructure team have is a globally distributed Clickhouse database for a table with an expected 60 rows that is accessed by a bunch of servers every five minutes. Also, the instability from running an eventually-consistent globally replicated database. Imagine how boring debugging this would have been if those 60 lines were in cloudflare.com/bots-first-wave.txt and cloudflare.com/bots-full-rollout.txt. Click publish on the first wave. Oops, some backends start failing. Roll back change instead of going for the full rollout.
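A sketch of that publish-and-watch flow (the two file names come from the comment above; the publish/rollback/health hooks are hypothetical):

def staged_rollout(publish, rollback, healthy,
                   waves=("bots-first-wave.txt", "bots-full-rollout.txt")):
    # push one wave at a time; stop and roll back as soon as backends start failing
    for wave in waves:
        publish(wave)
        if not healthy():
            rollback()  # roll back the change instead of going for the full rollout
            return False
    return True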
Quote: "They have to evaluate risks versus rewards, same as in every other situation. Cloudflare (and everyone else) never promise 100% availability. A perfectly valid and justifiable plan B can be 'suck it up, princess', and for a large number of their customers that was perfectly fine. A few hours of outage? NBD."
Sorry, I didn't intend to cast aspersions. For the vast majority of hosts, the cost-to-benefit ratio of guarding against a rare failure like this is indeed hard to justify. Still, the fact remains there's no way around it: you either have redundancy, or you rely on your supplier having perfect uptime. There are few organizations I'd fault for choosing the latter, though.