
What Fastly's Downtime Tells Us About Software and Production Deployment


A few days ago, Fastly suffered an outage that lasted for several hours. This is a significant event, because a lot of major sites are served through Fastly's CDN. Here's a short list of sites that rely on it:

  • Amazon
  • Reddit
  • Spotify
  • eBay
  • Twitch
  • Pinterest
  • Stack Overflow
  • Freelancer.com

It's no wonder so many people complained. And solving the problem may have been even harder, judging by what @kartykx had to say on Twitter:

[Screenshot: tweet from @kartykx]

It's kind of like playing a new game without a guide. 

But jokes aside, what are the implications of Fastly's temporary downtime in the bigger software engineering picture? What does it mean for production code? What can we learn from it?

According to Fastly themselves, someone "made a configuration change" that triggered the bug, which in turn took down most of their network.

[Screenshot: Fastly's statement on the outage]

Probably for security reasons, Fastly didn't go far into the technical specifics of what happened, but the statement gives us enough of an idea of the chain of events. Hopefully Fastly publishes a full post-mortem. In any case: someone changed a configuration value, that change triggered an error, and the error rippled out to most of the sites Fastly serves.
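
Fastly hasn't said what the change actually was, so purely as a hypothetical sketch, here is the kind of validate-before-apply guard that can stop a single bad value from propagating across a fleet. Every field name and rule below is invented for illustration:

```python
from dataclasses import dataclass

# Hypothetical edge configuration. The fields and limits are invented;
# Fastly hasn't disclosed the actual setting that triggered the bug.
@dataclass
class EdgeConfig:
    cache_ttl_seconds: int
    max_object_bytes: int
    origin_host: str

def validate(config: EdgeConfig) -> list[str]:
    """Return a list of problems; an empty list means the change is safe to roll out."""
    errors = []
    if config.cache_ttl_seconds < 0:
        errors.append('cache_ttl_seconds must be non-negative')
    if config.max_object_bytes <= 0:
        errors.append('max_object_bytes must be positive')
    if not config.origin_host:
        errors.append('origin_host must not be empty')
    return errors

# Reject a bad change before it ever reaches the edge fleet.
proposed = EdgeConfig(cache_ttl_seconds=-1,
                      max_object_bytes=1 << 20,
                      origin_host='origin.example.com')
problems = validate(proposed)
if problems:
    print('rejecting config change:', '; '.join(problems))
```

Validation alone may not have prevented this particular outage, since the change appears to have triggered a latent bug rather than being obviously malformed, but it narrows the space of changes that can go wrong.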

What can be learned from this?

Insights


  • Unfortunately, edge cases are just a way of life for codebases in production.

    This isn't even the first time something like this has happened. Back in July 2019, Cloudflare suffered a similar outage. The culprit? A poorly written regex in a WAF rule that caused catastrophic backtracking. Say what you want about the poorly written regular expression, but nobody expected it to stall the network as badly as it did, considering it was written merely as a quick XSS check. One can make the case that programmers underestimate how expensive greedy quantifiers can be to evaluate, but it's nearly impossible to foresee every pathological input ahead of time. (The sketch after this list demonstrates the failure mode.)

  • What matters is that the problem was quickly recognized and solved.

    For a CDN as large as Fastly, it's actually impressive that they resolved the problem as quickly as they did, even though the culprit was as innocuous as a configuration change tripping an edge case. As mentioned above, accidents are unavoidable; what matters is how the company responds to them.

  • Are some CDNs "too big to fail"?

    This is not about monopolies or antitrust per se, but about how one service going down can leave a large part of the internet non-functional, which is what makes it "too big to fail". It should make us think about how centralized some of these services are, when an outage at a single CDN can take large swaths of the internet with it. It may not seem significant to most people, but CDNs are akin to electric companies, except that they deliver websites. When the vast majority of the internet's bandwidth is handled by a few companies, downtime has to be kept as minimal as possible, both in frequency and in impact. The same concept applies to the CDN's cousin, the cloud services industry, where only a handful of companies run most of the world's server farms.
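
To make the regex point from the first insight concrete, here's a minimal demonstration of catastrophic backtracking. The pattern below is not Cloudflare's actual WAF rule; it's a textbook nested-quantifier example that exhibits the same failure mode:

```python
import re
import time

# A textbook catastrophically backtracking pattern: nested greedy
# quantifiers. This is NOT Cloudflare's actual WAF rule, just a minimal
# pattern with the same failure mode.
evil = re.compile(r'^(a+)+b$')

for n in (16, 20, 24):
    subject = 'a' * n  # no trailing 'b', so the match is doomed to fail
    start = time.perf_counter()
    evil.match(subject)  # the engine tries every way to split up the a's
    elapsed = time.perf_counter() - start
    print(f'n={n}: {elapsed:.3f}s')  # runtime roughly doubles per extra 'a'
```

Each additional character roughly doubles the work, which is exactly how a "quick XSS check" can eat every CPU cycle on an edge fleet. Linear-time engines such as RE2 avoid this class of blow-up by construction.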
