The top news out of AWS this week is the AWS outage of December 7, 2021. What happened and why?
We’ll discuss in this post, starting with a high-level description of what happened from the perspective of us AWS users. Then, we’ll walk through some of the behind-the-scenes goings-on that AWS has shared. And finally, we’ll leave you with some (hopefully) valuable takeaways.
What happened with the AWS outage?
On the morning of December 7, 2021, at 10:30 AM Eastern Time / 7:30 AM Pacific Time, things went wrong in Amazon’s “us-east-1” region: Northern Virginia.
Over the next three minutes — which is pretty much all of a sudden, from our external point of view — a number of AWS services in the region started having issues, including but not limited to:
- Management console
- Route 53
- API Gateway
- EventBridge (what used to be called CloudWatch Events)
- Amazon Connect
Now, to be clear, the issue was not a complete outage for all of these services. For example, if you already had an EC2 instance running when the problem started, it would likely keep running just fine throughout the entire event.
However, what that running instance could do might well have been impacted. For example, an EC2 instance would have had trouble connecting through the no-longer-working VPC Endpoints to the still-working S3 and DynamoDB.
Furthermore, not only did the issue affect all availability zones in us-east-1, but it also broke a number of global services that happen to be homed in this region. This included AWS Account root logins, Single Sign-On (SSO), and the Security Token Service (STS).
The overall impact was broad, with the issue causing varying degrees of problems for services like Netflix, Disney Plus, Roomba, Ticketmaster, and the Wall Street Journal. It also affected many Amazon services, including Prime Music, Ring doorbells, logistics apps in their fulfillment centers, and some parts of the Amazon.com shopping site, which would instead show pictures of cute dogs.
It was a big deal. So, of course, folks took to the internet to discuss.
One Reddit user, a “ZeldaFanBoi1988”, wrote:
“Since I can’t get any work done, I decided to relax and order in some pizza. Then I tried ordering online from the Jet’s Pizza site. 500 errors. lol. looked at network request headers. Its AWS…..”
But no one seemed to be too upset at the companies impacted by the outage. In fact, as an example, several people instead took the opportunity to share how much they love Jet’s!
(Editor’s note: As a matter of fact-checking, I can confirm that Jet’s is dang tasty.)
And ZeldaFanBoi1988 did get their pizza, anyway, reporting back: “I ordered on the phone like a peasant. AWS is really ruining my day.”
Status Dashboard and Support Tickets
But there were some other nasty problems, too — even nastier than ZeldaFanBoi1988 having to order pizza over the phone like it’s 2008.
First, despite all the issues, the AWS Status dashboard continued for far too long to show all green lights for all services.
And second, it was no longer possible to log support tickets with Amazon because their Support Contact Center was broken, too! This breakdown in customer communication made a lot of people pretty upset.
It took almost an hour for the status dashboard to start reporting any issues, and support tickets stayed broken all the way until the underlying issues had been addressed and services were coming back online.
Now, speaking of “underlying issues,” let’s rewind to the beginning and take a look at those.
AWS outage causes: Internal issues
Internal Network Congestion
On December 7, 2021, at 10:30 AM Eastern Time / 7:30 AM Pacific Time, an automated system in Amazon’s “us-east-1” region (Northern Virginia) tried to scale up an internal service running on AWS’s private internal network, the one they use to control and monitor all of their Amazon Web Services.
As AWS describes it, this “triggered an unexpected behavior from a large number of clients inside the internal network”.
Basically, AWS unintentionally triggered a Distributed Denial of Service (DDoS) attack on their own internal network. Yikes.
As an analogy, it was as if every single person who lives in a particular city got into their car and drove downtown at the same time. Instant gridlock. Nothing moving. Not ambulances. Not news reporters. Not even traffic cops who could try to resolve the issue.
Now, we do know how we should avoid network congestion problems like this: we use exponential backoff and jitter. Unfortunately, this requires each client to do the right thing, and, as AWS writes in their report: “a latent issue prevented these clients from adequately backing off during this event.”
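For anyone who hasn’t run into the pattern before, here is a minimal sketch of exponential backoff with “full jitter” in Python. To be clear, this is not AWS’s internal client code (which isn’t public); the names `backoff_delay` and `call_with_retries`, and the default values for `base`, `cap`, and `max_attempts`, are all made up for illustration.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0):
    """Return a "full jitter" delay in seconds for a zero-based attempt number.

    The ceiling doubles with each attempt (exponential backoff) up to `cap`,
    and the actual wait is drawn at random below that ceiling (jitter) so
    that many clients do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=5, base=0.1, cap=10.0, sleep=time.sleep):
    """Call operation(), retrying failures with jittered exponential backoff.

    `operation` is any zero-argument callable (hypothetical here). A real
    client would catch only errors that are known to be retryable.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(backoff_delay(attempt, base, cap))
```

You would wrap a flaky call as something like `call_with_retries(lambda: fetch_status())`, where `fetch_status` stands in for whatever request you are making. The point of the randomness is that a thousand clients failing at the same moment spread their retries out instead of coming back as one synchronized wave.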
So, the AWS folks were sort of flying blind because their internal monitoring had been taken out by the flood. They looked at logs and figured that maybe it was DNS. It’s always DNS, right? (There’s even that haiku about it.)
Well, two hours after the problems started, they had managed to fully recover internal DNS resolution. And although this reportedly did help, it did not solve everything. So, quite surprisingly, it was not DNS this time.
AWS outage (full) resolution
For the next three hours after that, the AWS engineers worked frantically, trying everything. Or, as AWS puts it, “Operators continued working on a set of remediation actions to reduce congestion on the internal network including identifying the top sources of traffic to isolate to dedicated network devices, disabling some heavy network traffic services, and bringing additional networking capacity online.”
Then, at 12:35 PM Pacific time (3:35 PM Eastern), AWS operators disabled event delivery for EventBridge (CloudWatch Events) to reduce the load on the affected network devices. And whether this was the linchpin or just one of the drops in the bucket, things finally did start getting better. AWS reports that internal network congestion was “improving” by 1:15 PM, “significantly improved” by 1:34 PM, and “all network devices fully recovered by 2:22 PM Pacific Standard Time.”
And although that cleared the network flood and restored their Support Contact Center, it still took some more time for all the Amazon Web Services to come back online. API Gateway, Fargate, and EventBridge were among the slowest to fully stabilize, needing until at least 6:40 PM Pacific, or 9:40 PM Eastern. What a day, huh?
You can read the AWS summary of the outage event here.
Lessons from the AWS outage
What can AWS learn from the outage?
Okay. So AWS has called out some things they’ve learned from this event.
One key thing is that they need to do a better job of communicating with customers during operational issues and not let those systems go down at the same time. They are planning some major upgrades here, but we’ll have to wait and see how that all goes. Of course, they’re also working to fix the backoff bug, plus some additional network-level mitigation to try to prevent another storm. They concluded their report with, “We will do everything we can to learn from this event and use it to improve our availability even further.”
What can we learn from the outage?
But what about us then? What can we learn?
During the event, there were lots of responses, ranging from the “throw the baby out with the bathwater”-type “NO CLOUD FOR YOU!” to the naïve “multi-cloud solves everything!”
Of course, lots more were more moderate: “Multi-region, at least.” But don’t overreact, because knee-jerk architecture change is, by definition, ill-considered.
Consider the SRE book, Site Reliability Engineering. It’s all about how to keep important systems running. And I want to share with you a quote from chapter three, Embracing Risk:
“. . . past a certain point, however, increasing reliability is worse for a service (and its users) rather than better! Extreme reliability comes at a cost . . . an incremental improvement in reliability may cost 100 times more than the previous increment.”
To put this into perspective, what you might previously have accomplished as a single individual in a month may then take a team of 10 people a whole year, instead. Does that sound like a good tradeoff? And just imagine the ongoing costs to operate and maintain a system that is so much more complex! Yikes.
The book goes on to discuss some of those costs, but it all boils down to the need to make tradeoffs. And I think this quote summarizes the most important takeaway:
“. . . rather than simply maximizing uptime, Site Reliability Engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users’ overall happiness—with features, service, and performance—is optimized.”
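To make that tradeoff a little more concrete, here is a small back-of-the-envelope sketch of how much downtime a given availability target actually allows in a year. The function name `allowed_downtime_minutes` and the list of targets are just illustrative assumptions; they are not figures taken from the book or from AWS.

```python
def allowed_downtime_minutes(availability_percent, days=365):
    """Minutes of downtime permitted over `days` at a given availability target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative targets only; pick the ones that matter to your users.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% allows about {minutes:.1f} minutes of downtime per year")
```

Each extra nine cuts the allowance by a factor of ten: 99.9% leaves roughly 526 minutes a year, while 99.99% leaves roughly 53. Whether the engineering effort needed to live inside the smaller number is worth it depends on what your users actually need.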
Everything fails all the time
We’ve learned that agility is the name of the game in IT. Figure out what actually matters most to your users.
And, as Werner Vogels — the CTO of Amazon — is famous for saying: “Everything fails, all the time.”
Now, it definitely is possible for us to come up with strategies and architectures that avoid us being impacted by a repeat of this particular problem. But if this were a simple thing to do in advance (whether through multi-region, or whatever), then Amazon would already have done that for things like their AWS status dashboard.
But as AWS pointed out, “networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region.”
Yep; they actually did have a multi-region setup, but their failover mechanism failed. Like losing the key to your doomsday bunker and winding up locked out. Prepared in theory, but not in practice. Those are seriously smart engineers they have, but they’re also still human.
In practice, it gets complicated — especially because you don’t know how things will fail.
When we used to have to build and manage everything ourselves on simple instances, failures were a bit more predictable: instances would die or become unavailable. But when we take advantage of managed services, then we wind up with rather different kinds of failures.
Now, to be clear, it’s foolish to ignore managed services just because they could possibly fail, sometimes. That would be like deciding to only walk and swim your way around the world because you’ve heard that some planes have crashed and some boats have sunk. It’s impossible to be agile and not build upon the work of others.
Should you stop using and building on AWS? No!
OK. Finally, let’s say you ask me how much this event has impacted my willingness to use and build on AWS — to rely on them. I’ll answer you, “Not at all.”
That’s not to say that I like outages like this, nor that I’ll ignore the possibility of their happening again. But much like how I still confidently travel by air and trust the pilots more than I trust myself to fly those planes, I am still way better off with AWS (the entire package they offer, faults and all) than I am on my own. And I’m going to say with a pretty high degree of confidence that you are, too.
So don’t overreact, and don’t underreact, either.
Incorporate what you’ve learned as data points alongside all the others. Move past feelings and knee-jerk reactions to make rational decisions. And when, down the road, you recognize that still not every decision you’ve made was perfect, then apply this same blameless postmortem technique to learn from that situation and do better going forward. That’s really all any of us can expect of ourselves, I think.
So take care of those around you, embrace #HugOps, and keep being awesome, cloud gurus!