Top 10 outages of 2021


The biggest outages of 2021 had one thing in common: they affected major infrastructure or service providers and, as a result, affected large numbers of enterprises and end users. The lesson? Companies need to be careful about putting all their infrastructure eggs in one basket, or, if they must, to prepare for downtime if that particular service goes down.

“There needs to be a plan in place,” says Angelique Medina, head of product marketing at ThousandEyes, a Cisco-owned network intelligence company that tracks internet and cloud traffic. “Organizations don’t need to be at the mercy of the availability of any one particular service.”

Two of last year’s biggest outages involved cloud providers AWS and Azure. Two involved Internet service providers Verizon and Comcast. Four involved CDN and DNS providers Akamai, Cloudflare, and Fastly. And rounding out ThousandEyes’ list of the top 10 outages of 2021 are two Facebook outages.

The Facebook outages didn’t just take down the social media network and other company services like Instagram and WhatsApp. Many enterprises use Facebook to authenticate users. When that service went down, users were no longer able to log into those enterprises’ websites.

“Authentication, like DNS, is often overlooked when people think about availability,” Medina tells Network World.

Another overlooked networking issue that showed up in the top outages this year is BGP routing. BGP – which stands for Border Gateway Protocol – tells Internet traffic what route to take. Even if the DNS listings point to the right destination, incorrect routing information can divert traffic to a dead-end route or to a route that doesn’t have enough capacity to handle it.

“BGP hijacking can be really, really scary,” says Medina. “It can be a very challenging thing to control for and can have very damaging effects.”
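To make the routing risk concrete, here is a minimal Python sketch of longest-prefix matching, the rule routers use to choose among overlapping routes. The prefixes come from address ranges reserved for documentation and the next-hop labels are made up for illustration. Real BGP route selection weighs many more attributes than this, so treat it as a simplified model rather than a description of any provider's network.

```python
import ipaddress

def pick_route(destination, routes):
    """Return the next hop of the most specific route covering destination.

    routes maps a CIDR prefix string to a next-hop label (hypothetical names).
    """
    addr = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None  # no route at all: the traffic has nowhere to go
    # The longest prefix (largest prefixlen) wins.
    _, best_hop = max(matches, key=lambda match: match[0].prefixlen)
    return best_hop

# A legitimate announcement for an illustrative prefix.
routes = {"203.0.113.0/24": "legitimate-origin"}
print(pick_route("203.0.113.10", routes))

# A bogus, more specific announcement captures part of that address space.
routes["203.0.113.0/25"] = "wrong-origin"
print(pick_route("203.0.113.10", routes))
```

In this sketch the DNS answer never changes, yet the second lookup sends traffic to a different place, which is why a bad route can cause trouble even when DNS is correct.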

Here’s the list of the top outages:

1. Facebook: Oct. 4

The biggest outage of 2021 was October’s Facebook outage. “It was a hard down for about seven hours,” says Medina. “Seven hours is pretty significant.”

The outage affected all of Facebook’s services, including Instagram, WhatsApp and Oculus, as well as all the enterprises that use Facebook’s authentication mechanism.

A routine maintenance job went wrong, and both system servers and BGP routes were affected. Worse yet, not only did Facebook’s public-facing services go down, but also the tools that the employees use to manage those services. As a result, staffers had to physically enter the data centers to manually restart systems.

According to Facebook’s VP of infrastructure Santosh Janardhan, a command was accidentally issued that took down all the connections in its backbone network, disconnecting all of Facebook’s data centers.

“Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command,” Janardhan said in a report released the next day.

That mistake took down systems that respond to DNS queries. Since Facebook’s DNS servers could no longer connect to the data centers, they automatically disabled the related BGP advertisements, and so those DNS servers became unreachable even though they themselves were still up and running.

“All of this happened very fast. And as our engineers worked to figure out what was happening and why, they faced two large obstacles,” Janardhan wrote. “First, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.”

The lesson here, according to Medina, is that companies need to keep their management systems well siloed from their production environment.

“And also consider diversifying in terms of who’s providing your services,” she adds. “Not necessarily just relying on your own internal services, but potentially considering external providers – or multiple external providers.”

2. AWS: Dec. 7

AWS is the largest cloud-computing service provider in the world, and when its services go down, millions of enterprises can be affected.

On December 7, an outage that lasted for over an hour affected Amazon’s own services, as well as consumer devices like Roomba and Ring and streaming services like Disney+ and Netflix, because of problems with AWS EC2 APIs in the US-EAST-1 region.

The outage highlighted the need for enterprises to monitor the health of all the APIs that are part of their applications and contribute to service delivery, customer experience, and the company’s ability to build and deploy, says Chris Villemez, senior technical marketing engineer at San Francisco-based ThousandEyes.
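As a rough illustration of that kind of independent API monitoring, the Python sketch below probes a set of endpoints and reports which ones fail. The endpoint names and URLs are hypothetical placeholders, and the probe is passed in as a parameter so the example can run without network access. A production monitor would also track latency, retries, and alerting.

```python
import urllib.error
import urllib.request

def http_probe(url, timeout=5):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts and bad URLs.
        return False

def check_dependencies(endpoints, probe=http_probe):
    """Probe every endpoint and return the names of those that failed."""
    return [name for name, url in endpoints.items() if not probe(url)]

# Hypothetical endpoints; replace with the APIs an application depends on.
endpoints = {
    "payments-api": "https://payments.example.com/health",
    "auth-api": "https://auth.example.com/health",
}

def fake_probe(url):
    """Stand-in probe that pretends the auth API is down."""
    return "auth" not in url

print(check_dependencies(endpoints, probe=fake_probe))
```

Dropping the `probe` argument makes the function use real HTTP requests, which is the point of running such checks from outside the provider's own status page.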

Compounding the problem, enterprise customers didn’t see any information on their AWS status page for more than an hour.

“It’s never a good idea to rely solely on a provider for information,” says Medina. “Having more of an independent view will give you more insight in real time.”

3. Fastly: June 8

Fastly is one of the smaller content delivery networks. According to Enlyft, the company has a 4% market share, compared to 39% for Cloudflare and 24% for Amazon CloudFront.

Still, more than 100,000 companies use its services, including Reddit and the New York Times. Even Amazon and eBay use some of Fastly’s services and were affected by Fastly’s June outage.

But customers had greatly varying experiences of the outage, depending on the degree to which they relied on Fastly services and how they reacted to the outage.

For example, Reddit went down completely and stayed down for the whole duration of the outage, which lasted nearly an hour, according to a report by ThousandEyes.

But the New York Times was able to reduce downtime by sending users directly to its site servers, which were hosted in the Google Cloud Platform. It still took time to make the fix, and time for the updated DNS records to propagate.

“Depending on how long-lived your DNS records are, that can influence how quickly you’re able to help your users,” says Medina.
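The arithmetic behind that point is simple enough to sketch in Python. The function below assumes resolvers honor the record's time to live (TTL) and that some resolver may have cached the old record just before the fix was published. The numbers are illustrative only and are not figures from the Fastly incident.

```python
def worst_case_recovery_seconds(time_to_update_dns, record_ttl):
    """Upper bound on how long some users keep reaching the old address.

    Assumes resolvers honor the TTL and that a resolver may have cached
    the old record just before the update was published.
    """
    return time_to_update_dns + record_ttl

# Illustrative numbers only: ten minutes to make the change, three TTLs.
for ttl in (60, 300, 3600):
    total = worst_case_recovery_seconds(time_to_update_dns=600, record_ttl=ttl)
    print(f"TTL {ttl:>4}s: up to {total / 60:.0f} minutes for all users to recover")
```

Under these assumptions, a shorter TTL lets a DNS-based fix reach users faster, while a longer one keeps some of them pointed at the failed service.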

Amazon uses multiple content distribution networks, including its own CloudFront CDN and Akamai. When Fastly went down, it was able to reroute requests to other CDNs, significantly reducing the impact of the outage.

Similarly, eBay used Fastly for only some content, specifically individual objects on web pages. The company used Akamai to deliver the web pages themselves. Over the course of the outage, eBay was able to redirect requests away from Fastly and was eventually able to reduce the impact of the outage even further.
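The multi-CDN approach described above boils down to a simple idea: try providers in order of preference and skip any that are failing. The Python sketch below shows that idea only. The provider labels and the health data are hypothetical, and nothing here is meant to describe how Amazon, eBay, or the New York Times actually implemented their rerouting.

```python
def choose_cdn(cdns_in_priority_order, is_healthy):
    """Return the first CDN that passes the health check, or None."""
    for cdn in cdns_in_priority_order:
        if is_healthy(cdn):
            return cdn
    return None

# Hypothetical health data, for example from an independent monitor.
status = {"cdn-a": False, "cdn-b": True}

# Fall back to the origin servers when no CDN is healthy.
target = choose_cdn(["cdn-a", "cdn-b"], lambda cdn: status.get(cdn, False)) or "origin"
print(target)
```

The fallback to the origin mirrors the option of sending users directly to site servers, at the cost of losing the CDN's caching while it is in effect.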

4. Akamai Edge DNS: July 22

Akamai is a global content delivery network, similar to Fastly in number of users and market share. And, as with the Fastly outage, companies that used multiple CDNs saw less impact from the outage.

In this particular outage, the Akamai DNS service, which directs users to Akamai’s CDN, went down for over an hour. According to the company, a software configuration update triggered a bug in its Secure Edge Content Delivery Network, impacting that network’s domain name service system.

Many major websites were affected, including Steam, American Airlines, Fox News, and HSBC. Amazon, which uses multiple CDNs, was able to reroute traffic and spared users any impact.

5. Akamai Prolexic Routed: June 16

The July outage wasn’t the only major outage for Akamai last year. In June, the company saw a breakdown of its DDoS mitigation service, Prolexic Routed, because of an issue with BGP routing.

Some customer websites were unreachable for varying amounts of time, according to ThousandEyes. But by quickly removing routes, Prolexic minimized the impact to its customers, and customers were free to reinstate BGP announcements through other providers to route around the issue. “Once that action was taken, customers who quickly restored connectivity to their sites were the ones who had redundant processes already in place.”

“Organizations don’t have to be at the mercy of the availability of any one particular service,” says Medina.

According to Akamai, about 500 customers were using this DDoS mitigation service. Many were rerouted automatically, restoring operations within minutes. Most of the rest were manually rerouted soon afterwards. The outage occurred because a routing table value was accidentally exceeded.

6. Verizon: Jan. 26

Verizon’s outage was the first major outage of 2021 and hit users from Washington, D.C. to Boston. “A lot of folks may not recall this, but it was pretty significant,” says Medina.

Tens of thousands of customers, including companies and employees working from home, were left without service when Verizon’s Fios network went down.

According to Verizon, the disruption was due to a “software issue” triggered during routine network management activities, and was unrelated to a cut fiber line in Brooklyn, which happened at the same time.

7. Comcast: Nov. 9

Another major outage of an Internet service provider occurred in November, when Comcast’s network backbone in the San Francisco area went down for nearly two hours. It was followed by a more widespread outage that lasted for over an hour across multiple cities in the U.S., including Chicago and Philadelphia, and stretched into New Jersey and South Carolina.

Tens of thousands of home and business users of Comcast’s Xfinity network were affected. “There was clearly some internal routing issue,” says Medina.

With both the Comcast and Verizon outages, the lesson is that companies need to have backup connectivity plans, not just for their own services but also for their employees and other key users.

8. Cloudflare Magic Transit: May 3

The May Cloudflare outage is another example of an outage in a service that is specifically designed to protect companies against outages. Like Akamai’s Prolexic, Cloudflare’s Magic Transit service is meant to help protect customers against DDoS attacks by routing traffic through its network, inspecting it, scrubbing it, and sending it on to where it’s supposed to go.

This particular outage affected Cloudflare infrastructure around the globe, with issues occurring at varying levels for about two hours.

“There were certain customers who were very rapidly able to respond,” says Medina. These customers caught that there was a problem with the BGP routing and quickly advertised new routes, she says. “Having early awareness of what’s going on and also redundancy – even if the outage is ongoing – reduces impact on you.”

9. Azure AD: Dec. 15

The most recent of 2021’s major outages was December’s Active Directory outage. Azure’s AD service went down for one and a half hours in mid-December, preventing users from signing into Microsoft services like Office 365.

Some enterprises also use the service for authentication to their own systems and services, says Medina.

“So even though the applications themselves might have been available, users weren’t able to log in,” she says. “Authentication is one of these dependencies that sometimes gets overlooked when you’re thinking about availability.”

10. Facebook: April 8

Finally, rounding out the list is Facebook again, with an April outage that lasted about 40 minutes.

“What was really interesting about this particular incident is it really highlighted how Facebook uses DNS to route users to its service,” says Medina.

In this outage, too many users were routed to just one data center, creating network congestion.

“It took some time for them to normalize the routing across their CDN edge,” she says.

Know your digital supply chain

The key takeaway from all these outages is that companies need to be aware of all the components and dependencies that go into making their systems work, both on the back end, supporting their application servers, and on the front end, delivering data to end users.

“These components give us a lot of options and lots of flexibility and ultimately the power to deliver content across the Internet,” Villemez says.

But it results in a complex set of interconnected services and dependencies, many of which are outside a company’s direct control.

“So, for ITOps teams, it is absolutely critical that we know not just our direct dependencies, but also those indirect ones,” he says. Then, companies need to plan ahead for the failure of any of these critical components. “Know how you can work around the problem while providers are trying to resolve something,” he says.
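One way to get a first look at those indirect dependencies is to write down what each component relies on and walk the resulting graph. The Python sketch below does that with a plain dictionary. The component names are hypothetical, and a real inventory would be built from actual architecture data rather than typed in by hand.

```python
def all_dependencies(service, depends_on):
    """Return every direct and indirect dependency of service."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        stack.extend(depends_on.get(dep, []))
    return seen

# Hypothetical dependency map: each key relies on the listed components.
depends_on = {
    "web-app": ["cdn", "login"],
    "login": ["identity-provider"],
    "cdn": ["dns"],
    "identity-provider": ["dns"],
}

direct = set(depends_on["web-app"])
indirect = all_dependencies("web-app", depends_on) - direct
print("direct:", sorted(direct))
print("indirect:", sorted(indirect))
```

In this made-up example, the web application never calls the DNS provider or the identity provider itself, yet an outage at either one would still reach its users.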


Copyright © 2022 IDG Communications, Inc.


