10 things to know about data-center outages
Data-center outage severity appears to be falling, while the cost of outages continues to climb.
Power failures are “the biggest cause of significant site outages”.
Network failures and IT system glitches also bring down data centers, and human error often contributes.
Those are some of the problems pinpointed in the most recent Uptime Institute data-center outage report that analyzes types of outages, their frequency, and what they cost both in money and consequences.
Unreliable data is an ongoing problem
Uptime cautions that data relating to outages should be treated skeptically given the lack of transparency of some outage victims and the quality of reporting mechanisms. “Outage information is opaque and unreliable,” said Andy Lawrence, executive director of research at Uptime, during a briefing about Uptime’s Annual Outages Analysis 2023.
While some industries, such as airlines, have mandatory reporting requirements, there’s limited reporting in other industries, Lawrence said. “So we have to rely on our own means and methods to get the data. And as we all know, not everybody wants to share details about outages for a whole variety of reasons. Sometimes you get a very detailed root-cause analysis, and other times you get pretty well nothing,” he said.
The Uptime report culled data from three main sources: Uptime’s Abnormal Incident Report (AIRs) database; its own surveys; and public reports, which include news stories, social media, outage trackers, and company statements. The accuracy of each varies. Public reports may lack details and sources might not be trustworthy, for example. Uptime rates its own surveys as producing fair/good data, since the respondents are anonymous, and their job roles vary. AIRs quality is deemed very good, since it comprises detailed, facility-level data voluntarily shared by data-center owners and operators among their peers.
Outage rates are shrinking slightly
There’s evidence that outage rates have been gradually falling in recent years, according to Uptime.
That doesn’t mean the total number of outages is shrinking—in fact, the number of outages globally increases each year as the data-center industry expands. “This can give the false impression that the rate of outages relative to IT load is growing, whereas the opposite is the case,” Uptime reported. “The frequency of outages is not growing as fast as the expansion of IT or the global data-center footprint.”
Overall, Uptime has observed a steady decline in the outage rate per site, as tracked through four of its own surveys of data-center managers and operators conducted from 2020 to 2022. In 2022, 60% of survey respondents said they had an outage in the past three years, down from 69% in 2021 and 78% in 2020.
“There seems to be a gently, gently improving picture of the outage rate,” Lawrence said.
Outage severity appears to be decreasing
While 60% of data-center sites have experienced an outage in the past three years, only a small proportion are rated serious or severe.
Uptime measures the severity of outages on a scale of one to five, with five being the most severe. Level 1 outages are negligible and cause no service disruptions. Level 5 mission-critical outages involve major and damaging disruption of services and/or operations and often include large financial losses, safety issues, compliance breaches, customer losses, and reputational damage.
Level 5 and Level 4 (serious) outages historically account for about 20% of all outages. In 2022, outages in the serious/severe categories fell to 14%.
A key reason is that data-center operators are better equipped to handle unexpected events, according to Chris Brown, chief technical officer at Uptime. “We’ve become much better at designing systems and managing operations to a point where a single fault or failure does not necessarily result in a severe or serious outage,” he said.
Today’s systems are built with redundancy, and operators are more disciplined about creating systems that are capable of responding to abnormal incidents and averting outages, Brown said.
The financial toll is rising
When outages do occur, they are becoming more expensive—a trend that is likely to continue as dependency on digital services grows.
Looking at the last four years of Uptime’s own survey data, the proportion of major outages that cost more than $100,000 in direct and indirect costs is increasing. In 2019, 60% of outages fell under $100,000 in terms of recovery costs. In 2022, just 39% of outages cost less than $100,000.
Also in 2022, 25% of respondents said their most recent outage cost more than $1 million, and 45% said their most recent outage cost between $100,000 and $1 million.
Inflation is part of the reason, Brown said; the costs of replacement equipment and labor are higher.
More significant is the degree to which companies depend on digital services to run their businesses. The loss of a critical IT service can be tied directly to disrupted business and lost revenue. “Any of these outages, especially the serious and severe outages, have the ability to impact multiple organizations, and a larger swath of people,” Brown said, “and the cost of having to mitigate that is ever increasing.”
Third-party providers are behind most high-profile, public outages
As more workloads are outsourced to external service providers, the reliability of third-party digital infrastructure companies is increasingly important to enterprise customers, and these providers tend to suffer the most public outages.
Third-party commercial operators of IT and data centers—cloud providers, digital service providers, telecommunications providers—accounted for 66% of all the public outages tracked since 2016, Uptime reported. Looked at year-by-year, the percentage has been creeping up. In 2021 the proportion of outages caused by cloud, colocation, telecommunications, and hosting companies was 70%, and in 2022 it was up to 81%.
“The more that companies push their IT services into other people’s domain, they’re going to have to do their due diligence, and also continue to do their due diligence” even after the deal is struck, Brown said.
Human error is a frequent contributor to outages and a relatively simple factor to address
While it’s rarely the single or root cause of an outage, human error plays some role in 66% to 80% of all outages, according to Uptime’s estimate based on 25 years of data. But it acknowledges that analyzing human error is challenging. Shortcomings such as improper training, operator fatigue, and a lack of resources can be difficult to pinpoint.
Uptime found that human error-related outages are mostly caused either by staff failing to follow procedures (cited by 47% of respondents) or by the procedures themselves being faulty (40%). Other common causes include in-service issues (27%), installation issues (20%), insufficient staff (14%), preventative maintenance-frequency issues (12%), and data-center design or omissions (12%).
On the positive side, investing in good training and management processes can go a long way toward reducing outages without costing too much.
“You don’t need to go to a banker and get a bunch of capital money to solve these problems,” Brown said. “People need to make the effort to create the procedures, test them, make sure they’re correct, train their staff to follow them, and then have the oversight to ensure that they truly are following them.”
“This is the low hanging fruit to prevent outages, because human error is implicated in so many,” Lawrence said.
Power problems continue to hamper data-center reliability
Uptime said its current survey findings are consistent with previous years’ and show that on-site power problems remain the biggest cause of significant site outages by a large margin. That holds even though most outages have several causes and the quality of reporting about them varies.
In 2022, 44% of respondents said power was the primary cause of their most recent impactful incident or outage. Power was also the leading cause of significant outages in 2021 (cited by 43%) and 2020 (37%).
Network issues, IT system errors, and cooling failures also stand out as troubling causes, Uptime said.
Network complexity leads to more outages
Uptime used its own data, from its 2023 Uptime resiliency survey, to dig into network outage trends. Among survey respondents, 44% said their organization had experienced a major outage caused by network or connectivity issues over the past three years. Another 45% said no, and 12% didn’t know.
The two most common causes of networking- and connectivity-related outages are configuration or change management failure (cited by 45% of respondents) and a third-party network provider’s failure (39%).
Uptime attributed the trend to today’s network complexity. “In modern, dynamically switched and software-defined environments, programs to manage and optimize networks are constantly revised or reconfigured. Errors become inevitable, and in such a complex and high-throughput environment, frequent small errors can propagate across networks, resulting in cascading failures that can be difficult to stop, diagnose, and fix,” Uptime reported.
Other common causes of major network-related outages include:
- Hardware failure: 37%
- Line breakages: 27%
- Firmware/software error: 23%
- Cyberattack: 14%
- Network/congestion failure: 12%
- Weather-related incident: 7%
- Corrupted firewall/routing table issues: 6%
Common causes of IT system and software outages
When Uptime asked respondents to its resiliency survey if their organization experienced a major outage caused by an IT systems or software failure over the past three years, 36% said yes, 50% said no, and 15% didn’t know. The most common causes of outages related to IT systems and software are:
- Configuration/change management issue: cited by 64%
- Firmware/software fault: 40%
- Hardware failure: 36%
- Capacity/congestion issue: 22%
- Data synchronization/corruption: 14%
- Cyberattack/security issue: 10%
Data-center fires aren’t common but can be devastating
Publicly recorded outages, which include outages that are reported in the media, reveal a wide range of causes. The causes can differ from what data-center operators and IT teams report, since the media sources’ knowledge and understanding of outages depends on their perspective. “What’s really interesting is the sheer variety of causes, and that’s partly because this is how the public and the media perceive them,” Lawrence said.
Fire is one cause that showed up among publicly reported outages but didn’t rank highly among IT-related sources. Specifically, Uptime found that 7% of publicly reported data-center outages were caused by fires. In the web briefing, Uptime researchers related the incidence of data-center fires to increasing use of lithium-ion (Li-ion) batteries.
Li-ion batteries have a smaller footprint, simpler maintenance, and longer lifespan compared to lead-acid batteries. However, Li-ion batteries present a greater fire risk. A Maxnod data center in France suffered a devastating fire on March 28, 2023, and “we believe it’s caused by lithium-ion battery fire,” Lawrence said. A lithium-ion battery fire is also the reported cause of a major fire on Oct. 15, 2022, at a South Korea colocation facility owned by SK Group and operated by its C&C subsidiary.
“We find, every time we do these surveys, fire doesn’t go away,” Lawrence said.
Copyright © 2023 IDG Communications, Inc.