What on earth happened to Cloudflare last week?
On November 2, 2023, Cloudflare's customer-facing interfaces, including its website and APIs, along with its logging and analytics services, stopped functioning properly. That was bad.
Over 7.5 million websites use Cloudflare, and 3,280 of the world's 10,000 most popular websites depend on its content delivery network (CDN) services. The good news is that the CDN didn't go down. The bad news is that the Cloudflare Dashboard and its related application programming interfaces (APIs) were down for almost two days.
That kind of thing just doesn't happen — or it shouldn't, anyway — to major internet service companies. So, the multi-million-dollar question is: 'What happened?' The answer, according to Cloudflare CEO Matthew Prince, was a power-related incident at one of the company's three primary data centers in Oregon, a facility managed by Flexential, that cascaded into one problem after another. Thirty-six hours later, Cloudflare was finally back to normal.
Prince didn’t pussyfoot around the problem:
To start, this never should have happened. We believed that we had high availability systems in place that should have stopped an outage like this, even when one of our core data center providers failed catastrophically. And, while many systems did remain online as designed, some critical systems had non-obvious dependencies that made them unavailable. I am sorry and embarrassed for this incident and the pain that it caused our customers and our team.
He's right — this incident never should have happened. Cloudflare's control plane and analytics systems run on servers in three data centers around Hillsboro, Oregon. The three are independent of one another; each has multiple utility power feeds and multiple redundant, independent internet connections.
The trio of data centers is not so close together that a single natural disaster would knock them all out at once, yet close enough that they can run active-redundant data clusters. So, by design, if any one facility goes offline, the remaining two should pick up the load and keep operating.
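The failover idea is simple to sketch. In the hypothetical snippet below (facility names and health flags are illustrative inventions, not Cloudflare's actual systems), each request is simply routed to the first healthy facility, so losing one data center shouldn't matter:

```python
# Minimal sketch of active-redundant failover across three facilities.
# All names here are hypothetical; this is not Cloudflare's real code.

FACILITIES = ["pdx-a", "pdx-b", "pdx-c"]  # three nearby data centers
HEALTHY = {"pdx-a": False, "pdx-b": True, "pdx-c": True}  # pdx-a lost power

def serve(request: str) -> str:
    """Route a request to the first healthy facility."""
    for facility in FACILITIES:
        if HEALTHY[facility]:
            return f"{request} handled by {facility}"
    raise RuntimeError("all facilities offline")

print(serve("GET /dashboard"))  # prints "GET /dashboard handled by pdx-b"
```

The point of the design is that the routing layer, not the caller, absorbs a facility failure — which only works if every service the caller touches is actually replicated behind that layer.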
Sounds great, doesn’t it? However, that’s not what happened.
What happened first was a power failure at Flexential's facility. Portland General Electric (PGE) was forced to shut down one of its independent power feeds into the building. The data center has multiple feeds, each with some level of independence, that can power the facility, and Flexential powered up its generators to supplement the feed that was down.
That approach, by the way, for those of you who don't know data center best practices, is a no-no: you don't run utility power and generators at the same time. Adding insult to injury, Flexential didn't tell Cloudflare that it had sort of, kind of, transitioned to generator power.
Then, there was a ground fault on a PGE transformer feeding the data center. And when I say ground fault, I don't mean a short like the one that sends you down to the basement to fix a fuse. I mean a 12,470-volt bad boy that took down the utility connection and all the generators in less time than it took you to read this sentence.
In theory, a bank of UPS batteries should have kept the servers going for 10 minutes, which in turn should have been enough time to crank the generators back on. Instead, the UPSs started dying in about four minutes, and the generators never made it back on in time anyway.
Whoops.
There might have been no one who was able to save the situation, but when the onsite, overnight staff “consisted of security and an unaccompanied technician who had only been on the job for a week,” the situation was hopeless.
In the meantime, Cloudflare discovered the hard way that some critical systems and newer services had not yet been integrated into its high-availability setup. Furthermore, Cloudflare's decision to keep logging systems out of the high-availability cluster, on the assumption that analytics delays would be acceptable, turned out to be wrong. Because Cloudflare's staff couldn't get a good look at the logs to see what was going wrong, the outage lingered on.
It turned out that, while the three data centers were "mostly" redundant, they weren't completely so. The other two data centers in the area did take over responsibility for the high-availability cluster and kept critical services online.
So far, so good. However, a subset of services that were supposed to be on the high-availability cluster had dependencies on services that were running exclusively on the dead data center.
Specifically, two critical services that process logs and power Cloudflare's analytics — Kafka and ClickHouse — were only available in the offline data center. So, when services in the high-availability cluster called for Kafka and ClickHouse, they failed.
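That hidden-dependency failure mode is easy to reproduce in miniature. In the hypothetical sketch below (all names are illustrative, not Cloudflare's actual architecture), a service that is itself replicated across facilities still fails, because its log pipeline lives in only one:

```python
# Sketch of a hidden single-facility dependency inside an "HA" service.
# Names are illustrative; not Cloudflare's real code.

ONLINE_FACILITIES = {"pdx-b", "pdx-c"}   # pdx-a has lost power
LOG_PIPELINE_FACILITY = "pdx-a"          # log pipeline was never replicated

def write_log(entry: str) -> None:
    """Ship a log entry to the (single-facility) log pipeline."""
    if LOG_PIPELINE_FACILITY not in ONLINE_FACILITIES:
        raise ConnectionError("log pipeline unreachable")

def handle_request(request: str) -> str:
    # The service itself runs in every online facility...
    write_log(f"handled {request}")  # ...but hard-depends on one facility
    return "ok"

try:
    handle_request("GET /analytics")
except ConnectionError as err:
    print(f"HA service failed anyway: {err}")
```

The service survives the outage on paper; in practice, every request dies the moment it touches the unreplicated dependency — which is exactly what Cloudflare found.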
Cloudflare admits it was “far too lax about requiring new products and their associated databases to integrate with the high-availability cluster.” Moreover, far too many of its services depend on the availability of its core facilities.
Lots of companies do things this way, but, as Prince admitted, this "does not play to Cloudflare's strength. We are good at distributed systems. Throughout this incident, our global network continued to perform as expected. But far too many fail if the core is unavailable. We need to use the distributed systems products that we make available to all our customers for all our services, so they continue to function mostly as normal even if our core facilities are disrupted."
Hours later, everything was finally back up and running — and it wasn't easy. For example, almost all the power breakers were fried, and Flexential had to go out and buy more to replace them all.
Expecting that there had been multiple power surges, Cloudflare also decided the “only safe process to recover was to follow a complete bootstrap of the entire facility.” That approach meant rebuilding and rebooting all the servers, which took hours.
The incident, which lasted until November 4, was eventually resolved. Looking forward, Prince concluded: “We have the right systems and procedures in place to be able to withstand even the cascading string of failures we saw at our data center provider, but we need to be more rigorous about enforcing that they are followed and tested for unknown dependencies. This will have my full attention and the attention of a large portion of our team through the balance of the year. And the pain from the last couple of days will make us better.”