What took Facebook down: Major global outage drags on | ZDNet
The old network troubleshooting saying is that when anything goes wrong, "It's DNS." This time, the Domain Name System (DNS) appears to be a symptom rather than the root cause of Facebook's global failure. The true cause is that there are no working Border Gateway Protocol (BGP) routes into Facebook's sites.
BGP is the standardized exterior gateway protocol used to exchange routing and reachability information between the internet's top-level autonomous systems (ASes). Most people, indeed most network administrators, never need to deal with BGP.
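To see what "no working BGP routes" means in practice, you can ask a public BGP data source which prefixes an autonomous system is currently announcing. A minimal sketch follows, assuming Facebook's primary autonomous system number, AS32934, and RIPEstat's public "announced-prefixes" endpoint; this is an illustration of how an outsider could check reachability, not the tooling Facebook itself uses.

```python
# Minimal sketch: ask RIPEstat which prefixes an AS is currently announcing.
# If the list comes back empty, the rest of the internet has no BGP route
# into that network, which is the situation described for Facebook here.
import json
import urllib.request

ASN = "AS32934"  # Facebook's primary ASN (illustrative; RIPEstat public data API assumed)
URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

with urllib.request.urlopen(URL, timeout=10) as resp:
    data = json.load(resp)

prefixes = data.get("data", {}).get("prefixes", [])
print(f"{ASN} is announcing {len(prefixes)} prefixes")
for p in prefixes[:10]:
    print(" ", p.get("prefix"))
```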
Many people spotted that Facebook's domains were no longer resolving in DNS. Indeed, there were joke posts offering to sell you the Facebook.com domain.
Cloudflare VP Dane Knecht was the first to report the underlying BGP problem. This meant, as Kevin Beaumont, formerly head of Microsoft's Security Operations Centre, tweeted, "By not having BGP announcements for your DNS name servers, DNS falls apart = nobody can find you on the internet. Same with WhatsApp btw. Facebook have basically deplatformed themselves from their own platform."
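Part of why the DNS symptom followed so directly is that facebook.com's authoritative name servers live inside Facebook's own address space, so once the BGP routes to that space were withdrawn, resolvers everywhere lost even the ability to ask where facebook.com is. A rough illustration of that dependency chain, assuming the third-party dnspython library (any resolver tool would show the same thing):

```python
# Rough illustration (assumes the third-party dnspython package: pip install dnspython).
# Look up facebook.com's NS records, then try to resolve each name server's address.
# During an outage like this, the second step fails because those servers sit in
# Facebook's own, now-unreachable IP space, so the whole domain effectively vanishes.
import dns.exception
import dns.resolver

try:
    ns_records = dns.resolver.resolve("facebook.com", "NS")
    for ns in ns_records:
        name = str(ns.target)
        try:
            a_answer = dns.resolver.resolve(name, "A")
            addrs = [rr.to_text() for rr in a_answer]
            print(f"{name} -> {', '.join(addrs)}")
        except dns.exception.DNSException as err:
            print(f"{name} -> unresolvable ({err.__class__.__name__})")
except dns.exception.DNSException as err:
    print(f"Cannot even fetch NS records: {err.__class__.__name__}")
```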
Whoops.
As annoying as this is to you, it may be even more annoying to Facebook employees. There are reports that Facebook employees can’t enter their buildings because their “smart” badges and doors were also disabled by this network failure. If true, Facebook’s people literally can’t enter the building to fix things.
In the meantime, Reddit user u/ramenporn, who claimed to be a Facebook employee working on bringing the social network back from the dead, reported, before he deleted his account and his messages, that “DNS for FB services has been affected and this is likely a symptom of the actual issue, and that’s that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).”
He continued, “There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified. Part of this is also due to lower staffing in data centers due to pandemic measures.”
Ramenporn also stated that it wasn’t an attack, but a mistaken configuration change made via a web interface. What really stinks — and why Facebook is still down hours later — is that since both BGP and DNS are down, the “connection to the outside world is down, remote access to those tools don’t exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.” Of course, the technicians on site don’t know how to do that and senior network administrators aren’t on site. This is, in short, one big mess.
As a former network admin who worked on the internet at this level, I anticipate Facebook will be down for hours more. I suspect it will end up being Facebook’s longest and most severe failure to date before it’s fixed.