Facebook Blames Global Outage on Configuration Error


Facebook has apologized for a major global outage that left users unable to access the social network and other platforms for hours, blaming the incident on a configuration error.

The outage began at around 11:40 a.m. Eastern Time on Monday and lasted well into the evening, affecting not just Facebook and Messenger but also Instagram and WhatsApp.

The recovery effort was also hampered because Facebook engineers found it difficult to access internal tooling, which relied on the same internet infrastructure. Staff around the world were left high and dry for similar reasons.

The issue appears to have stemmed from an update to the firm’s Border Gateway Protocol (BGP) records. BGP is critical to the seamless functioning of the internet, allowing networks of addresses such as Facebook’s to advertise their presence to others.

“It’s a mechanism to exchange routing information between autonomous systems (AS) on the internet,” explained Cloudflare in a technical blog post about the incident.

“The big routers that make the internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the internet routers wouldn’t know what to do, and the internet wouldn’t work.”
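To make the idea of advertising routes concrete, here is a minimal sketch in Python of a toy routing table. It is not real BGP: there are no peering sessions, path attributes or route selection policies, and the `RouteTable` class, its method names, the address prefix and the AS number are all hypothetical, chosen from ranges reserved for documentation. It only illustrates the general point that a router can deliver traffic to a network only while a route to it is being advertised.

```python
import ipaddress


class RouteTable:
    """Toy routing table (hypothetical): maps advertised prefixes to an origin AS."""

    def __init__(self):
        # Keys are ip_network objects, values are the AS numbers that announced them.
        self.routes = {}

    def advertise(self, prefix, origin_as):
        """Record that origin_as announces reachability for prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Remove a previously advertised prefix, if present."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS for the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # Longest-prefix match: prefer the most specific route.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


# Illustrative values only: 192.0.2.0/24 and AS 64496 are documentation ranges.
table = RouteTable()
table.advertise("192.0.2.0/24", 64496)
print(table.lookup("192.0.2.10"))  # a route exists while the prefix is advertised

table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.10"))  # no route once the prefix is withdrawn
```

In this simplified model, the servers behind a withdrawn prefix may still be running, but other networks have no route by which to reach them.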

Although some commentators had speculated about foul play, the cause of the outage appears to be human error.

Facebook’s vice president of infrastructure, Santosh Janardhan, said no user data was compromised and that the root cause of the issue was a “faulty configuration change.”

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our datacenters caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our datacenters communicate, bringing our services to a halt,” he explained.

“People and businesses around the world rely on us every day to stay connected. We understand the impact outages like these have on people’s lives, and our responsibility to keep people informed about disruptions to our services. We apologize to all those affected, and we’re working to understand more about what happened today so we can continue to make our infrastructure more resilient.”


