More details about the October 4 outage

Now that our platforms are up and running as usual after yesterday’s outage, I thought it would be worth sharing a little more detail on what happened and why, and most importantly, what we’re learning from it.

This outage was caused by the system that manages our global backbone network capacity. The backbone is the network Facebook built to connect all of our data centers.

These data centers come in different forms. Some are massive buildings that house millions of machines that store data and run the heavy computational loads that keep our platforms running, while others are smaller facilities that connect our backbone network to the broader internet and the people who use our platforms.

When you open one of our apps and load your feed or your messages, the app’s request for data travels from your device to the nearest facility, which then communicates directly over our backbone network with a larger data center. That is where the information your app needs is retrieved and processed, then sent back over the network to your phone.

The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And with the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance, for example to repair a fiber line, add more capacity, or update the software on the router itself.
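
To make the routers’ job a bit more concrete, here is a minimal, purely illustrative sketch of longest-prefix matching, the basic lookup a router performs to decide where to forward traffic. The prefixes and next-hop names below are made up for illustration; real backbone routers use specialized hardware and routing protocols, not anything like this.

```python
import ipaddress

# Toy forwarding table: prefix -> next hop. All values are invented for illustration.
FORWARDING_TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): "backbone-router-a",
    ipaddress.ip_network("10.20.0.0/16"): "edge-facility-b",
    ipaddress.ip_network("0.0.0.0/0"): "internet-transit",  # default route
}

def next_hop(destination: str) -> str:
    """Return the next hop for the most specific (longest) matching prefix."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in FORWARDING_TABLE if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return FORWARDING_TABLE[best]

print(next_hop("10.20.5.9"))    # -> edge-facility-b (the more specific /16 wins)
print(next_hop("10.99.0.1"))    # -> backbone-router-a
print(next_hop("203.0.113.7"))  # -> internet-transit (default route)
```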

This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like this to prevent mistakes of this kind, but a bug in that audit tool kept it from properly stopping the command.
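
The post does not describe the audit tooling itself, but the idea is simple: risky commands are checked against safety rules before they are allowed to run, and a bug in such a check is what let this one through. Below is a rough, hypothetical sketch of that kind of pre-execution audit; none of the names or rules come from Facebook’s actual systems.

```python
from dataclasses import dataclass

# Hypothetical sketch of a pre-execution command audit (illustrative only).

@dataclass
class Command:
    action: str   # e.g. "assess_capacity" or "drain_links" (invented names)
    scope: str    # e.g. "single_link", "region", "global"

def audit(cmd: Command) -> bool:
    """Reject commands that would take backbone capacity down globally."""
    return not (cmd.action == "drain_links" and cmd.scope == "global")

def run(cmd: Command) -> None:
    if not audit(cmd):
        raise RuntimeError(f"audit rejected: {cmd}")
    print(f"executing {cmd.action} over scope: {cmd.scope}")

run(Command("assess_capacity", "global"))   # routine check, allowed
try:
    run(Command("drain_links", "global"))   # would disconnect everything
except RuntimeError as err:
    print(err)
```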

This change caused a complete disconnection of our server connections between our data centers and the internet. And that total loss of connection created a second problem that made matters worse.

One of the jobs of our smaller facilities is to respond to DNS queries. DNS is the address book of the internet, translating the simple web names we type into browsers into specific server IP addresses. Those translation queries are answered by our authoritative name servers, which themselves occupy well-known IP addresses that are advertised to the rest of the internet via another protocol called Border Gateway Protocol (BGP).
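
As a small illustration of that translation step, a resolver lookup can be done from almost any language; the snippet below uses only the Python standard library, and the hostname is just an example. It asks whatever DNS resolver the system is configured with, which ultimately depends on authoritative name servers like the ones described above.

```python
import socket

# Ask the system's DNS resolver to translate a hostname into IP addresses.
hostname = "www.facebook.com"
results = socket.getaddrinfo(hostname, None)

# Each result ends with a sockaddr tuple whose first element is the address.
addresses = sorted({sockaddr[0] for *_, sockaddr in results})
print(f"{hostname} resolves to: {', '.join(addresses)}")
```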

To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves cannot speak to our data centers, since that is an indication of an unhealthy network connection. In the recent outage, the entire backbone was taken out of service, which made these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
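
Here is a heavily simplified sketch of that failure mode, with entirely made-up function names; the real implementation involves BGP daemons and network probes, not a script like this. The point is only to show how a global backbone failure causes every location to withdraw its DNS routes at once.

```python
# Hypothetical sketch of the health-check behavior described above.
# If an edge location cannot reach any data center over the backbone,
# it stops advertising its DNS server prefixes via BGP, so the rest of
# the internet no longer routes DNS queries to it.

DATA_CENTERS = ["dc-1", "dc-2", "dc-3"]  # illustrative names

def backbone_reachable(data_center: str) -> bool:
    # Placeholder for a real connectivity probe over the backbone.
    return False  # during the outage, every probe failed

def update_dns_advertisement(announce: bool) -> None:
    if announce:
        print("announcing DNS server prefixes via BGP")
    else:
        print("withdrawing DNS server prefixes from BGP")

healthy = any(backbone_reachable(dc) for dc in DATA_CENTERS)
update_dns_advertisement(announce=healthy)
# With the backbone down, every location withdraws its routes at once,
# which is why the DNS servers became unreachable while still running.
```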

All of this happened very fast. And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we would normally use to investigate and resolve outages like this.

Our primary and out-of-band network access was down, so we sent engineers on site to the data centers to debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They are hard to get into, and once you are inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took additional time to activate the secure access protocols needed to get people on site and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.

Once our backbone network connectivity was restored across our data center regions, everything came back with it. But the problem was not over yet: we knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.
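
One common way to avoid that kind of surge is to bring load back in stages rather than all at once, checking system health between steps. The sketch below is a generic illustration of the idea, not Facebook’s actual restoration tooling; the step sizes and health check are invented.

```python
import time

# Illustrative only: ramp traffic back up in small increments instead of
# restoring 100% of load at once, pausing to verify health between steps.
RAMP_STEPS = [5, 10, 25, 50, 75, 100]  # percent of normal traffic

def systems_healthy() -> bool:
    # Placeholder for real checks (power draw, cache hit rates, error rates).
    return True

for percent in RAMP_STEPS:
    print(f"allowing {percent}% of normal traffic")
    time.sleep(1)  # in practice, each step would be observed for much longer
    if not systems_healthy():
        print("health check failed; holding the ramp at this level")
        break
```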

Helpfully, this is an event we are well prepared for thanks to the “storm” drills we have been running for a long time. In a storm exercise, we simulate a major system failure by taking a service, a data center, or an entire region offline, stress testing all the infrastructure and software involved. Experience from these drills gave us the confidence and experience to bring things back online and carefully manage the increasing loads. In the end, our services came back up relatively quickly without any further system-wide failures. And while we have never previously run a storm that simulated our global backbone being taken offline, we will certainly be looking for ways to simulate events like this in the future.

Every failure like this is an opportunity to learn and get better, and there is plenty for us to learn from this one. After every issue, big and small, we run a comprehensive review process to understand how we can make our systems more resilient. That process is already under way.

We have done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but by an error of our own making. I believe a tradeoff like this is worth it: greatly increased day-to-day security versus a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.
