Backbone Management at Facebook – Facebook Engineering

What the research is:

A first-of-its-kind study detailing our backbone management strategy for maintaining high service levels during the COVID-19 pandemic. The pandemic moved most social interaction online and created an unprecedented stress test for our global network infrastructure, which spans dozens of data center regions. At this scale, failures such as fiber cuts, router misconfigurations, and power outages are common.

We ran a simulation system that identified potential failures and quantified their severity using a set of network risk metrics. The risk metrics, in turn, guided operational decisions about capacity provisioning. Combined with traffic prioritization and proactive capacity augmentation, this approach allowed our backbone to withstand the COVID-19 stress test while maintaining high service availability and low latency and coping efficiently with traffic surges.

How it works:

To meet the network's service level objectives (SLOs), we first defined a set of risk metrics covering demand loss, availability, and latency stretch. Each metric is computed over the possible failure scenarios in the network, which can be enumerated by traversing all of the network's components. The goal of failure modeling is to estimate the probability of each failure scenario and the duration of the failure event. Each component failure is characterized by its mean time between failures (MTBF) and mean time to repair (MTTR). These statistics are estimated from a combination of historical data and clustering, followed by Bayesian regression modeling on shared characteristics such as provider, ownership, and geographic region.
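As a rough illustration of how per-component statistics can feed a scenario probability, the sketch below uses the standard steady-state approximation, unavailability = MTTR / (MTBF + MTTR), and assumes that components fail independently. Both the independence assumption and the names `Component` and `scenario_probability` are ours for illustration; the post does not describe the production model at this level of detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Hypothetical network component (for example, a fiber span or a router)."""
    name: str
    mtbf_hours: float  # mean time between failures
    mttr_hours: float  # mean time to repair

    def unavailability(self) -> float:
        # Steady-state fraction of time the component is down.
        return self.mttr_hours / (self.mtbf_hours + self.mttr_hours)


def scenario_probability(failed, all_components):
    """Probability that exactly the components in `failed` are down.

    Assumes independent failures, which is a simplification.
    """
    failed_names = {c.name for c in failed}
    p = 1.0
    for c in all_components:
        u = c.unavailability()
        p *= u if c.name in failed_names else (1.0 - u)
    return p


if __name__ == "__main__":
    fiber = Component("fiber-1", mtbf_hours=990.0, mttr_hours=10.0)
    router = Component("router-1", mtbf_hours=9999.0, mttr_hours=1.0)
    # Probability that only the fiber is down at a random instant.
    print(scenario_probability([fiber], [fiber, router]))
```

The MTTR also gives the expected duration of the failure event, so a scenario can be weighted by both how likely it is and how long it lasts.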

Our risk simulation system periodically computes the risk metrics above. It takes as input a fresh snapshot of the network topology and traffic demand, along with the failure scenarios to consider and their failure characteristics. Because the number of failure scenarios is large, they are split across a number of worker jobs, each of which runs the same code as our SD-WAN controller to compute the traffic engineering decision for a given failure scenario. The decisions are then aggregated to derive the risk metrics, which are logged for continuous monitoring.
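The fan-out and aggregation pattern can be sketched as follows. This is a toy model under stated assumptions, not the production system: `LINKS`, `DEMANDS`, and the greedy `route` function are hypothetical stand-ins for the topology snapshot, the demand snapshot, and the controller code, and a thread pool stands in for distributed worker jobs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical snapshot: link capacities in Gbps and demands with candidate paths.
LINKS = {"A-B": 100.0, "B-C": 100.0, "A-C": 50.0}
DEMANDS = [
    {"name": "A->C", "gbps": 80.0, "paths": [["A-B", "B-C"], ["A-C"]]},
]


def route(failed_links):
    """Stand-in for the controller: greedily place demands, return lost Gbps."""
    residual = {l: c for l, c in LINKS.items() if l not in failed_links}
    lost = 0.0
    for demand in DEMANDS:
        remaining = demand["gbps"]
        for path in demand["paths"]:
            if not all(l in residual for l in path):
                continue  # path crosses a failed link
            placed = min(remaining, min(residual[l] for l in path))
            for l in path:
                residual[l] -= placed
            remaining -= placed
        lost += remaining
    return lost


def simulate(scenarios, workers=4):
    """Evaluate each failure scenario in a worker and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(route, scenarios))
    return dict(zip(scenarios, losses))


if __name__ == "__main__":
    # Enumerate the no-failure case plus every single-link failure.
    scenarios = [frozenset()] + [frozenset({l}) for l in LINKS]
    results = simulate(scenarios)
    # One possible aggregate: the worst traffic loss across all scenarios.
    print(max(results.values()))
```

Running the same routing logic in simulation as in production means the computed risk reflects how the network would actually react to each failure, rather than an idealized model of it.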

At the beginning of COVID-19, the risk metrics showed a significant increase in demand loss (which captures the highest traffic loss across all simulated failure scenarios), a decrease in availability, and an increase in latency for all quality of service (QoS) classes. The risk metrics pointed us to the failure scenarios whose occurrence would degrade network operating conditions for particular regions, and we proactively added capacity to mitigate these risks. Another useful technique was to examine the traffic flows from the at-risk regions, differentiate the traffic by criticality, and downgrade less critical traffic to a lower-priority QoS class. In order of importance, the QoS classes are infrastructure control (class 1), user traffic (class 2), internal applications (class 3), and bulk data transfer (class 4). We downgraded a large amount of latency-insensitive traffic from class 3 to class 4, which means that less capacity is required to maintain the same SLO level.
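To see why downgrading helps, consider a single link whose capacity has been reduced by a failure. The sketch below assumes strict-priority scheduling, where lower class numbers are served first; the scheduling model, the traffic volumes, and the function names are illustrative assumptions rather than details from the post.

```python
def strict_priority(capacity, demand_by_class):
    """Serve classes in ascending class number until capacity runs out."""
    served = {}
    remaining = capacity
    for qos in sorted(demand_by_class):
        served[qos] = min(demand_by_class[qos], remaining)
        remaining -= served[qos]
    return served


def downgrade(demand_by_class, src, dst, gbps):
    """Return a copy of the demand with up to `gbps` moved from src to dst."""
    moved = min(gbps, demand_by_class.get(src, 0.0))
    out = dict(demand_by_class)
    out[src] = out.get(src, 0.0) - moved
    out[dst] = out.get(dst, 0.0) + moved
    return out


if __name__ == "__main__":
    # Hypothetical demand in Gbps per QoS class on a link with 100 Gbps left.
    before = {1: 10.0, 2: 50.0, 3: 60.0, 4: 20.0}
    print(strict_priority(100.0, before))  # class 3 is only partly served

    # Move 30 Gbps of latency-insensitive traffic from class 3 to class 4.
    after = downgrade(before, src=3, dst=4, gbps=30.0)
    print(strict_priority(100.0, after))   # class 3 is now fully served
```

In this toy case the same 100 Gbps fully serves classes 1 through 3 once the latency-insensitive traffic has moved to class 4, so no extra capacity is needed to protect the higher classes.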

Why it matters:

Building capacity for backbone networks has a long lead time of months to years. As a result, network operators typically procure capacity based on projected traffic growth. When COVID-19 hit, traffic surged significantly and unexpectedly within a short period, putting a strain on backbone infrastructure around the world.

Thanks to its risk-driven backbone management strategy, Facebook was able to react quickly. Using the risk metrics calculated by our simulation system, we quickly identified operational weak points and prioritized capacity augmentation to bring the network back to normal. Our experience shows that a metric-centric approach to backbone management can adapt to rare, adverse external shocks. We hope our research can help operators build more resilient networks. We thank Ying Zhang, Guanqing Yan, Satyajeet Singh Ahuja, Alexander Nikolaidis, Soshant Bali, Bob Kamma, and Gaya Nagarajan for their work on this project.

To find out more, check out our presentation at NSDI 2021.

Read the full paper:

A Social Network Under Social Distancing: Risk-Driven Backbone Management During COVID-19 and Beyond
