Network hose: Managing uncertain demand
Our production backbone network connects our data centers and delivers content to our users. The network supports a large number of different services that are distributed across a large number of data centers. Traffic patterns shift from one data center to another over time due to the introduction of new services, changes in service architecture, changes in user behavior, new data centers, and so on. As a result, we have seen exponentially growing and highly variable traffic demand for many years.
To meet service bandwidth expectations, we need an accurate long-term demand forecast. However, due to the nature of our services, the fluidity of workloads, and the difficulty of anticipating future service behavior, it is hard to predict future traffic between each data center pair (i.e., the traffic matrix). To address this traffic uncertainty, we changed our design methodology to remove our reliance on predicting the future traffic matrix. We achieve this by designing the production network for the aggregate traffic into or out of each data center, which we call a network hose. By planning for network hoses, we reduce the forecast complexity by an order of magnitude.
The traditional approach to network planning
The traditional approach to network design is to dimension the topology to carry a given traffic matrix under a set of potential failures that we define using a failure protection policy. In this approach:
- The traffic matrix is the volume of traffic that we forecast between any two data centers (pairwise demand).
- The failure protection policy is a set of failures that are commonly observed on any large network, such as a single fiber cut or a dual submarine link failure, or a set of failures that have occurred multiple times in the past.
- We use a cost optimization model to calculate the network capacity plan. In essence, an integer linear programming formulation ensures that enough capacity is available to serve the traffic matrix under every failure defined in our policy.
What is the problem?
The following reasons prompted us to rethink the classic approach:
- Lack of long-term fidelity: Building backbone network capacity requires long lead times, typically on the order of months or even years, for example when purchasing or building a terrestrial fiber route or building a submarine cable across the Atlantic. Given the past growth and dynamism of our services, it is difficult to predict service behavior more than six months ahead. Our original approach was to handle traffic uncertainty by sizing the network for worst-case assumptions and for a high percentile, say P95.
- Asking each service owner to provide a traffic estimate per data center pair is difficult to manage. In the traditional approach, a service owner must provide an explicit demand specification. This is daunting because not only does current service behavior change, but we also don't know which new services will be introduced and will use our network a year or more from now. Producing an exact traffic forecast is even harder in the long term, since upcoming data centers are not yet in operation at the time the forecast is requested.
- Abstracting the network as a consumable resource: A service typically requires compute, storage, and network resources to operate from a data center. Each data center has a known amount of compute and storage that can be distributed across various services. A service owner can reason about short- and long-term requirements for these resources and view them as a consumable unit per data center. However, this does not apply to the network, because the network is a shared resource. It is desirable to have a planning method that abstracts away the complexity of the network and presents it to services like any other consumable unit per data center.
- Operational overhead: It is becoming increasingly difficult to track each service's surge in traffic, identify its cause, and assess its potential impact. Most of these surges are harmless because not all services surge at the same time. Nonetheless, keeping track of the many resulting false alarms is a burden.
Network hose planning model
Instead of forecasting traffic for each (source, destination) pair, we forecast the total outbound and total inbound traffic per data center, i.e., the network hose. Instead of asking how much traffic a service would generate from X to Y, we ask how much inbound and outbound traffic the service is likely to generate at X. We thereby replace O(N^2) data points per service with O(N). Because we plan for aggregated traffic, the forecast naturally accounts for statistical multiplexing.
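To make the reduction from O(N^2) to O(N) concrete, the following minimal Python sketch collapses a pairwise traffic matrix into per-data-center egress and ingress totals and counts the forecast entries in each representation. The site names, volumes, and the helpers `matrix_to_hose` and `forecast_points` are hypothetical and chosen only for illustration; they are not part of the planning system described here.

```python
from typing import Dict, Tuple

# (source, destination) -> forecast volume; units are arbitrary (e.g. Tbps).
TrafficMatrix = Dict[Tuple[str, str], float]


def matrix_to_hose(matrix: TrafficMatrix) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Collapse a pairwise matrix into per-data-center egress and ingress totals."""
    egress: Dict[str, float] = {}
    ingress: Dict[str, float] = {}
    for (src, dst), volume in matrix.items():
        egress[src] = egress.get(src, 0.0) + volume
        ingress[dst] = ingress.get(dst, 0.0) + volume
    return egress, ingress


def forecast_points(num_sites: int) -> Tuple[int, int]:
    """Return (pairwise entries, hose entries) that a forecast must supply."""
    pairwise = num_sites * (num_sites - 1)  # every ordered pair: O(N^2)
    hose = 2 * num_sites                    # one egress + one ingress per site: O(N)
    return pairwise, hose


# Hypothetical three-site demand, for illustration only.
example: TrafficMatrix = {
    ("A", "B"): 4.0,
    ("A", "C"): 1.0,
    ("B", "A"): 2.0,
    ("C", "A"): 3.0,
    ("C", "B"): 0.5,
}
egress, ingress = matrix_to_hose(example)
print(egress)               # {'A': 5.0, 'B': 2.0, 'C': 3.5}
print(forecast_points(20))  # (380, 40)
```

With 20 sites, a pairwise forecast needs 380 entries while the hose needs 40, which is roughly the order-of-magnitude reduction mentioned above.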
The above figure reflects the change in the input to the planning problem. Instead of a classic traffic matrix, we now have only a traffic hose forecast, and we generate a network plan that supports it under all failures defined by the failure policy.
Solving the planning challenge
While the network hose model concisely captures end-to-end demand uncertainty, it presents a different challenge: dealing with the many demand sets that realize the hose constraints. In other words, if we take the convex polytope of all the demand sets that satisfy the hose constraints, we have a continuous space to deal with inside the polytope. Usually this would be helpful for an optimization problem, as we can use linear programming techniques to solve it effectively. The key difference in this model, however, is that each point within the convex polytope is a single traffic matrix. The long-term network plan must satisfy all of these demand sets if we are to honor the hose constraints. This represents an enormous computational challenge, since designing a cross-layer global production network is already an intensive optimization problem for a single demand set.
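As an illustration of what it means for a traffic matrix to lie inside this polytope, the sketch below checks a matrix against the hose constraints (per-data-center egress and ingress limits) and shows that a convex combination of two admissible matrices is also admissible. The limits, the matrices, and the helpers `satisfies_hose` and `blend` are hypothetical.

```python
from typing import Dict, Tuple

TrafficMatrix = Dict[Tuple[str, str], float]


def satisfies_hose(matrix: TrafficMatrix,
                   egress_limit: Dict[str, float],
                   ingress_limit: Dict[str, float],
                   tol: float = 1e-9) -> bool:
    """True if every site's total outbound and inbound traffic fits its hose."""
    out_total: Dict[str, float] = {}
    in_total: Dict[str, float] = {}
    for (src, dst), volume in matrix.items():
        if volume < 0:
            return False  # demands must be non-negative
        out_total[src] = out_total.get(src, 0.0) + volume
        in_total[dst] = in_total.get(dst, 0.0) + volume
    egress_ok = all(v <= egress_limit.get(s, 0.0) + tol for s, v in out_total.items())
    ingress_ok = all(v <= ingress_limit.get(d, 0.0) + tol for d, v in in_total.items())
    return egress_ok and ingress_ok


def blend(m1: TrafficMatrix, m2: TrafficMatrix, weight: float) -> TrafficMatrix:
    """Convex combination weight*m1 + (1 - weight)*m2, for 0 <= weight <= 1."""
    keys = set(m1) | set(m2)
    return {k: weight * m1.get(k, 0.0) + (1 - weight) * m2.get(k, 0.0) for k in keys}


# Hypothetical hose limits and demand sets, for illustration only.
egress_limit = {"A": 10.0, "B": 10.0, "C": 10.0}
ingress_limit = {"A": 10.0, "B": 10.0, "C": 10.0}
m1 = {("A", "B"): 10.0, ("C", "A"): 5.0}
m2 = {("A", "C"): 10.0, ("B", "A"): 10.0}
m3 = {("A", "B"): 6.0, ("A", "C"): 6.0}  # A sends 12 in total, above its limit

print(satisfies_hose(m1, egress_limit, ingress_limit))
print(satisfies_hose(blend(m1, m2, 0.5), egress_limit, ingress_limit))
print(satisfies_hose(m3, egress_limit, ingress_limit))
```

Since every blend of `m1` and `m2` is itself a valid traffic matrix, a plan cannot simply enumerate the admissible matrices, which is why a small set of reference demands is needed.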
The above reasons drive the need to intelligently identify a few demand sets from this convex polytope that can serve as reference demand sets for the network design problem. When looking for these reference demand sets, we are interested in a few basic properties that they should satisfy:
- They are the demand sets that are likely to drive the need for additional resources in the production network (fiber and equipment).
- If we design the network explicitly for this subset, we want to guarantee with high probability that the remaining demand sets are covered as well.
- The number of reference demand sets should be as small as possible to reduce the state space of the cross-layer network design problem.
To identify these reference demand sets, we use the cuts in the topology and the location (latitude, longitude) of the data centers to gain insight into the maximum flow that can cross a network cut. The following example shows a network cut in a sample topology. This cut divides the topology into two sets of nodes, (1,2,3,4) and (5,6,7,8). To dimension the links on this cut, we only need the traffic matrix that generates the maximum traffic across the cut. All other traffic matrices that send the same or less traffic across this cut can then be carried across it without additional bandwidth.
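For a given cut, the maximum traffic that any hose-compliant matrix can push across it in one direction is bounded by the smaller of the total egress on the sending side and the total ingress on the receiving side, and a matrix that reaches this bound can be built greedily. The sketch below uses the (1,2,3,4) and (5,6,7,8) split from the example; the hose values and the helpers `max_cut_traffic` and `reference_matrix` are hypothetical, and this is not necessarily how the production planner derives its reference matrices.

```python
from typing import Dict, Sequence, Tuple


def max_cut_traffic(side_a: Sequence[int], side_b: Sequence[int],
                    egress: Dict[int, float],
                    ingress: Dict[int, float]) -> Tuple[float, float]:
    """Upper bound on traffic crossing the cut in each direction under the hose."""
    a_to_b = min(sum(egress[n] for n in side_a), sum(ingress[n] for n in side_b))
    b_to_a = min(sum(egress[n] for n in side_b), sum(ingress[n] for n in side_a))
    return a_to_b, b_to_a


def reference_matrix(sources: Sequence[int], sinks: Sequence[int],
                     egress: Dict[int, float],
                     ingress: Dict[int, float]) -> Dict[Tuple[int, int], float]:
    """Greedily build one matrix that attains the sources -> sinks bound."""
    remaining_in = {n: ingress[n] for n in sinks}
    matrix: Dict[Tuple[int, int], float] = {}
    for src in sources:
        supply = egress[src]
        for dst in sinks:
            sent = min(supply, remaining_in[dst])
            if sent > 0:
                matrix[(src, dst)] = sent
                supply -= sent
                remaining_in[dst] -= sent
    return matrix


# Hypothetical hose forecasts for the eight nodes, for illustration only.
egress = {1: 4, 2: 3, 3: 5, 4: 2, 5: 6, 6: 1, 7: 2, 8: 3}
ingress = {1: 3, 2: 3, 3: 4, 4: 2, 5: 5, 6: 2, 7: 2, 8: 4}
west, east = (1, 2, 3, 4), (5, 6, 7, 8)

print(max_cut_traffic(west, east, egress, ingress))  # (13, 12)
worst_case = reference_matrix(west, east, egress, ingress)
print(sum(worst_case.values()))                      # 13
```

Here the west side can send 14 units in total but the east side can absorb only 13, so 13 is the most that any admissible matrix can place on the cut from west to east.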
Note that in a topology with N nodes we can create on the order of 2^N network cuts and have one traffic matrix per cut. However, the geographic nature of these cuts is essential given the planar nature of our topology. It turns out that simple cuts (typically a straight-line cut) are more critical to dimensioning the topology than more complicated cuts. As the following figure shows, a traffic matrix for each of the simple cuts is more meaningful than a traffic matrix for the "complicated" cuts: a complicated cut is already accounted for by a set of simple cuts.
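One way to see how few straight-line cuts there are compared with all possible cuts is to sweep a line across the node coordinates at a handful of angles and record every distinct split. The sketch below does this under simplifying assumptions: coordinates are treated as points on a flat plane (real latitude/longitude would need a projection), the angle sampling is coarse, and the coordinates and the helper `straight_line_cuts` are hypothetical rather than the method used in production.

```python
import math
from typing import Dict, FrozenSet, Set, Tuple


def straight_line_cuts(coords: Dict[int, Tuple[float, float]],
                       num_angles: int = 36) -> Set[FrozenSet[int]]:
    """Enumerate two-sided splits produced by straight lines at sampled angles.

    Each cut is stored once, as the side that contains the lowest-numbered node.
    """
    names = sorted(coords)
    anchor = names[0]
    cuts: Set[FrozenSet[int]] = set()
    for k in range(num_angles):
        # Directions in [0, pi) suffice: the opposite direction yields the
        # same splits with the two sides swapped.
        theta = math.pi * k / num_angles
        dx, dy = math.cos(theta), math.sin(theta)
        # A line perpendicular to (dx, dy) separates nodes by their projection.
        order = sorted(names, key=lambda n: coords[n][0] * dx + coords[n][1] * dy)
        for split in range(1, len(order)):
            left = frozenset(order[:split])
            right = frozenset(order[split:])
            cuts.add(left if anchor in left else right)
    return cuts


# Hypothetical planar coordinates: nodes 1-4 in the west, 5-8 in the east.
coords = {
    1: (0.0, 0.0), 2: (1.0, 0.3), 3: (0.2, 1.1), 4: (1.3, 1.4),
    5: (5.0, 0.1), 6: (6.1, 0.4), 7: (5.2, 1.2), 8: (6.3, 1.5),
}
cuts = straight_line_cuts(coords)
all_two_sided_splits = 2 ** (len(coords) - 1) - 1  # each split counted once

print(len(cuts), "straight-line cuts out of", all_two_sided_splits, "possible splits")
print(frozenset({1, 2, 3, 4}) in cuts)
```

Counting each two-sided split once gives 2^(N-1) - 1 possibilities, which is still exponential in N, whereas the straight-line sweep yields only a small subset that includes the west/east cut from the earlier example.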
By concentrating on the simple cuts, we reduce the reference demand sets to a small number of traffic matrices. We solve for these traffic matrices with a cost optimization model and create a network plan that supports all possible traffic matrices. Based on simulations, we find that, due to the nature of our topology, the additional capacity required to support the hose-based traffic matrices is not significant, while the approach greatly simplifies our planning and operations.
By adopting a hose-based network capacity planning methodology, we have reduced forecasting complexity by an order of magnitude, enabled services to reason about the network like any other consumable unit, and simplified operations by eliminating a significant number of alarms related to traffic surges, because we now track traffic growth in aggregate per data center rather than between each pair of data centers.
Many people contributed to this project, but we'd especially like to thank Tyler Price, Alexander Gilgur, Hao Zhong, Yesing Zhang, Alexander Nikolaidis, Gaya Nagarajan, Steve Politis, Biao Lu, Ying Zhang and Abhinav Triguna for being instrumental in making this project a reality.