Twine: A ubiquitous shared infrastructure
What the research is:
Twine is our homegrown cluster management system and has been running in production for a decade. A cluster management system assigns workloads to machines and manages the life cycle of machines, containers, and workloads. Kubernetes is a prominent open source example of a cluster management system. Twine has helped convert our infrastructure from a collection of siloed pools of customized machines dedicated to individual workloads into a ubiquitous, large-scale shared infrastructure in which any machine can run any workload.
We had extensive discussions with colleagues across the industry and found that, while partial workload consolidation is common, no large organization has achieved a ubiquitous shared infrastructure. To reach this goal, we made several unconventional choices, including:
- We scale a single Twine control plane to manage one million machines across data centers in a geographic region, while providing high reliability and performance guarantees.
- We support workload-specific customizations that allow different workloads to run on a shared infrastructure without affecting performance.
- We developed an interface called TaskControl that allows workloads to collaborate with the Twine control plane in handling software rollouts, kernel upgrades, and other machine life cycle operations.
- We use power-efficient small machines as a universal compute platform to achieve higher performance per watt. Moreover, we are converging our fleet onto a single small machine type with one CPU and 64 GB of RAM, rather than offering a variety of machine types with high memory or high CPU counts. Using small machines across our fleet yields 18 percent power savings and 17 percent savings in total cost of ownership.
We created twshared, our shared compute pool, in 2013, but adoption was limited for the first six years. We restarted the adoption effort in 2018 and extended Twine with the new features highlighted in the figure. As of October 2020, 56 percent of our fleet was hosted in twshared, up from 15 percent in January 2019. We expect all compute services (about 85 percent of our fleet) to be hosted in twshared by early 2022, while the remaining 15 percent will run in a separate shared storage pool. twshared has become our ubiquitous compute pool, as all new compute capacity is provisioned only in twshared.
How it works:
The following features have allowed us to rapidly consolidate diverse workloads onto twshared.
Scaling a single Twine control plane to manage up to one million machines in a region:
In many cluster management systems, machines are statically assigned to clusters, and workloads are pinned to clusters. Isolated small clusters create stranded capacity and operational burden because workloads cannot easily migrate across clusters. Twine abandons the concept of clusters and uses a single control plane to manage one million machines across all data centers in a geographic region. Twine can migrate workloads across data centers seamlessly, without manual intervention. To scale to one million machines, all Twine components are sharded and scaled out independently, avoiding any central bottleneck. In contrast to federation approaches such as Kubernetes Federation, Twine scales natively without an extra layer of indirection.
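To make the sharding idea concrete, here is a minimal sketch (hypothetical; names, shard counts, and the hashing scheme are illustrative assumptions, not Twine's actual design) of how a region-wide control plane can partition machines across independent scheduler shards so that no single component has to track every machine:

```python
# Hypothetical sketch: shard machines across independent schedulers by
# hashing the machine ID, so the control plane has no central bottleneck
# and no per-cluster boundaries -- any machine in the region is reachable.
import hashlib

NUM_SHARDS = 16  # each shard scales independently; grow by adding shards


def shard_for(machine_id: str) -> int:
    """Deterministically map a machine to a scheduler shard."""
    digest = hashlib.sha256(machine_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Machines from every data center land in some shard, so workloads can be
# placed anywhere in the region without static cluster assignments.
machines = [
    f"dc{dc}-rack{rack}-host{host}"
    for dc in range(3)
    for rack in range(4)
    for host in range(2)
]
shards: dict[int, list[str]] = {}
for machine in machines:
    shards.setdefault(shard_for(machine), []).append(machine)
```

Because the mapping is deterministic and local, each shard can be scaled, upgraded, or restarted independently of the others, which is the property the blog post attributes to Twine's disaggregated components.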
Collaborative workload lifecycle management:
Cluster management systems rarely consult an application about lifecycle management operations, which makes it harder for the application to maintain its availability. In one example, an application might be restarted unknowingly while another replica of its data is being rebuilt, rendering the data unavailable. In another example, a cluster management system may not know ZooKeeper’s preference to update followers first and the leader last during a software release, in order to minimize leader failovers. Twine provides a novel TaskControl API that allows applications to collaborate with Twine in handling container lifecycle events that affect availability. For example, an application’s task controller can coordinate with Twine to precisely control the order and timing of container restarts.
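The ZooKeeper example above can be sketched as a tiny task controller. This is an illustrative toy, not the real TaskControl API (the function and field names here are assumptions): the control plane proposes a set of container restarts, and the application decides the order.

```python
# Hypothetical task-controller sketch: given tasks the control plane wants
# to restart, approve followers first and the leader last, so a rollout
# causes at most one leader failover (ZooKeeper's preferred order).
def order_restarts(tasks: list[dict]) -> list[str]:
    """Return task IDs in the order the application wants them restarted."""
    followers = [t["id"] for t in tasks if t["role"] == "follower"]
    leaders = [t["id"] for t in tasks if t["role"] == "leader"]
    return followers + leaders  # leader goes last


ensemble = [
    {"id": "zk-1", "role": "follower"},
    {"id": "zk-2", "role": "leader"},
    {"id": "zk-3", "role": "follower"},
]
plan = order_restarts(ensemble)  # ["zk-1", "zk-3", "zk-2"]
```

The key design point is the inversion of control: the cluster manager still drives the rollout, but availability-sensitive decisions such as ordering and pacing are delegated to application code.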
Hardware and operating system customization atop the shared infrastructure:
Our fleet runs thousands of different applications. On the one hand, a shared infrastructure favors standard machine configurations so that a machine can easily be repurposed across applications. On the other hand, applications benefit significantly from customizing hardware and operating system settings. For example, our web tier achieves 11 percent higher throughput by tuning operating system kernel settings such as huge pages and CPU scheduling. We resolve the conflict between host-level customization and machine sharing in a common infrastructure through host profiles. A host profile captures the hardware and operating system settings that a workload can use to improve performance and reliability. By fully automating the process of assigning machines to workloads and switching host profiles accordingly, we can perform fleet-wide optimizations (e.g., swapping machines across workloads to eliminate network or power-supply hotspots) without compromising the performance or reliability of the affected workloads.
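To illustrate the host-profile concept, here is a minimal sketch. The profile names, settings, and values below are assumptions for illustration only, not Facebook's actual profiles: each profile bundles the kernel knobs a workload class needs, and switching a machine between workloads means computing and applying the settings that differ.

```python
# Hypothetical host profiles: each bundles kernel settings for a workload
# class. The web profile here assumes aggressive huge pages and a shorter
# scheduling granularity; the values are illustrative, not real tunings.
HOST_PROFILES = {
    "web": {
        "transparent_hugepage": "always",
        "sched_min_granularity_ns": 2_000_000,
    },
    "default": {
        "transparent_hugepage": "madvise",
        "sched_min_granularity_ns": 10_000_000,
    },
}


def settings_to_change(current: str, target: str) -> dict:
    """Return only the knobs that differ between two profiles, so a
    profile switch touches the minimum number of settings."""
    cur, tgt = HOST_PROFILES[current], HOST_PROFILES[target]
    return {key: val for key, val in tgt.items() if cur.get(key) != val}


# Reassigning a machine from the default pool to the web tier:
delta = settings_to_change("default", "web")
```

Keeping profiles declarative like this is what makes the fleet-wide swaps described above automatable: the control plane can reassign a machine and mechanically derive the exact settings to flip, with no per-machine manual tuning.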
Why it matters:
Many large organizations run diverse workloads on shared infrastructure because economies of scale reduce hardware, development, and operational costs. However, owing to the limited scalability of existing cluster management systems and their inability to support workload-specific customization on shared infrastructure, few organizations have achieved the goal of hosting all workloads on a ubiquitous shared infrastructure. Twine overcomes these challenges. We hope that sharing our experience in developing Twine will advance the state of the art in cluster management and help other organizations make further progress in consolidating onto shared infrastructure.
Further information can be found in our OSDI 2020 presentation.
Read the full paper:
Twine: A unified cluster management system for shared infrastructure