Jump-Start: Boosting VM efficiency – Facebook Engineering
What the research is:
Jump-Start is a new approach to improving the performance of virtual machines at scale. Virtual machines are a popular, modern design for implementing the programming languages used to build applications, including large websites such as Facebook and Instagram. However, virtual machines incur a known performance overhead in memory and CPU usage, especially during the application’s warm-up phase, when the code is profiled and translated from the virtual machine’s abstract language to real machine code by a just-in-time (JIT) compiler. We developed the Jump-Start technique to reduce this overhead during the warm-up phase.
Jump-Start has been implemented in the HipHop Virtual Machine (HHVM), which powers not only Facebook.com but many other websites as well. HHVM Jump-Start has been deployed in our data centers. Our analysis shows that it reduces HHVM’s warm-up overhead for Facebook’s apps and websites by 54.9 percent. Jump-Start has also improved the steady-state performance of our website by 5.4 percent, that is, performance even after HHVM has warmed up. Jump-Start is not only the first technique to address the warm-up overhead of virtual machines at large scale, but also the first to improve steady-state performance.
How it works:
Like many advanced JIT compilers, the HHVM JIT compiles code twice to produce high-performance machine code: first to collect profile data about the application’s behavior, and then to produce optimized code that uses that profile data. While this approach yields much better steady-state performance, it also incurs significant warm-up overhead, because the code is compiled twice and the JIT must wait for profile data to be collected.
Jump-Start uses our phased rollout process to significantly reduce HHVM’s warm-up overhead. To keep Facebook running smoothly while adding new features, updates are deployed every few hours. Each time, our global fleet of web servers (running HHVM) is restarted in three phases. In the first phase (C1), we restart a very small fraction of the servers. In the second phase (C2), we restart about 2 percent of the servers. Finally, in phase C3, we restart the remaining servers. This gradual rollout is designed to provide enough signal about the health of the new version that we can pause the rollout, if necessary, before the update reaches the entire server fleet.
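The three-phase restart can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the deployment system itself: the function names, the `restart` and `healthy` callbacks, and the 0.1 percent size used for C1 are all hypothetical (the text says only that C1 is "very small"); the 2 percent size for C2 comes from the description above.

```python
from typing import Callable, Dict, List, Sequence


def plan_rollout(
    servers: Sequence[str],
    c1_fraction: float = 0.001,  # assumption: the post only says "very small"
    c2_fraction: float = 0.02,   # "about 2 percent", per the post
) -> Dict[str, List[str]]:
    """Split the fleet into the three restart phases C1, C2, and C3."""
    n = len(servers)
    c1_size = max(1, round(n * c1_fraction)) if n else 0
    c2_size = max(1, round(n * c2_fraction)) if n > c1_size else 0
    return {
        "C1": list(servers[:c1_size]),
        "C2": list(servers[c1_size:c1_size + c2_size]),
        "C3": list(servers[c1_size + c2_size:]),  # everything that remains
    }


def run_rollout(
    plan: Dict[str, List[str]],
    restart: Callable[[str], None],
    healthy: Callable[[List[str]], bool],
) -> List[str]:
    """Restart phase by phase and pause as soon as a phase looks unhealthy.

    Returns the names of the phases that completed successfully.
    """
    completed: List[str] = []
    for phase in ("C1", "C2", "C3"):
        for server in plan[phase]:
            restart(server)
        if not healthy(plan[phase]):
            break  # pause before the update reaches later phases
        completed.append(phase)
    return completed
```

With a hypothetical fleet of 1,000 servers, this plan restarts 1 server in C1, 20 in C2, and the remaining 979 in C3. If the health check fails after C2, the 979 servers in C3 are never touched.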
Jump-Start takes advantage of Facebook’s gradual rollout to avoid some of HHVM’s overhead during warm-up. In particular, the profile data collected by the servers in C2 is shared with all servers in C3. This allows the vast majority of our web servers to skip both compiling the profiling code and running that code to collect profile data. Because only a small fraction of the server fleet bears the overhead of JIT profiling, Jump-Start also allows the application to be profiled more thoroughly. This enabled us not only to make HHVM’s existing profile-guided optimizations more effective, but also to add new optimizations that further improved HHVM’s steady-state performance.
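A toy model may help make the profile-sharing idea concrete. It is a sketch under loud assumptions: here a "profile" is just a per-function call count serialized as JSON, and "optimizing" means selecting functions above a hypothetical threshold. HHVM’s real profile data and serialization format are not described in this post, and every name below is invented for illustration.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def collect_profile(requests: Iterable[str]) -> Dict[str, int]:
    """C2 role: run instrumented code and count calls per function."""
    return dict(Counter(requests))


def serialize_profile(profile: Dict[str, int]) -> str:
    """Package a profile so it can be shipped to other servers."""
    return json.dumps(profile, sort_keys=True)


def hot_functions(profile: Dict[str, int], threshold: int) -> List[str]:
    """Pick the functions worth compiling with full optimization."""
    return sorted(name for name, count in profile.items() if count >= threshold)


def warm_up(
    requests: Iterable[str],
    shared_profile: Optional[str] = None,
    threshold: int = 2,  # hypothetical cutoff for "hot" code
) -> Tuple[List[str], bool]:
    """Return (functions to optimize, whether local profiling was needed).

    A C3 server passes `shared_profile` and skips profiling entirely;
    a C2 server has no shared profile and must profile its own traffic.
    """
    if shared_profile is not None:
        return hot_functions(json.loads(shared_profile), threshold), False
    return hot_functions(collect_profile(requests), threshold), True


# A C2 server profiles its own traffic, then publishes the result.
c2_traffic = ["render", "render", "auth", "render", "auth", "log"]
blob = serialize_profile(collect_profile(c2_traffic))

# A C3 server reuses the published profile and never profiles locally.
optimized, profiled_locally = warm_up([], shared_profile=blob)
```

In this model the C3 server arrives at the same set of optimized functions as the C2 server without compiling or running any profiling code, which is the part of warm-up that Jump-Start lets most of the fleet skip.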
Why it matters:
The Jump-Start technique has significantly improved both the warm-up and the steady-state performance of HHVM. These improvements have enabled Facebook to deploy continuously, which has both improved developer productivity and increased the speed at which new features reach Facebook users. By improving HHVM’s performance during warm-up, Jump-Start also reduces the latency observed by Facebook users and enables seamless website updates. By improving HHVM’s steady-state performance, Jump-Start improves efficiency and reduces the footprint of the server fleet that powers Facebook. Although this work describes and evaluates the Jump-Start technique only in the context of HHVM, the same approach can be used to improve the performance of other virtual machines.
Read the full paper:
HHVM Jump-Start: Boosting Both Warmup and Steady-State Performance at Scale