Improving the performance of large-scale applications via basic block reordering

What the research is:

At Meta, we develop compilers to optimize the performance of our large-scale applications running in data center environments. Profile-guided optimization (PGO) is an important step in modern compilers for improving the performance of large applications based on their runtime behavior. The technique uses program execution profiles, such as the execution frequencies of binary functions or individual instructions, to guide compilers to optimize the critical paths of a program more selectively and effectively.

Basic block reordering is one of the most effective PGO techniques. As part of the compilation process, a binary is broken down into smaller chunks known as basic blocks. Many of these basic blocks are executed sequentially, but some end in conditional branches (control-flow statements such as if-then-else, while, and switch statements) where execution can jump to two or more other blocks. Depending on the relative frequency of these jumps, some arrangements of basic blocks result in fewer CPU instruction cache misses and, therefore, faster execution.

The source code on the left is converted to the control-flow graph in the middle. The basic blocks that make up this graph are laid out in memory as shown on the right.

Profiling is used to gather information about how an application typically runs. It tells us how many times each basic block was executed and how often each branch was taken. Given this information, the job of a compiler is to produce the most CPU-friendly arrangement of basic blocks, the one resulting in the best performance of the binary.
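To make this concrete, here is a minimal sketch in Python of how profiled jump counts can distinguish a good layout from a bad one. The block names and counts are hypothetical, and counting fall-throughs is only a crude stand-in for the richer cost model discussed below.

```python
# Hypothetical profile: (source block, target block) -> number of times
# the jump was taken during profiling.
jump_counts = {
    ("entry", "hot"):  9_900,  # this branch is almost always taken
    ("entry", "cold"):   100,  # this one is rarely taken
    ("hot",   "exit"): 9_900,
    ("cold",  "exit"):   100,
}

def fallthrough_executions(layout):
    """Count profiled jumps that become fall-throughs, i.e., where the
    source block is placed immediately before the target in memory."""
    position = {block: i for i, block in enumerate(layout)}
    return sum(count
               for (src, dst), count in jump_counts.items()
               if position[dst] == position[src] + 1)

# Source order vs. a profile-guided order keeping the hot path contiguous.
print(fallthrough_executions(["entry", "cold", "hot", "exit"]))  # 10000
print(fallthrough_executions(["entry", "hot", "exit", "cold"]))  # 19800
```

Nearly twice as many profiled jumps become cheap fall-throughs once the hot path is laid out contiguously.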

Traditional compiler approaches to basic block reordering optimize a specific dimension of CPU performance, such as instruction cache line utilization or branch prediction. However, we have found that such arrangements can produce suboptimal results. To address these shortcomings, we proposed a new block reordering model that combines multiple effects and better predicts the performance of an application. Our model is based on a new optimization problem that we call the Extended Traveling Salesman Problem, or Ext-TSP for short.

How it works:

Given a collection of cities and the distances between each pair of them, the classic Traveling Salesman Problem (TSP) is to find an order in which to visit the cities that minimizes the total distance traveled. There are many variations of this problem, such as MAX-TSP, where we instead want to maximize the total distance traveled. Ext-TSP is a generalization of the latter problem, where we collect a reward not only for pairs of neighboring cities in the order, but also for pairs of cities that are close enough in the order, say, no more than a fixed number of positions apart.
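In symbols (our own notation, not taken verbatim from the papers), the objective can be written as follows:

```latex
% Ext-TSP objective, stated in our own notation. Given pairwise weights
% w(u,v) and a window size k, find an ordering \pi that maximizes:
\max_{\pi} \sum_{(u,v)} w(u,v) \cdot f\bigl(d_\pi(u,v)\bigr)
% where d_\pi(u,v) is the number of positions separating u and v in \pi,
% and f is non-increasing with f(1) = 1 and f(d) = 0 for d > k.
% Classic MAX-TSP is the special case k = 1: only adjacent pairs score.
```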

In the context of basic block reordering, the blocks play the role of the cities, and the jump counts play the role of the distances between pairs of cities. The order corresponds to the arrangement of the basic blocks in memory. If two basic blocks are placed in close proximity, there is a good chance that jumping from one to the other will not cause an instruction cache miss. In this sense, the Ext-TSP objective aims to optimize the utilization of the instruction cache, and thus the performance of the application.
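A simplified scoring function along these lines might look as follows. The weights, decay shape, and window size are illustrative placeholders in the spirit of the published model, not the exact parameters used in BOLT, and the block sizes it expects are hypothetical.

```python
FALLTHROUGH_WEIGHT = 1.0  # jump straight to the next instruction
NEAR_JUMP_WEIGHT = 0.1    # short jump, likely still in the cache
WINDOW = 1024             # beyond this many bytes, assume no benefit

def ext_tsp_score(layout, block_size, jump_counts):
    """Score a layout: each profiled jump contributes its count, scaled
    by how close its source and target end up in memory."""
    # Byte address at which each block starts under this layout.
    addr, offset = {}, 0
    for block in layout:
        addr[block] = offset
        offset += block_size[block]

    score = 0.0
    for (src, dst), count in jump_counts.items():
        if src not in addr or dst not in addr:
            continue  # the jump leaves this (possibly partial) layout
        # Distance from the end of src to the start of dst.
        distance = abs(addr[dst] - (addr[src] + block_size[src]))
        if distance == 0:              # fall-through: full reward
            score += count * FALLTHROUGH_WEIGHT
        elif distance <= WINDOW:       # nearby jump: decaying reward
            score += count * NEAR_JUMP_WEIGHT * (1 - distance / WINDOW)
    return score
```

With the profile from the earlier sketch and any plausible block sizes, the layout that keeps the hot path contiguous again scores far higher than the source order.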

Our paper “Improved Basic Block Reordering” introduces the new optimization problem. It shows that finding the best order is NP-hard, but also that there is an efficient greedy heuristic that provides good solutions for the instances that typically arise from real-world binaries. In addition, we describe a mixed-integer programming formulation of the optimization problem, which is able to find optimal solutions for small functions. Our experiments with this exact method show that the newly proposed heuristic finds an optimal arrangement of basic blocks in over 98 percent of real-world instances. On the practical side, the new basic block reordering has been implemented in BOLT, an open source binary optimization and layout tool developed by Meta. An extensive evaluation on a wide variety of real-world data center applications shows that the new approach outperforms existing block reordering techniques and improves the resulting performance of applications with large code sizes.
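The greedy heuristic itself can be sketched in a few lines. The actual implementation in BOLT is considerably more elaborate (it bounds chain sizes, caches score deltas, and considers several ways of combining chains), but the skeleton, reusing the hypothetical ext_tsp_score from above, looks roughly like this:

```python
def greedy_layout(blocks, block_size, jump_counts, entry):
    """Chain merging in the spirit of the paper's greedy heuristic:
    repeatedly merge the pair of chains with the largest score gain."""
    chains = [[b] for b in blocks]

    def score(chain):
        return ext_tsp_score(chain, block_size, jump_counts)

    while len(chains) > 1:
        best_gain, best_pair = 0.0, None
        for i, a in enumerate(chains):
            for j, b in enumerate(chains):
                if i == j:
                    continue
                # Gain from placing chain b directly after chain a.
                gain = score(a + b) - score(a) - score(b)
                if gain > best_gain:
                    best_gain, best_pair = gain, (i, j)
        if best_pair is None:  # no concatenation improves the score
            break
        i, j = best_pair
        merged = chains[i] + chains[j]
        chains = [c for k, c in enumerate(chains) if k not in (i, j)]
        chains.append(merged)

    # Place the chain containing the function entry block first.
    chains.sort(key=lambda c: entry not in c)
    return [block for chain in chains for block in chain]
```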

This image shows a control-flow graph and two block layouts, one maximizing the number of fall-through jumps and the other maximizing the Ext-TSP objective.

The theory behind Ext-TSP

Given how extensively the original TSP problem has been studied, we wanted to understand how much more difficult Ext-TSP is compared with classic TSP. In our paper “On the Extended TSP Problem,” we examine Ext-TSP from a mathematical point of view and prove both negative and positive results about the problem.

On the negative side, it turns out that Ext-TSP is much harder than classic TSP. We prove that, assuming the exponential time hypothesis, there is no efficient algorithm for solving the problem optimally, even on very simple tree-like instances, such as those arising from simple programs without loops. This is quite surprising, since most optimization problems (including classic TSP) do admit efficient algorithms on trees.

On the positive side, we design so-called approximation algorithms, which are efficient and deliver a solution that is guaranteed to be worse than the optimal one by at most a fixed factor. Given the aforementioned impossibility of efficient optimal algorithms, such approximation algorithms are the best we can hope for.

Why it matters:

Developing new compiler technology to optimize the performance of our servers is an important research area for Meta. Faster applications mean that less computing power is needed to serve our users, which translates directly into lower energy consumption in our data centers and a smaller environmental footprint for our operations.
