Reverse debugging at scale – Facebook Engineering
Let’s say you receive an email notification that a service crashed immediately after you deployed your latest code change. The crash occurs on only 0.1 percent of the servers running the service. But you’re at a large company, so 0.1 percent equals thousands of servers, and the problem is difficult to reproduce. A few hours later you still cannot reproduce it, and you have spent a full day tracking the problem.
This is where reverse debugging comes in. With existing tools, engineers can record a failing (or crashed) program, then rewind and replay it to determine the root cause. For large companies like Facebook, however, these solutions carry too much overhead to be practical in production. So we developed a technique that lets engineers capture a failed run and examine its history to find the root cause without having to rerun the program, which saves a tremendous amount of time. To do this, we efficiently trace CPU activity on our servers and, when a crash occurs, save the process history, which is later rendered in a human-readable format by LLDB, providing everything from instruction-history views to reverse debugging.
Sounds great, but how does it work?
To make this practical, we leverage Intel Processor Trace (Intel PT), which uses dedicated hardware to capture a program’s execution steps cheaply enough that we can use it in production and at scale. However, tracing a program efficiently is only the first step.
Continuous collection and quick storage in production
Since we don’t know in advance when a crash will occur or which process will crash, we continuously collect an Intel PT trace of every running process. To bound the memory required, we store the trace in a ring buffer, in which new data overwrites the oldest data.
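The overwrite behavior can be sketched with a toy model (this is a simplified illustration, not our actual collector; the class and field names are invented for the example):

```python
from collections import deque

class TraceRingBuffer:
    """Fixed-capacity buffer: newest trace packets overwrite the oldest."""

    def __init__(self, capacity):
        self.packets = deque(maxlen=capacity)  # deque drops the oldest entry when full

    def append(self, packet):
        self.packets.append(packet)

    def snapshot(self, pid):
        """Copy out the surviving packets for one process after a crash."""
        return [p for p in self.packets if p[0] == pid]

# Processes A and B write interleaved packets into an 8-slot buffer.
buf = TraceRingBuffer(capacity=8)
for t in range(12):
    pid = "A" if t % 2 == 0 else "B"
    buf.append((pid, t))

# The four oldest packets (t = 0..3) have been overwritten by now.
print(buf.snapshot("A"))  # → [('A', 4), ('A', 6), ('A', 8), ('A', 10)]
```

This is why copy-out latency matters: every new packet written after the crash evicts one more packet of the crashed process’s history.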
When multiple processes (A and B) run at the same time, their trace data is interleaved in the buffer. At t + 8, data from process B begins to overwrite the oldest data (from process A).
On large servers, such as those used to train AI models, Intel PT can generate hundreds of megabytes of data per second, even in its compressed format. When a process crashes, the collector must stop collection and copy out the buffer contents for the crashed process. For that, the operating system has to notify our collector that a process has crashed, which takes time. The longer the delay between the crash and the copy, the more of the crashed process’s data is overwritten by new data from other processes.
To address this, we use an eBPF kernel probe, a program triggered on specific kernel events, to notify our collector of a crash within a few milliseconds, maximizing the amount of crash-relevant information in the trace. We tried several approaches, but none was as fast as this one. An additional design consideration is that crashes often leave the machine in a bad state, which prevents us from analyzing the collected traces on the spot. For this reason, we upload the traces and the corresponding binaries of the crashed processes to our data centers so that they can be analyzed later on another machine.
Decoding and symbolication
How do we turn this data into instructions that can be analyzed? Engineers cannot decipher raw traces; people need context such as source line information and function names. To add that context, we built a component in LLDB that can consume a trace and reconstruct the executed instructions along with their symbols. We recently began open-sourcing this work; you can find it here.
A trace mainly records which branches were taken and which were not. We combine this information with the instructions of the original binary and reconstruct exactly which instructions the program executed. Then, using LLDB’s symbolication stack, we retrieve the corresponding source information and display it legibly to the engineer.
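Conceptually, the decoder replays the binary’s static instruction layout and consumes one taken/not-taken bit from the trace at every conditional branch. Here is a toy sketch of that idea (real Intel PT packets are far more compact and also encode indirect-branch targets; the tiny "program" below is invented for illustration):

```python
# Static "disassembly": address -> (mnemonic, fall-through address, branch target or None)
program = {
    0x10: ("cmp", 0x11, None),
    0x11: ("jne", 0x12, 0x20),  # conditional branch: needs one trace bit
    0x12: ("add", 0x13, None),
    0x13: ("ret", None, None),
    0x20: ("ret", None, None),
}

def decode(start, taken_bits):
    """Reconstruct the executed path from taken/not-taken branch bits."""
    bits = iter(taken_bits)
    addr, path = start, []
    while addr is not None:
        mnemonic, fallthrough, target = program[addr]
        path.append(addr)
        if target is not None:              # conditional: consume one trace bit
            addr = target if next(bits) else fallthrough
        else:
            addr = fallthrough
    return path

print([hex(a) for a in decode(0x10, taken_bits=[True])])   # → ['0x10', '0x11', '0x20']
print([hex(a) for a in decode(0x10, taken_bits=[False])])  # → ['0x10', '0x11', '0x12', '0x13']
```

Because the static code is shared by every run, the trace itself only needs to store the branch decisions, which is what makes the format so compact.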
With this flow, we can convert the raw trace file into a human-readable, symbolized instruction listing.
Symbolizing the instructions is only the first part of LLDB’s job. The next part is reconstructing the function call history. For this, we created an algorithm that walks the instructions and builds a tree of function calls, i.e., which function called which other function, and when. The most interesting part is that a trace can begin at any point in the program’s execution, not necessarily at the beginning.
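The reconstruction can be sketched as follows (a simplified model assuming the instruction stream has already been reduced to call/return events; it also tolerates a trace that starts mid-execution, where returns may pop frames whose calls predate the trace):

```python
def build_call_tree(events):
    """events: list of ("call", name) or ("ret",) tuples in trace order.
    Returns a nested tree of function calls."""
    root = {"name": "<trace start>", "children": []}
    stack = [root]
    for event in events:
        if event[0] == "call":
            node = {"name": event[1], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:  # "ret": may return out of a frame whose call predates the trace
            if len(stack) > 1:
                stack.pop()
    return root

tree = build_call_tree([
    ("call", "main"),
    ("call", "foo"),
    ("ret",),
    ("call", "bar"),
    ("ret",),
    ("ret",),
])
print(tree["children"][0]["name"])                           # → main
print([c["name"] for c in tree["children"][0]["children"]])  # → ['foo', 'bar']
```

With this structure, the stack trace at any point in the history is just the path from that node back to the root.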
Example of a function call tree, in which each vertical segment contains instructions and calls are indicated with arrows.
This tree lets us quickly answer questions such as: what is the stack trace at a given point in the history? That is solved simply by walking up the tree. Figuring out where a backward step should stop, on the other hand, is a more complicated question.
Imagine you are on line 16 of this little piece of code. When stepping backward, you should stop at either line 13 or line 15, depending on whether the if branch was taken. A naive approach would check the instructions in the history one by one until one of these lines is reached, but this can be very inefficient, as a function like foo can contain millions of instructions. Instead, we can use the tree above to move across lines without descending into calls. Unlike the naive approach, traversing the tree is almost trivial.
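A toy sketch of that backward "step over", assuming each history entry has already been annotated with its source line and its call depth from the tree (the data and function names are invented for illustration):

```python
def reverse_step_over(history, index):
    """history: list of (source_line, call_depth) in execution order.
    From history[index], find the previous instruction in the SAME frame,
    skipping entire called-function subtrees (deeper call depths)."""
    line_here, depth = history[index]
    i = index - 1
    while i >= 0:
        line, d = history[i]
        if d < depth:
            return i          # walked backward out of the frame: stop at the caller
        if d == depth and line != line_here:
            return i          # previous source line in the same frame
        i -= 1                # deeper entries (callees like foo) are skipped wholesale
    return None               # trace started after this point: nothing earlier

# Lines 13 and 15 each call foo() (the depth-1 entries); we stand on line 16.
history = [(13, 0), (100, 1), (101, 1), (15, 0), (100, 1), (16, 0)]
print(reverse_step_over(history, 5))  # → 3  (line 15, not foo's internals)
```

In the real implementation the tree lets us jump over a callee in one hop rather than scanning its entries at all; the linear scan here only stands in for that traversal.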
Breakpoint support has also been added. For example, suppose you are in the middle of a debugging session and want to go back in time to the last call to function_a. Then you could do:
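A session might look like the following sketch. These commands are illustrative: `breakpoint set --name` is standard LLDB and `trace load` is part of LLDB’s Intel PT trace support, but the exact reverse-execution syntax varies across LLDB builds, so treat the last command as an approximation rather than exact syntax.

```
(lldb) trace load /path/to/collected-trace.json
(lldb) breakpoint set --name function_a
(lldb) process continue --reverse
```

Execution then moves backward through the recorded history until it hits the most recent call to function_a.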
Ultimately, we plan to incorporate this flow into VSCode at some point for a full visual reverse debugging experience.
Latency analysis and profiling
An execution trace contains more information than other representations of control flow, such as call stacks. In addition, our traces record timing information with high accuracy, so engineers know not only the sequence of executed functions but also their timestamps, enabling a new use case: latency analysis.
Imagine a service that handles various requests and retrieves data through an internal cache. Under certain circumstances the cache must be cleared, and the next fetch takes much longer.
Your server receives many requests, and you want to understand the long tail of latency in your service (e.g., P99). You reach for your profiler and collect sampled call stacks for the two main request types. They look like the graphic below, which does not reveal how different the two requests are, because sampled call stacks only show aggregated numbers. Instead, we need to convert traces into execution-path information and symbolize the instructions along the path.
The image below shows a trace that makes it easy to see what is happening: request B clears the cache before its data is fetched.
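With timestamps attached to call and return events, per-function latencies fall directly out of the trace instead of being averaged away in samples. A sketch of that computation, using an invented event format and timings chosen to mirror the cache scenario:

```python
def span_durations(events):
    """events: (timestamp_us, "call"/"ret", name) tuples in trace order.
    Returns (name, duration_us) for every completed call, innermost first."""
    stack, spans = [], []
    for ts, kind, name in events:
        if kind == "call":
            stack.append((name, ts))
        else:
            opened, start = stack.pop()   # matching call for this return
            spans.append((opened, ts - start))
    return spans

# Request A hits the warm cache; request B clears it first, so its fetch is slow.
trace = [
    (0,  "call", "handle_request_A"),
    (1,  "call", "fetch"),
    (3,  "ret",  "fetch"),
    (4,  "ret",  "handle_request_A"),
    (10, "call", "handle_request_B"),
    (11, "call", "clear_cache"),
    (12, "ret",  "clear_cache"),
    (13, "call", "fetch"),
    (90, "ret",  "fetch"),        # cold fetch is far slower
    (91, "ret",  "handle_request_B"),
]
for name, dur in span_durations(trace):
    print(name, dur)
```

Here the two fetch calls, identical in a sampled call stack, show durations of 2 µs and 77 µs, which is exactly the distinction a latency investigation needs.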
While a debugger helps you step through a trace, a visualization helps you see the bigger picture. For this reason, we are also working on including graphs like the call tree above in our performance analysis tool, Tracery. We are building a tool that can collect this tracing and timing data and combine it with the symbolication produced by LLDB. Our goal is to let developers see all of this information at a glance, then zoom in and out to get the data they need at the level of detail they need.
Remember the scenario at the beginning of this post? Now imagine that you receive an email notification that your service is crashing on 0.1 percent of machines. This time, however, a Reverse Debug on VSCode button appears. You click it, step backward through the program until you find the function call that shouldn’t have happened, or the if branch that shouldn’t have been taken, and fix the bug within minutes.
This kind of technology lets us solve problems at a level we previously thought impossible. Most of our earlier work in this area carried significant overhead that limited its scalability. We are now building debugging tools that interactively and visually help developers solve production and development problems far more efficiently. This is an exciting endeavor that will help us scale even further.