Async stack traces in folly: Introduction

This article was written by Lee Howes and Lewis Baker of Facebook.

Facebook’s infrastructure is widely distributed and relies heavily on asynchronous programming. Most of our services written in C++ are implemented using asynchronous programming frameworks such as folly::Futures and folly::coro. Asynchronous programming is an important tool for scaling a process with a relatively small number of threads to handle a much larger number of simultaneous requests. However, these asynchronous programming models typically have some drawbacks.

One of these drawbacks is that existing debugging and profiling tools, which use stack traces to report what your program is doing, generally produce poor-quality data for asynchronous code.

This is the first in a short series of blog posts about how we use C++ coroutines to improve this situation. This first post provides high-level background; later posts in the series go deep into the technical details of our implementation in folly and of the tools built around it to aid in debugging and profiling coroutine code.

Why asynchronous stack traces?

The library that Facebook has relied on for asynchronous programming for a number of years is Folly’s Futures library, which enables code like the following:

Executor& e = …;
Future<Unit> s = makeSemiFuture().via(&e).thenValue([](auto&&) {
  doWork();
});
doComplexOperation();
Future<Unit> s2 = std::move(s).thenValue([](auto&&) {
  doMoreWork();
});

where the lambda containing doWork will eventually run in the thread pool represented by e.

Asynchronous code, such as the example above, generally involves starting an operation and then attaching a callback or continuation to it, which runs when the operation completes. doComplexOperation can execute on the main thread at the same time as doWork runs on another thread that belongs to the executor. In general, this avoids much of the thread context-switching overhead that you get with thread-per-task parallelism. However, it also means that the callback often executes in a different context from the one that started it, usually from the event loop of an executor.

Under normal circumstances, when code is started on an executor, such as a thread pool, a stack trace represents the path from the thread’s execution loop to the function being executed. Conceptually, the threads in the executor do something like this:

void Executor::run() {
  while (!cancelled) {
    auto t = queue_.getTask();
    t();  // invoke the dequeued task
  }
}

If we run this under a debugger and pause inside doWork, the stack trace looks like this:

- doWork
- Executor::run
- __thread_start

The trace only covers stack frames from the body of doWork up to the run method of the executor. The connection to the calling code and to doMoreWork is lost. This loss of context can make debugging or profiling such code very difficult.

It’s difficult to fix this loss of context in futures code like the example above. With a normal stack trace, when one function calls another, the full call stack is retained for the duration of the call, so any stack walk performed while the function is executing can easily trace back through the calling code without any extra work. With futures, however, the code that starts an operation continues to execute and may unwind its call stack before the continuation attached to the future runs. The calling context is not preserved, and we would effectively have to take a snapshot of the call stack when the operation is launched in order to reconstruct the full call stack later, which adds a lot of runtime overhead. Coroutines provide us with a nesting that makes this cleaner:

Executor& e = …;
auto l = []() -> Task<void> {
  co_await doWork();
  co_await doMoreWork();
};
l().scheduleOn(e).start();

The compiler converts this to something like the futures code above, but structurally there is one big difference: the next continuation is in the same scope as the parent. This makes the idea of a “stack” of coroutines much more meaningful, at least syntactically. Taking advantage of this to help us debug is still a technical challenge.

Coroutines as callbacks

C++ uses a style of coroutine that suspends by returning from a function: we don’t suspend the entire stack, but rather return from the coroutine and save the suspended state separately. This is different from the “fiber” style of coroutine, where the entire stack is suspended. The implementation ends up looking somewhat like a series of callbacks and a hidden linked list of chained coroutine frames, rather than a physical stack of function calls.

The style of coroutines chosen for C++ has great advantages, which we will not go into here, but one disadvantage is that the nested structure, although present in the code, is not directly visible in the stack frames seen by a debugger or profiling tool.

To illustrate this problem, let’s look at the following more complicated folly::coro snippet, for which we want to sample a stack trace while it runs:

void normal_function_1() {
  // …
  // <-- expensive code: sample taken here
}

void normal_function_2() {
  // …
  normal_function_1();
  // …
}

folly::coro::Task<void> coro_function_1() {
  // …
  normal_function_2();
  // …
  co_return;
}

folly::coro::Task<void> coro_function_2() {
  // …
  co_await coro_function_1();
  // …
}

void run_application() {
  // …
  folly::coro::blockingWait(
      coro_function_2().scheduleOn(folly::getGlobalCPUExecutor()));
  // …
}

int main() {
  run_application();
}

Currently, if the profiler captures a sample while code is executing inside normal_function_1(), it might display a stack trace that looks something like this:

- normal_function_1
- normal_function_2
- coro_function_1
- std::coroutine_handle::resume
- folly::coro::TaskWithExecutor::Awaiter::await_suspend::$lambda0::operator()
- folly::Function::operator()
- folly::CPUThreadPoolExecutor::run
- std::thread::_invoke
- __thread_start

Note that coro_function_1 appears in the trace but its caller coro_function_2 does not. The trace goes from coro_function_1 straight into internal details of the framework and the executor. We cannot see from this stack trace that coro_function_1() is called by coro_function_2(), which in turn is called by run_application() and main().

Also, if there are multiple call sites for coro_function_1(), a sampling profiler is likely to merge the samples from all of those call sites, making it difficult to determine which call sites are expensive. This in turn makes it hard to decide where to focus your efforts when looking for performance optimizations.

Ideally, both profiling tools and debuggers could capture the logical stack trace instead. The result would show the relationship between coro_function_1 and its caller coro_function_2, and we would end up with a stack trace for this example that looks more like this:

- normal_function_1
- normal_function_2
- coro_function_1
- coro_function_2
- blockingWait
- run_application
- main

Folly support for asynchronous stack traces

folly now implements a number of tools to support asynchronous stack traces for coroutines. The library provides basic hooks used by internal code profiling libraries. The same hooks provide access to stack traces for debugging purposes.

These are briefly summarized here and we will go into more detail in a later post.

Printing asynchronous stack traces when the program crashes

Probably the most common place developers see stack traces is when programs crash. The Folly library already offers a signal handler that outputs the stack trace of the thread that crashes the program. The signal handler now prints the asynchronous stack trace if a coroutine is active in the current thread when it crashes.

Printing of asynchronous stack traces on demand

During development, developers often ask themselves: What series of function calls resulted in this function being called? folly provides a handy feature to easily print the asynchronous stack trace on demand, so developers can quickly see how a function or coroutine is called:

#include <iostream>
#include <folly/experimental/symbolizer/Symbolizer.h>

folly::coro::Task<void> co_foo() {
  std::cerr << folly::symbolizer::getAsyncStackTraceStr() << std::endl;
  co_return;
}

GDB extension for printing asynchronous stack traces

C++ developers often have to use debuggers like GDB to debug crashes after the fact or to investigate faulty behavior in running programs. We recently implemented a GDB extension to easily print the asynchronous stack trace for the current thread from the debugger:

# Print the asynchronous stack trace for the current thread
(gdb) co_bt
0x… in crash() ()
0x… in co_funcC() [clone .resume] ()
0x… in co_funcB() [clone .resume] ()
0x… in co_funcA() [clone .resume] ()
0x… in main ()
0x… in folly::dependent_task() ()

Track where exceptions are thrown

We want to be able to tell where an exception was thrown from when it is later caught. To meet this need, folly provides an exception tracer library that can hook into the Itanium C++ ABI for exception handling to track exception information. We recently expanded the helper functions of this library to make it easy to keep track of where exceptions are thrown, including both normal and asynchronous stack traces. Below is a sample program that uses these helpers:

folly::coro::Task<void> co_funcA() {
  try {
    co_await co_funcB();
  } catch (const std::exception& ex) {
    // Prints where the exception was thrown
    std::cerr << "what(): " << ex.what() << ", "
              << folly::exception_tracer::getAsyncTrace(ex) << std::endl;
  }
}

Conclusion

This new functionality brings coroutines much closer to the level of debugging and tracing support we see with normal stacks. Facebook developers are already benefiting from these changes, and they are open source in the Folly library for anyone to use.

In the following posts in this series, we’ll go into more detail on how this is implemented. Next week we’ll discuss the differences between synchronous and asynchronous stack traces and the technical challenges involved in implementing traces on top of C++20 coroutines. Stay tuned!
