Forming an async stack from particular person frames

This article was written by Lee Howes and Lewis Baker from Facebook.

This is the third in a series of posts covering how we have used C++ coroutines at Facebook to regain stack traces for dependent chains of asynchronous waiting tasks. In the previous post in the series, we looked at the difference between synchronous and asynchronous stacks and the data structures we can use to chain them. In this post we’ll look at how to tie those data structures together into a chain.

Obtaining the return-address of an async-stack frame

The main goal in implementing async-stack traces is to allow us to produce a call-stack of instruction pointers that we can later map back to function names and ideally file and line-numbers using debug information for the binary.

We need some way to be able to determine, for each async stack frame, what the calling coroutine’s return address is. For folly::coro::Task objects, this will be some instruction that forms part of the co_await expression that launched the task.

In theory, the address of the continuation of the awaiting coroutine is already being stored in the awaiting coroutine’s coroutine-frame when it is suspended since the coroutine needs to know where to resume execution when we eventually call std::coroutine_handle::resume(). However, the way that the compiler currently encodes this information makes it inaccessible to external tools.

Let’s recap how Clang currently stores the resumption address for a coroutine. Imagine we have the following coroutine:

folly::coro::Task foo()
{ // implicit ‘co_await promise.initial_suspend();’
co_await something();
co_await something_else();
co_return 42;
} // implicit ‘co_await promise.final_suspend();’

When the compiler lowers this coroutine, it will synthesize a coroutine frame type that looks something like this:

struct __foo_coro_frame {
using promise_type =
typename std::coroutine_traits>::promise_type;
void(*resumeFn)(void*);
void(*destroyFn)(void*);
promise_type promise;
int suspendPoint;
// … other data
};

When the coroutine is first invoked it creates the coroutine frame and initializes the resumeFn and destroyFn members to point to the resume and destroy functions. Each ‘co_await’ expression in the coroutine (either implicit or explicit) is assigned a suspend-point index. Whenever the coroutine suspends at one of these suspend points it writes the corresponding suspend-point index to the suspendPoint member of the frame.

When the coroutine is resumed by calling coroutine_handle::resume(), this ends up invoking the resumeFn function pointer and passes a pointer to the coroutine frame as the argument. The resume implementation reads the suspendPoint member and uses the result as the operand to a switch block that jumps to the continuation of the corresponding suspend-point.

For some coroutines this switch is encoded as a jump table, from which we could in theory be able to lookup the continuation address if we knew how to find the jump-table address. This approach would likely require some compiler changes to encode it in a standard way. For other coroutines with only a handful of suspend points the compiler often lowers the switch to a series of conditional branches.

See this example in Compiler Explorer (https://godbolt.org/z/994b4M). A coroutine with two suspend-points:

detached_task small_coro() {
some_awaitable a;
co_await a;
co_await b;
}

has its resume function compiled to conditional branches

small_coro() [clone .resume]:
mov rsi, rdi
cmp byte ptr [rdi + 17], 0
je .LBB2_1
mov rdi, rsi
jmp operator delete(void*) # TAILCALL
.LBB2_1:
mov byte ptr [rsi + 17], 1
lea rdi, [rsi + 18]
jmp some_awaitable::await_suspend(std::coroutine_handle) # TAILCALL

Whereas a coroutine with four or more suspend points:

detached_task large_coro() {
some_awaitable a;
co_await a;
co_await a;
co_await a;
co_await a;
}

is compiled to a jump table:

large_coro() [clone .resume]: # @large_coro() [clone .resume]
mov rsi, rdi
movzx eax, byte ptr [rdi + 17]
jmp qword ptr [8*rax + .LJTI5_0]
.LBB5_1:
mov byte ptr [rsi + 17], 1
lea rdi, [rsi + 18]
jmp some_awaitable::await_suspend(std::coroutine_handle) # TAILCALL
.LBB5_3:
mov byte ptr [rsi + 17], 3
lea rdi, [rsi + 18]
jmp some_awaitable::await_suspend(std::coroutine_handle) # TAILCALL
.LBB5_4:
mov rdi, rsi
jmp operator delete(void*) # TAILCALL
.LBB5_2:
mov byte ptr [rsi + 17], 2
lea rdi, [rsi + 18]
jmp some_awaitable::await_suspend(std::coroutine_handle) # TAILCALL
.LJTI5_0:
.quad .LBB5_1
.quad .LBB5_2
.quad .LBB5_3
.quad .LBB5_4

We won’t know which strategy the compiler took unless we look at the assembly for the resume function. This is not something we want to be doing when walking the stack. This makes it infeasible to try to use the compiler’s existing representation of the suspend-point of the coroutine to calculate the return-address.

One approach could be to simply use the resume-function address itself. This would allow us to identify which function this coroutine-frame corresponded to but would not allow us to identify the particular suspend-point / line-number within that function. For some use-cases that may be enough but we’re hoping to get a solution that will also give us the line numbers.

Another approach, which could be explored in future, is to modify the compiler to avoid using the suspendPoint integer index altogether and to instead just update the resumeFn and destroyFn pointers directly. We could simply read the resumeFn field to obtain the return-address when tracing the stack, and at resume-time would also avoid a double dynamic-dispatch (indirect call to resumeFn + the switch on suspendPoint). This change would come at the cost of having to write 2 pointers when suspending a coroutine, possibly involving extra instructions to load or calculate the pointer values, instead of a write of a constant to a single 16/32-bit field. Whether or not this is a net improvement would have to be evaluated.

For now we are looking for a solution that doesn’t require changes to the compiler and as it does not seem viable to use the existing state within the coroutine, we need to add additional state and populate it with the return-address. To do this, we add a returnAddress field to the AsyncStackFrame structure we have already added to the coroutine promise:

namespace folly {
struct AsyncStackFrame {
AsyncStackFrame* parentFrame;
void* returnAddress;
// … other fields
};
}

Before we start executing the coroutine that we initialise the returnAddress member of its promise’s AsyncStackFrame to an address that maps to the line number of the co_await expression that launched it.

To understand how we obtain the value for this return address we first need to understand how the compiler translates a co_await expression. This is covered in detail in the blog post Understanding operator co_await, but to summarise, whenever you write:

co_await someTask

This expands into roughly:

decltype(auto) operand = someTask;
decltype(auto) awaitable = promise.await_transform(operand);
decltype(auto) awaiter = awaitable.operator co_await();
if (!awaiter.await_ready()) {
// suspend-point (writes to frame->suspendPoint field)
auto handle = awaiter.await_suspend(
coroutine_handle::from_promise(promise));
return handle.resume();
// resume-point (execution will resume here)
}
awaiter.await_resume();

There are many possible variations on this depending on the types involved but the differences are not important for our purposes. The key point we want to hook into here is the await_suspend() method, which is called immediately after the coroutine is suspended and before resume() is called on the coroutine_handle that we are transferring execution to.

Note that the await_suspend() method is being called from the awaiting coroutine as part of evaluating the co_await expression. This means that if we are able to obtain the return address of the call to await_suspend() then this address should point to instructions whose debug info should map back to the file and line number of the co_await expression.

We can make use of a special compiler intrinsic offered by Clang to obtain the return address of the current function: __builtin_return_address(). This intrinsic takes an integral argument and will obtain the return-address of the nth stack-frame. We pass 0 to obtain the current function’s return-address.

To populate the return address field we can do something like the following:

template
auto Task::Awaiter::await_suspend(std::coroutine_handle<> continuation) {
coro_.promise().getAsyncFrame().returnAddress = __builtin_return_address(0);
// … rest of the await_suspend logic
return coro_;
}

One thing to be aware of is that if this method is inlined then the __builtin_return_address() intrinsic will evaluate to the return-address of the function that it was inlined into instead of the return address of await_suspend(), giving us the wrong result – most likely the address of the function that called coroutine_handle::resume() instead of the address of the co_await expression.

To prevent this from happening, we need to prevent the await_suspend() method from being inlined into the coroutine body. This can be done by marking the await_suspend() method as __attribute__((noinline)). Inhibiting inlining of this method may have a small performance impact but may also help reduce binary size. An intrinsic that obtained the continuation address in such a way that was inlining-aware would be another possible avenue for further compiler work if we determine that inlining here would be a worthwhile performance improvement.

Hooking up the stack-frames in a chain

We’ve already looked at how we represent the linked list of AsyncStackFrame objects and how we populate each frame’s returnAddress, but we also need to update the parentFrame pointer to have a child coroutine’s AsyncStackFrame correctly point back to the parent’s AsyncStackFrame. We achieve this in the same place as where we fill out the returnAddress: in the await_suspend method.

To make this work, we need some way to get the AsyncStackFrame of the awaiting coroutine. As the AsyncStackFrame is stored inside the promise, we need to obtain the AsyncStackFrame from the promise of the caller and to obtain the promise of the caller we need to call the promise() method on the coroutine_handle passed to await_suspend().

Since the Task may be awaited from many different kinds of coroutines, each with their own promise-type, we’ll need to turn the await_suspend method into a template function so that we can deduce the right promise type.

We also need to define a standard way to obtain the AsyncStackFrame from a coroutine’s promise so that different promise types can implement it and have generic code able to find it. We require that the promise-type expose a getAsyncFrame() member function that returns a reference to the AsyncStackFrame object.

Putting this all together, we now have an await_suspend() method that looks like this:

template
template
__attribute__((noinline))
auto Task::Awaiter::await_suspend(std::coroutine_handle continuation) {
auto& callerFrame = continuation.promise().getAsyncFrame();
auto& calleeFrame = coro_.promise().getAsyncFrame();
calleeFrame.parentFrame = &callerFrame;
calleeFrame.returnAddress = __builtin_return_address(0);
// … other await_suspend() logic
return coro_;
}

By turning this await_suspend() method into a template we have now introduced a binary size bloat problem as this method will now be instantiated for each combination of caller and callee promise types, even though the only thing that is different is the offset of the AsyncStackFrame within the Promise.

We can factor out the common parts into a helper method that is not dependent on the awaiting coroutine’s promise type:

template
template>typename Promise>
__attribute__((always_inline))
auto Task::Awaiter::await_suspend(std::coroutine_handle continuation) {
return await_suspend_impl(continuation, continuation.promise().getAsyncFrame());
}

template
__attribute__((noinline))
auto Task::Awaiter::await_suspend_impl(std::coroutine_handle<> continuation,
AsyncStackFrame& callerFrame) {
auto& calleeFrame = coro_.promise().getAsyncFrame();
calleeFrame.parentFrame = &callerFrame;
calleeFrame.returnAddress = __builtin_return_address(0);
// … other await_suspend() logic
return coro_;
}

Now we have the coroutines maintaining a chain of AsyncStackFrame objects, and these AsyncStackFrame objects record (an approximation of) the return-address for each coroutine. The next step is to be able to find the currently active / topmost AsyncStackFrame and its corresponding normal stack-frame so we know where in the normal stack trace to splice in the async stack. We’ll discuss that, and how to walk the stack of frames, in the next post in the series.

To learn more about Facebook Open Source, visit our open source site, subscribe to our YouTube channel, or follow us on Twitter and Facebook.

Interested in working with open source technologies at Facebook? Check out our open source-related job postings on our career page.

Comments are closed.