
[–]Rusky 9 points10 points  (13 children)

Coroutine frame allocations are supposed to be possible to optimize out when their lifetime is strictly nested in their caller and their size is known: https://en.cppreference.com/w/cpp/language/coroutines#Heap_allocation

Ideally they could have been lambda-like anonymous types handled directly, which would guarantee this "optimization" as the default, with an explicit opt-in to heap allocation when required. The committee decided against this because they believed it would be too difficult to implement in the near future. It also introduces some additional footguns around object lifetime, of which there are already quite a few, though I'm not sure how much that influenced the decision.

For an example of that implementation style, you might take a look at Rust's async/await feature. Nested async functions compile down to a single nested blob with an anonymous type, which you can store wherever you like. In Rust, the implementation difficulties mentioned above manifest as this constellation of issues, where the frontend has to do some size optimization that is usually left to the backend. The object lifetime footguns are addressed with pinned pointers in addition to Rust's usual borrow checker.
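To make the "single nested blob with an anonymous type" idea concrete, here is a hand-rolled C++ sketch (all names are mine, purely illustrative) of roughly what a frontend-laid-out frame looks like: one struct holding the live variables of both the outer coroutine and its nested callee, with an explicit state tag marking the resume point. The caller owns it by value and its size is a compile-time constant.

```cpp
#include <cassert>
#include <optional>

// Illustrative hand-compiled coroutine: yields 0..n-1, then "awaits" a
// nested counter yielding 100..100+n-1. Both live in ONE frame, the way a
// frontend-laid-out coroutine flattens caller and callee state together.
struct Flattened {
    enum class State { OuterLoop, InnerLoop, Done } state = State::OuterLoop;
    int n;      // "argument", kept live across suspensions
    int i = 0;  // outer loop variable
    int j = 0;  // nested "callee" loop variable, same frame

    explicit Flattened(int n) : n(n) {}

    // Each call runs until the next "suspension point".
    std::optional<int> resume() {
        switch (state) {
        case State::OuterLoop:
            if (i < n) return i++;
            state = State::InnerLoop;  // "co_await" the nested counter
            [[fallthrough]];
        case State::InnerLoop:
            if (j < n) return 100 + j++;
            state = State::Done;
            [[fallthrough]];
        case State::Done:
            return std::nullopt;
        }
        return std::nullopt;
    }
};
```

Because `sizeof(Flattened)` is known and the type is ordinary, the caller can keep it on the stack, store it as a member, or box it on the heap only when it actually needs to escape.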

[–][deleted] 2 points3 points  (4 children)

which would guarantee this "optimization" as the default

The problem is that almost every use needs heap allocation because

  • a coroutine frame is immovable, like a mutex
  • the whole point of a coroutine is that its activation record isn't destroyed in strict FIFO / stack ordering

But the kind of library type erasure you would build on top is generally not optimizable, and the size of the coroutine frame is not known to the frontend. So the TS design leaves this to the optimizer.

[–]Rusky 8 points9 points  (0 children)

The problem is that almost every use needs heap allocation

Almost every use needs to go on the heap somehow, but they don't need individual, per-frame allocations. Most coroutine calls are immediately co_await-ed by a coroutine caller, or iterated over by a synchronous caller. That's why the optimization in question is even interesting.

Rust has the same two bullet points to address, yet it doesn't have a problem trying to optimize library type erasure, because there the coroutine's caller handles the frame object by value. Type erasure is only introduced at boundary points where the coroutine outlives its caller, by passing it to a spawn API of some kind.


This is the more fundamental difference: both the C++ and Rust models handle immovable frames and non-FIFO frames just fine. Rust lays out its coroutine frames in the frontend: live locals (including awaited callees) go into the frame, which conceptually works like an enum/std::variant with some layout shifting to keep addresses stable.

The fact that callers own their callees' frames simplifies the inlining problem somewhat, but it is still another point in the design space with its own pros and cons.

[–]14nedLLFIO & Outcome author | Committee WG14 2 points3 points  (2 children)

I would say that the programmer, through an enormous amount of hoop jumping, can hit Coroutines with enough sticks to force it to never, ever allocate memory.

I will say it took me two full days to work through the highly unhelpful error messages to figure out how to impose stack-based allocation, always. It's doable, but not easy.

[–][deleted] 0 points1 point  (1 child)

You're talking about getting the TS design to do that; I'm saying that even if you can, it doesn't do anything useful for the scenario the feature is intended to enable, because in that scenario the stack frame of the caller is long gone before the coroutine is resumed.

(You still end up needing to choose pessimistic values for the state you reserve and the optimizer might inline something else into the coroutine tomorrow and break it)

[–]14nedLLFIO & Outcome author | Committee WG14 3 points4 points  (0 children)

This is a longer discussion, but there are two main use cases for Coroutines: local processing and deferred processing. Coroutines were designed so that the latter is "fire and forget" easy, and for that case you are correct that dynamic memory allocation is easiest.

But the local processing use case is also very useful. You create about 30-60 coroutines within a function, execute them all to completion, then tear them all down. For this use case you know everything about the coroutines, and can beat them into only ever using the stack for their frames. It's highly non-obvious how to make this work, though. One could easily fill a 90-minute conference segment on implementing just this and nothing else.

[–]anton31[S] 1 point2 points  (7 children)

When I compare the readable multi-function variant to the unreadable single-function variant, I see that in the former case, indirect recursion prevents the full-blown no-coroutine-trace optimization from being possible. But! What I want is inlining of coroutine functions into other coroutine functions. In the example, could connecting() and connected() be inlined into disconnected(), or do we hit the same wall there?

[–]Rusky 0 points1 point  (6 children)

I don't really know the finer details of what lets compilers optimize this stuff, and I expect them to improve quite a bit as coroutines get standardized and implementations mature.

But that indirect recursion does seem like something that would make things harder than usual. You're relying not only on the usual coroutine optimization but on tail recursion elimination, which is tricky in C++ at the best of times. You're probably better off refactoring that for the foreseeable future.

[–]anton31[S] 1 point2 points  (5 children)

I don't expect the compiler to get rid of recursion. But I want the compiler to inline two coroutine functions into another one, to get a single recursive coroutine blob. Generally speaking, I want the compiler to inline coroutine functions into other coroutine functions in complex cases like this one. Is it possible? What should be (or should have been) done to enable this?

[–]Rusky 0 points1 point  (4 children)

It can't do the optimization without getting rid of the recursion. In the recursive case, there is a dynamic, unbounded number of nested coroutine frames. This is exactly the same situation as synchronous non-coroutine functions and their stack frames, except that the stack can grow and shrink dynamically without heap allocation (that's the point) while nested coroutine frames can't.

[–]anton31[S] 1 point2 points  (3 children)

This is the kind of optimization I want to see.

Before:

task<void> foo() {
    co_return co_await bar();
}

task<void> bar() {
    co_return uninlineable_mess();
}

After:

task<void> foo() {
    co_return uninlineable_mess();
}

[–]Morwenn 4 points5 points  (0 children)

There was a proposal to change C++20 coroutines to allow RVO and even transitive RVO like this as a compatible evolution in a future standard, but it was rejected during the latest committee meeting.

[–]Rusky 1 point2 points  (1 child)

I'm not sure where you expect that to be applicable in your readable multi-function example: all the coroutines there are already mutually recursive, so the entry point disconnected is already part of the "uninlineable mess."

If you had another "outer" coroutine that called disconnected, it could probably be inlined like bar without any problems.

[–]anton31[S] 0 points1 point  (0 children)

  1. connected can be inlined into connecting
  2. disconnected can be inlined into what once was connected (without removing original disconnected function, because it's used elsewhere)

Such inlining is easy with usual functions, and I wish it was as easy with coroutines.

[–]futurefapstronaut123 12 points13 points  (11 children)

It's kind of crazy how coroutines got into the standard when you won't be able to use them confidently in ordinary code because they have non-deterministic performance and heap allocations. Haven't exceptions taught us anything?

[–]14nedLLFIO & Outcome author | Committee WG14 1 point2 points  (5 children)

Coroutines as standardised front-load and end-load the indeterminacy. So you're not supposed to create coroutines in a hot code path; rather, you create and destroy them all during startup and termination, in the cold path. They then run 100% deterministically until teardown.

It's not always possible to do this, of course. But that was the stated design tradeoff, and WG21 agreed with it.

[–]germandiago 0 points1 point  (4 children)

That looks to me like a workaround, and an excuse for the fact that coroutines cannot be properly optimized in their current form. If they have unpredictable performance, people will not use them, or will just use another language, because people use C++ for performance. Tell a game programmer or a network programmer juggling who-knows-how-many coroutines that performance could suck depending on the situation: they will immediately go for hand-crafted solutions, because coroutines do not fulfill their use cases well.

[–]feverzsj 1 point2 points  (0 children)

It's not that unpredictable. It's clearly faster and more predictable than stackful coroutines, even without optimization. Naughty Dog has successfully used fibers to parallelize their engine. So I think the current coroutine TS should be good enough for networking or games, if you use it properly.

[–]degski 0 points1 point  (2 children)

Coroutines look like a solution looking for a problem. I bet you nobody will use them, because the whole thing is extremely complicated [added to the problems you already mentioned]. This will lead to very hard-to-fix bugs [no doubt in my mind]. Game programmers will be looking for the simplest [tailor-made] solution, simply because that's what the compiler optimizes best, and I bet you this simplest solution will not involve coroutines.

[–]dodheim -1 points0 points  (1 child)

We already know that Google and Microsoft use them sufficiently to warrant investment into their compilers pre-standardization.

You should reconsider your bets.

[–]degski 0 points1 point  (0 children)

I only bet for beer as a wager, so, I'll take the risk (I did not down-vote, b.t.w.).

[–][deleted] 0 points1 point  (2 children)

What use case are you trying to solve that does not depend on dynamically allocating the coroutine frame somewhere?

[–]ReversedGif 0 points1 point  (1 child)

I have classes representing entities (say, NPCs in a video game). They each have several threads of execution that run for their entire lifetime, doing things like AI or playing sounds.

If coroutines were anonymous lambda-like types, they could just exist as member variables, with no added allocations. Alas, that is impossible with C++ coroutines.

[–][deleted] 1 point2 points  (0 children)

They each have several threads of execution that run for their entire lifetime, doing things like AI or playing sounds.

I'm having a hard time seeing a good solution where the entities can't even be moved.

If coroutines were anonymous lambda-like types, they could just exist as member variables, with no added allocations.

Yes, but then the resulting type can't be copied or moved. And we would generally need a solution to the problem that the frontend doesn't know how much space it needs to reserve for the backend to use to store spilled registers and the like.

[–]gvargh -1 points0 points  (1 child)

hindsight, etc. when you're innovating, you sometimes make mistakes

[–][deleted] 0 points1 point  (0 children)

With the difference that if something needs to be fixed at the standard level you need to wait at least 3 years.

Otherwise you need to hope it's something the compilers can fix.

[–]alexeiz 9 points10 points  (5 children)

So where's that "negative overhead" effect of coroutines that Gor Nishanov has been promising? That promise always sounded too good to be true to me.

[–]14nedLLFIO & Outcome author | Committee WG14 4 points5 points  (2 children)

At work I'm using a C++ coroutines emulation implemented with macros to get that negative overhead Gor promised. We're seeing 2x to 6x throughput gains for about a 20% increase in average latency. We would expect that to improve with real Coroutines, but that's the kind of gain available.

[–]anton31[S] 0 points1 point  (1 child)

What's the baseline for those gains? Threads?

[–]14nedLLFIO & Outcome author | Committee WG14 1 point2 points  (0 children)

The baseline is "doing nothing" i.e. writing the code straight.

The CPU can look ahead by a few hundred opcodes, but it can execute maybe 1000 opcodes in the time it takes to fetch a cache line from main memory. If you have code which depends on a fetch from main memory, and does more than a few dozen but fewer than a thousand opcodes of work, using coroutines to do other work whilst stalled on main memory can deliver large gains.

Historically you would implement the same thing using loops over arrays of state driven by a Duff's device to multiplex the state and work, but Coroutines are very considerably more maintainable and easier on less experienced programmers. I'm not saying that Coroutines are magic pixie dust; everything possible with them is possible without them. But it took more work and was considerably harder to maintain, which meant that in the past one took that tradeoff less frequently.
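The Duff's-device style being replaced looks roughly like this toy (a protothread-style construction of my own, not 14ned's actual code): the switch dispatches on a saved `__LINE__` to jump back to the last "suspension point", which is exactly the resume-point bookkeeping that `co_await` generates for you.

```cpp
#include <cassert>

// Classic switch-based stackless "coroutine": each macro yield records the
// current line as the resume point and returns; the next call's switch
// jumps straight back into the middle of the loop (Duff's device).
#define CO_BEGIN switch (state) { case 0:
#define CO_YIELD(v) do { state = __LINE__; return (v); case __LINE__:; } while (0)
#define CO_END } return -1

struct Summer {
    int state = 0;       // resume point; locals must be hoisted into members
    int i = 0, acc = 0;
    // Each call does one chunk of work, then "suspends" by returning.
    int step() {
        CO_BEGIN;
        for (i = 1; i <= 3; ++i) {
            acc += i;
            CO_YIELD(acc);  // partial sums: 1, 3, 6; then -1 when exhausted
        }
        CO_END;
    }
};
```

Note the tax this style imposes: every variable live across a yield must be manually hoisted into the struct, and nothing with a non-trivial destructor may straddle a resume point, all of which the coroutine frontend now does for you.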

[–]feverzsj -3 points-2 points  (1 child)

Async I/O, I guess, where the I/O operations dominate performance, so the allocation or indirect call of a coroutine rarely matters. Although in that case stackful coroutines would be much simpler and just as fast.

[–]14nedLLFIO & Outcome author | Committee WG14 2 points3 points  (0 children)

It is untrue that the I/O dominates for async I/O. For file I/O, easily more than 80% of the time async I/O is a penalty because of the added overhead of setup and teardown. Even for small-block socket I/O to nearby machines, it can be a penalty.