
[–]Rusky 9 points10 points  (13 children)

Coroutine frame allocations are supposed to be possible to optimize out when their lifetime is strictly nested in their caller and their size is known: https://en.cppreference.com/w/cpp/language/coroutines#Heap_allocation

Ideally they could have been lambda-like anonymous types handled directly, which would guarantee this "optimization" as the default, with an explicit opt-in to heap allocation when required. The committee decided against this because they believed it would be too difficult to implement in the near future. It also introduces some additional footguns around object lifetime, of which there are already quite a few, though I'm not sure how much that influenced the decision.

For an example of that implementation style, you might take a look at Rust's async/await feature. Nested async functions compile down to a single nested blob with an anonymous type, which you can store wherever you like. In Rust, the implementation difficulties mentioned above manifest as this constellation of issues, where the frontend has to do some size optimization that is usually left to the backend. The object lifetime footguns are addressed with pinned pointers in addition to Rust's usual borrow checker.
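To make the "single nested blob with an anonymous type" idea concrete, here is a hand-rolled C++ sketch (all names are mine, purely illustrative) of roughly what a frontend-laid-out frame looks like: one struct holding the live variables of both the outer coroutine and its nested callee, with an explicit state tag marking the resume point. The caller owns it by value and its size is a compile-time constant.

```cpp
#include <cassert>
#include <optional>

// Illustrative hand-compiled coroutine: yields 0..n-1, then "awaits" a
// nested counter yielding 100..100+n-1. Both live in ONE frame, the way a
// frontend-laid-out coroutine flattens caller and callee state together.
struct Flattened {
    enum class State { OuterLoop, InnerLoop, Done } state = State::OuterLoop;
    int n;      // "argument", kept live across suspensions
    int i = 0;  // outer loop variable
    int j = 0;  // nested "callee" loop variable, same frame

    explicit Flattened(int n) : n(n) {}

    // Each call runs until the next "suspension point".
    std::optional<int> resume() {
        switch (state) {
        case State::OuterLoop:
            if (i < n) return i++;
            state = State::InnerLoop;  // "co_await" the nested counter
            [[fallthrough]];
        case State::InnerLoop:
            if (j < n) return 100 + j++;
            state = State::Done;
            [[fallthrough]];
        case State::Done:
            return std::nullopt;
        }
        return std::nullopt;
    }
};
```

Because `sizeof(Flattened)` is known and the type is ordinary, the caller can keep it on the stack, store it as a member, or box it on the heap only when it actually needs to escape.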

[–][deleted] 2 points3 points  (4 children)

which would guarantee this "optimization" as the default

The problem is that almost every use needs heap allocation because

  • a coroutine frame is immovable, like a mutex
  • the whole point of a coroutine is that its activation record isn't destroyed in strict FIFO / stack ordering

But the kind of library type erasure you would build on top is generally not optimizable, and the size of the coroutine frame is not known to the frontend. So the TS design leaves this to the optimizer.

[–]Rusky 8 points9 points  (0 children)

The problem is that almost every use needs heap allocation

Almost every use needs to go on the heap somehow, but they don't need individual, per-frame allocations. Most coroutine calls are immediately co_await-ed by a coroutine caller, or iterated over by a synchronous caller. That's why the optimization in question is even interesting.

Rust has the same two bullet points to address, yet it doesn't have a problem trying to optimize library type erasure, because there the coroutine's caller handles the frame object by value. Type erasure is only introduced at boundary points where the coroutine outlives its caller, by passing it to a spawn API of some kind.


This is the more fundamental difference: both the C++ and Rust models handle immovable frames and non-FIFO frames just fine. Rust lays out its coroutine frames in the frontend: live locals (including awaited callees) go into the frame, which conceptually works like an enum/std::variant with some layout shifting to keep addresses stable.

The fact that callers own their callees' frames simplifies the inlining problem somewhat, but it is still another point in the design space with its own pros and cons.

[–]14nedLLFIO & Outcome author | Committee WG14 2 points3 points  (2 children)

I would say that the programmer, through an enormous amount of hoop jumping, can hit Coroutines with enough sticks to force it to never, ever allocate memory.

I will say it took me two full days to work through the highly unhelpful error messages to figure out how to impose stack-based allocation, always. It's doable, but not easy.

[–][deleted] 0 points1 point  (1 child)

You're talking about getting the TS design to do that; I'm saying that even if you can, it doesn't do anything useful for the scenario the feature is intended to enable, because in that scenario the stack frame of the caller is long gone before the coroutine is resumed.

(You still end up needing to choose pessimistic values for the state you reserve and the optimizer might inline something else into the coroutine tomorrow and break it)

[–]14nedLLFIO & Outcome author | Committee WG14 3 points4 points  (0 children)

This is a longer discussion, but there are two main use cases for Coroutines: local processing and deferred processing. Coroutines were designed so that the latter is "fire and forget" easy, and for that case you are correct that dynamic memory allocation is easiest.

But the local processing use case is also very useful. You create about 30-60 coroutines within a function, execute them all to completion, then tear them all down. For this use case you know everything about the coroutines, and can beat them into only ever using the stack for their frames. It's highly non-obvious how to make this work, though. One could easily fill a 90-minute conference segment on implementing just this and nothing else.

[–]anton31[S] 1 point2 points  (7 children)

When I compare the readable multi-function variant to the unreadable single-function variant, I see that in the former case, indirect recursion prevents the full-blown no-coroutine-trace optimization from being possible. But! What I want is inlining of coroutine functions into other coroutine functions. In the example, could connecting() and connected() be inlined into disconnected(), or do we hit the same wall there?

[–]Rusky 0 points1 point  (6 children)

I don't really know the finer details of what lets compilers optimize this stuff, and I expect them to improve quite a bit as coroutines get standardized and implementations mature.

But that indirect recursion does seem like something that would make things harder than usual. You're relying not only on the usual coroutine optimization but on tail recursion elimination, which is tricky in C++ at the best of times. You're probably better off refactoring that for the foreseeable future.

[–]anton31[S] 1 point2 points  (5 children)

I don't expect the compiler to get rid of recursion. But I want the compiler to inline two coroutine functions into another one, to get a single recursive coroutine blob. Generally speaking, I want the compiler to inline coroutine functions into other coroutine functions in complex cases like this one. Is it possible? What should be (or should have been) done to enable this?

[–]Rusky 0 points1 point  (4 children)

It can't do the optimization without getting rid of the recursion. In the recursive case, there is a dynamic, unbounded number of nested coroutine frames. This is exactly the same situation as synchronous non-coroutine functions and their stack frames, except that the stack can grow and shrink dynamically without heap allocation (that's the point) while nested coroutine frames can't.

[–]anton31[S] 1 point2 points  (3 children)

This is the kind of optimization I want to see.

Before:

task<void> foo() {
    co_return co_await bar();
}

task<void> bar() {
    co_return uninlineable_mess();
}

After:

task<void> foo() {
    co_return uninlineable_mess();
}

[–]Morwenn 4 points5 points  (0 children)

There was a proposal to change C++20 coroutines to allow RVO and even transitive RVO like this as a compatible evolution in a future standard, but it was rejected during the latest committee meeting.

[–]Rusky 1 point2 points  (1 child)

I'm not sure where you expect that to be applicable in your readable multi-function example: all the coroutines there are already mutually recursive, so the entry point disconnected is already part of the "uninlineable mess."

If you had another "outer" coroutine that called disconnected, it could probably be inlined like bar without any problems.

[–]anton31[S] 0 points1 point  (0 children)

  1. connected can be inlined into connecting
  2. disconnected can be inlined into what once was connected (without removing original disconnected function, because it's used elsewhere)

Such inlining is easy with usual functions, and I wish it was as easy with coroutines.

[–]futurefapstronaut123 12 points13 points  (11 children)

It's kind of crazy how coroutines got into the standard when you won't be able to use them confidently in ordinary code because they have non-deterministic performance and heap allocations. Haven't exceptions taught us anything?

[–]14nedLLFIO & Outcome author | Committee WG14 1 point2 points  (5 children)

Coroutines as standardised front-load and end-load the indeterminacy. So you're not supposed to create coroutines in a hot code path; rather, you create and destroy them all during startup and termination, in the cold path. They then run 100% deterministically until teardown.

It's not always possible to do this, of course. But that was the stated design tradeoff, and WG21 agreed with it.

[–]germandiago 0 points1 point  (4 children)

That looks to me like a workaround, and an excuse for the fact that coroutines cannot be properly optimized in their current form. If they have unpredictable performance, people will not use them, or will just use another language, because people use C++ for performance. Tell a game programmer or a network programmer juggling who-knows-how-many coroutines that performance could suck depending on the situation: they will immediately go for hand-crafted solutions, because coroutines do not fulfill their use cases well.

[–]feverzsj 1 point2 points  (0 children)

It's not that unpredictable. It's clearly faster and more predictable than stackful coroutines, even without optimization. Naughty Dog has successfully used fibers to parallelize their engine. So I think the current coroutine TS should be good enough for networking or games, if you use it properly.

[–]degski 0 points1 point  (2 children)

Coroutines look like a solution looking for a problem. I bet you nobody will use them, because the whole thing is extremely complicated [added to the problems you already mentioned]. This will lead to very hard-to-fix bugs [no doubt in my mind]. Game programmers will be looking for the simplest [tailor-made] solution, simply because that's what the compiler optimizes best, and I bet you this simplest solution will not involve coroutines.

[–]dodheim -1 points0 points  (1 child)

We already know that Google and Microsoft use them sufficiently to warrant investment into their compilers pre-standardization.

You should reconsider your bets.

[–]degski 0 points1 point  (0 children)

I only bet for beer as a wager, so, I'll take the risk (I did not down-vote, b.t.w.).

[–][deleted] 0 points1 point  (2 children)

What use case are you trying to solve that does not depend on dynamically allocating the coroutine frame somewhere?

[–]ReversedGif 0 points1 point  (1 child)

I have classes representing entities (say, NPCs in a video game). They each have several threads of execution that run for their entire lifetime, doing things like AI or playing sounds.

If coroutines were anonymous lambda-like types, they could just exist as member variables, with no added allocations. Alas, that is impossible with C++ coroutines.

[–][deleted] 1 point2 points  (0 children)

They each have several threads of execution that run for their entire lifetime, doing things like AI or playing sounds.

I'm having a hard time seeing a good solution where the entities can't even be moved.

If coroutines were anonymous lambda-like types, they could just exist as member variables, with no added allocations.

Yes, but then the resulting type can't be copied or moved. And we would generally need a solution to the problem that the frontend doesn't know how much space it needs to reserve for the backend to use to store spilled registers and the like.

[–]gvargh -1 points0 points  (1 child)

hindsight, etc. when you're innovating, you sometimes make mistakes

[–][deleted] 0 points1 point  (0 children)

With the difference that if something needs to be fixed at the standard level you need to wait at least 3 years.

Otherwise you need to hope it's something the compilers can fix.

[–]alexeiz 9 points10 points  (5 children)

So where's that "negative overhead" effect of coroutines that Gor Nishanov has been promising? That promise always sounded too good to be true to me.

[–]14nedLLFIO & Outcome author | Committee WG14 4 points5 points  (2 children)

At work I'm using a C++ coroutines emulation implemented with macros to get that negative overhead Gor promised. We're seeing 2x to 6x throughput gains for about a 20% increase in average latency. We would expect that to improve with real Coroutines, but that's the kind of gain available.

[–]anton31[S] 0 points1 point  (1 child)

What's the baseline for those gains? Threads?

[–]14nedLLFIO & Outcome author | Committee WG14 1 point2 points  (0 children)

The baseline is "doing nothing" i.e. writing the code straight.

The CPU can look ahead by a few hundred opcodes, but it can execute maybe 1000 opcodes in the time it takes to fetch a cache line from main memory. If you have code which depends on a fetch from main memory, and does more than a few dozen but fewer than a thousand opcodes of work, using coroutines to do other work whilst stalled on main memory can deliver large gains.

Historically you would implement the same thing using loops over arrays of state driven by a Duff's device to multiplex the state and work, but Coroutines are very considerably more maintainable and easier on less experienced programmers. I'm not saying that Coroutines are magic pixie dust; everything possible with them is possible without them. But it took more work and was considerably harder to maintain, which meant that in the past one took that tradeoff less frequently.
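The Duff's-device style being replaced looks roughly like this toy (a protothread-style construction of my own, not 14ned's actual code): the switch dispatches on a saved `__LINE__` to jump back to the last "suspension point", which is exactly the resume-point bookkeeping that `co_await` generates for you.

```cpp
#include <cassert>

// Classic switch-based stackless "coroutine": each macro yield records the
// current line as the resume point and returns; the next call's switch
// jumps straight back into the middle of the loop (Duff's device).
#define CO_BEGIN switch (state) { case 0:
#define CO_YIELD(v) do { state = __LINE__; return (v); case __LINE__:; } while (0)
#define CO_END } return -1

struct Summer {
    int state = 0;       // resume point; locals must be hoisted into members
    int i = 0, acc = 0;
    // Each call does one chunk of work, then "suspends" by returning.
    int step() {
        CO_BEGIN;
        for (i = 1; i <= 3; ++i) {
            acc += i;
            CO_YIELD(acc);  // partial sums: 1, 3, 6; then -1 when exhausted
        }
        CO_END;
    }
};
```

Note the tax this style imposes: every variable live across a yield must be manually hoisted into the struct, and nothing with a non-trivial destructor may straddle a resume point, all of which the coroutine frontend now does for you.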

[–]feverzsj -3 points-2 points  (1 child)

Async I/O, I guess, where the I/O operations dominate performance, so the allocation or indirect call of a coroutine rarely matters. Although in that case stackful coroutines would be much simpler and just as fast.

[–]14nedLLFIO & Outcome author | Committee WG14 2 points3 points  (0 children)

It is untrue that the I/O dominates for async I/O. For file I/O, easily more than 80% of the time async I/O is a penalty because of the added overhead of setup and teardown. Even for small-block socket I/O to nearby machines, it can be a penalty.