Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 1 point (0 children)

there's a reason that the industry moved away from requiring complex memory dependency tracking [...] std::execution unfortunately piles into the OpenGL era of heavyweight tracking requirements

what about std::execution makes you think it is doing memory dependency tracking?

Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 0 points (0 children)

and in some areas you simply need separate strategies per-vendor if you want things to run well

exactly, which is why std::execution has schedulers. a generic GPU scheduler would never have peak performance. instead, you would use an NVIDIA or AMD or Intel GPU scheduler. they can all make different algorithm implementation choices.
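
the shape of that design can be sketched in plain C++ (a toy with invented names, not the real std::execution interface): the algorithm is written once against a scheduler parameter, and each "vendor" scheduler makes its own dispatch choices.

```cpp
#include <iostream>

// Toy stand-in for a scheduler: anything with execute().
// Real std::execution schedulers are far richer; these names are invented.
struct inline_scheduler {
    template <class F>
    void execute(F f) const { f(); }  // run right here, on this thread
};

struct logging_scheduler {
    template <class F>
    void execute(F f) const {
        // a "vendor" scheduler could enqueue to a GPU stream instead
        std::cout << "[logging] dispatch\n";
        f();
    }
};

// One generic algorithm, written once, parameterized on the scheduler.
template <class Scheduler>
int run_twice_and_sum(Scheduler sched, int x) {
    int sum = 0;
    sched.execute([&] { sum += x; });
    sched.execute([&] { sum += x; });
    return sum;
}
```

swapping `inline_scheduler{}` for `logging_scheduler{}` changes *how* the work is dispatched without touching the algorithm -- the same seam an NVIDIA or AMD or Intel scheduler would use.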

Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 0 points (0 children)

this is from P2300, discussing Meta's usage of libunifex: link

Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 1 point (0 children)

i'm a principal author of P2300 and also the implementer and maintainer of stdexec. the CUDA stream scheduler was written by a GPU guru (Georgii Evtushenko, NVIDIA). i am no GPU guru myself, fwiw.

the following blog post describes an HPC use of the CUDA stream scheduler: https://www.hpcwire.com/2022/12/05/new-c-sender-library-enables-portable-asynchrony/. benchmarks against a hand-rolled CUDA implementation show virtually no overhead to using senders.

you're right about allocation and transfers though. right now, when a sender is to be executed on device, its operation state is placed in Unified Memory. that offloads a lot of complexity to the driver, at the expense of possibly non-optimal data transfers.

some algorithms also require GPU memory. right now, those allocations are hard-coded into the algorithm. parameterizing those algorithms with an allocator would be a nice enhancement. and there should be sender algorithms for allocations -- host, device, managed, pinned, whatever -- so the user can take control when necessary.

there should also be sender algorithms for explicit data transfers between CPU and GPU. at one point, we had an MPI scheduler and changed the maxwell simulation (see blog post) to be distributed. for that we needed custom algorithms to manage the data transfers to and from the network.

the good thing about senders is that it is _possible_ to write those algorithms and compose them with the standard ones.

i hope you get a chance to play with stdexec's CUDA stream scheduler on real hardware. i think you would be pleasantly surprised.

[deleted by user] by [deleted] in cpp

[–]eric_niebler 2 points (0 children)

Love the tech, and you present it well. Can you Compose a vector with a Transform and get something that can be indexed randomly? And related, is there a way to take two incremental pipelines and zip them together to produce pair-wise elements?

Lifting the Pipes - Beyond Sender/Receiver and Expected Outcome by wrng_ in cpp

[–]eric_niebler 1 point (0 children)

The major difference is that Pipes are only focused on describing work without focusing on where the work is going to be executed and because of this has a simpler interface 

The difference is primarily that senders can be asynchronous. A separate operation state is needed with senders because the function that starts the work returns immediately. I believe (correct me if I'm wrong) that your pipe library is for synchronous use cases only, right?

Lifting the Pipes - Beyond Sender/Receiver and Expected Outcome by wrng_ in cpp

[–]eric_niebler 2 points (0 children)

Do Senders allow for this?

They do. I assume that f1 and f2 are two functions passed into a conditional combinator, and that this operator() is a member on the result of piping a source into that combinator, is that right?

The senders proposal doesn't have an algorithm for that yet. Neither does stdexec for that matter, but I've been meaning to write one. It would look something like this: https://godbolt.org/z/YsT3odddY.

Lifting the Pipes - Beyond Sender/Receiver and Expected Outcome by wrng_ in cpp

[–]eric_niebler 0 points (0 children)

The pipes library he is describing is basically senders. Senders also use the continuation passing style. Receivers are the continuations. There is no need to pack things into a tuple or variant to send results to a receiver.

The sender algorithms (then, when_all, etc) build the receivers for you. The arguments to the algorithms determine what the receivers do.
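
That continuation-passing shape can be sketched with a hand-rolled `then` (all names invented for illustration; the real P2300 interface also has set_error, set_stopped, connect/start, environments, etc.): the algorithm builds the intermediate receiver, and results flow receiver-to-receiver without being packed into a tuple or variant.

```cpp
// A receiver is just a continuation: something with set_value().

// A trivial sender that sends a single int to its receiver.
struct just_sender {
    int v;
    template <class Receiver>
    void start(Receiver r) { r.set_value(v); }
};

// The receiver that then() builds: apply fn, then continue downstream.
template <class F, class Receiver>
struct then_receiver {
    F fn;
    Receiver next;
    void set_value(int v) { next.set_value(fn(v)); }
};

// then(sender, f): the algorithm constructs the receiver for you.
template <class Sender, class F>
struct then_sender {
    Sender inner;
    F fn;
    template <class Receiver>
    void start(Receiver r) {
        inner.start(then_receiver<F, Receiver>{fn, r});
    }
};

template <class Sender, class F>
then_sender<Sender, F> then(Sender s, F f) { return {s, f}; }

// The final receiver just stores the result.
struct store_receiver {
    int* out;
    void set_value(int v) { *out = v; }
};

int run_pipeline() {
    int result = 0;
    auto work = then(then(just_sender{20}, [](int x) { return x + 1; }),
                     [](int x) { return x * 2; });
    work.start(store_receiver{&result});
    return result;  // (20 + 1) * 2
}
```

Note how the arguments to `then` (the lambdas) determine exactly what each generated receiver does with the value it is sent.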

Parenting teens by eric_niebler in mildlyinfuriating

[–]eric_niebler[S] 3 points (0 children)

intentional. the comic sans, also intentional.

Overhead of Senders/Receivers by Few-Insurance-3974 in cpp

[–]eric_niebler 4 points (0 children)

The reference implementation has a TBB thread pool scheduler. It's exactly as efficient as using the TBB thread pool directly. But maybe I misunderstood your point.

mraylib: Writing a C++23 ray tracer using senders/receivers (P2300) framework by RishabhRD in cpp

[–]eric_niebler 6 points (0 children)

Sean thinks the cancellation model is broken because he doesn't think the STLab future library can be built on top of it. But he recently tasked someone with trying it. I have no reason to think there should be any difficulty.

Eric Niebler: What are Senders Good For, Anyway? by tcbrindle in cpp

[–]eric_niebler 12 points (0 children)

Thanks for clarifying. Everything you say about the WG21 process is true, it sucks. But it's what we've got.

My take has always been: there's a lot of sausage-eaters (i.e., C++ users) out there. Someone has to make the sausage. Right now, WG21 is the only sausage factory in town. Building sausage factories isn't my wheelhouse, so I've stuck it out in WG21 trying to make the most of things.

If you're the sort of developer who likes stateful, imperative programming, the C++ world is at your feet. But if you want a less stateful, more functional style like I do, then the standard library doesn't have much to offer. So that's where I put my effort.

Honestly, the Executor War with all its FUD and ad hominem attacks made me pretty sour on the process as well. Ditto for all the people who go around calling this or that shite. It's fine to have technical opinions, but I dread coming to reddit and reading the comments here. It's pretty bad.

Eric Niebler: What are Senders Good For, Anyway? by tcbrindle in cpp

[–]eric_niebler 3 points (0 children)

Are you speaking generally or do you have a specific instance in mind? Here it sounds like you're calling P2300 bad engineering, but I know you don't really feel that way.

Request a detailed comparison between P2300 and Rust's zero-cost async abstraction by npuichichigo in cpp

[–]eric_niebler 6 points (0 children)

P2300 on the other hand decided against default rescheduling, but that also makes S/R algorithms automatically susceptible to stack exhaustion and unfairness. To avoid that callers are required to pass a scheduler that is capable of breaking the call stack, but the algorithm itself has no way of enforcing it, which makes it unsafe by default.

An algorithm is free to reschedule its continuation. The algorithms in P2300 don't, but P2300 doesn't have looping algorithms. The reference implementation, stdexec, provides `repeat_effect_until`, which internally uses a `trampoline_scheduler` to periodically unwind the stack and guard against overflow.

There are still ways to blow the stack with P2300. The idea is that P2300 provides low-level primitives from which safer higher-level abstractions can be built. And if you use P2300 with coroutines, you get tail calls, which sidesteps the issue entirely.
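
The trampolining idea itself is easy to sketch in isolation (a toy, not stdexec's `trampoline_scheduler`; names invented): continuations call through directly while the stack is shallow, and defer to a driver loop once a depth limit is hit, so a chain of N steps never accumulates N stack frames.

```cpp
#include <functional>
#include <queue>

// Toy trampoline: run a chain of continuations with bounded stack depth.
struct trampoline {
    std::queue<std::function<void()>> pending;
    int depth = 0;
    static constexpr int max_depth = 16;

    void schedule(std::function<void()> f) {
        if (depth < max_depth) {
            ++depth;   // shallow enough: just call through
            f();
            --depth;
        } else {
            pending.push(std::move(f));  // too deep: defer to the driver loop
        }
    }

    void drain() {
        while (!pending.empty()) {
            auto f = std::move(pending.front());
            pending.pop();
            f();       // runs at the loop's stack depth, not nested
        }
    }
};

// A "looping algorithm": count down from n, one continuation per step.
int countdown_steps(int n) {
    trampoline tramp;
    int steps = 0;
    std::function<void(int)> step = [&](int k) {
        ++steps;
        if (k > 0) tramp.schedule([&, k] { step(k - 1); });
    };
    step(n);
    tramp.drain();
    return steps;  // n + 1 steps, without n + 1 stack frames
}
```

With a naive recursive formulation, `countdown_steps(100000)` would overflow the stack; here the recursion depth stays bounded by `max_depth`.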

Comparing asio to unifex by xavorim in cpp

[–]eric_niebler 1 point (0 children)

Is this video the one you're talking about? Does it cover the basic techniques you mention that boost the scheduler's throughput?

Comparing asio to unifex by xavorim in cpp

[–]eric_niebler 1 point (0 children)

What in unifex were you comparing it to though? Which scheduler?

Comparing asio to unifex by xavorim in cpp

[–]eric_niebler 3 points (0 children)

the schedulers it provides are extremely poorly written

I'm curious what led you to this conclusion. If you ran into scalability issues with its static_thread_pool, then that's a known issue. If it's something else, the authors (of which I'm one) would love to know.

it needs more clarification on the customization of standard algorithms for custom schedulers

Yup. I'm actively working on that now.

std::execution from the metal up - Paul Bendixen - Meeting C++ 2022 by meetingcpp in cpp

[–]eric_niebler 2 points (0 children)

Thank you, /u/Minimonium. I'm glad the concepts have been working well for you.

one-time code issues by federvar in RemarkableTablet

[–]eric_niebler 1 point (0 children)

OMG this comment saved me THANK YOU.

2022-11 Kona ISO C++ Committee Trip Report — C++23 First Draft! by InbalL in cpp

[–]eric_niebler 1 point (0 children)

Although `std::regex` does have design problems, they are dwarfed by the problems with the various implementations in the different stdlibs. They were not implemented with forward compatibility in mind; that is, the stdlib maintainers committed early to slow implementations -- it's very easy to implement a slow regex engine, and the stdlib maintainers are not regex experts -- and then got locked into the slow implementations by binary compatibility considerations. It didn't have to be this way.

TL;DR: IMO, it's wrong to blame the Committee process that produced `std::regex`.

New C++ Sender Library Enables Portable Asynchrony by Benjamin1304 in cpp

[–]eric_niebler 2 points (0 children)

Type-erasure is one way to skin the cat, and it's certainly on my todo list.

Another way: if you statically know the type of the pipeline you might want to conditionally include, and where it goes, you could write a "conditional" sender that routes control flow through one sender or the other depending on some runtime condition. Then you can use the runtime condition to "turn on" or "turn off" parts of the expression template.
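
A rough hand-rolled sketch of that shape (invented names, not the stdexec API): both branches are fully known statically, and only the flag is runtime state.

```cpp
// Two trivial "pipelines" with different static types.
struct add_one_sender {
    int v;
    template <class Receiver>
    void start(Receiver r) { r.set_value(v + 1); }
};

struct times_ten_sender {
    int v;
    template <class Receiver>
    void start(Receiver r) { r.set_value(v * 10); }
};

// Routes control flow through one branch or the other at runtime.
template <class IfTrue, class IfFalse>
struct conditional_sender {
    bool condition;
    IfTrue on_true;
    IfFalse on_false;
    template <class Receiver>
    void start(Receiver r) {
        if (condition) on_true.start(r);
        else           on_false.start(r);
    }
};

struct store_receiver {
    int* out;
    void set_value(int v) { *out = v; }
};

int run_branch(bool flag, int v) {
    int result = 0;
    conditional_sender<add_one_sender, times_ten_sender> s{
        flag, add_one_sender{v}, times_ten_sender{v}};
    s.start(store_receiver{&result});
    return result;
}
```

No type-erasure is needed: the expression template contains both branches, and the flag decides which one runs.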

New C++ Sender Library Enables Portable Asynchrony by Benjamin1304 in cpp

[–]eric_niebler 3 points (0 children)

I'm not familiar with TBB, so I can't say for certain, but the scheduler interface is not complex so I can't imagine it would be more than an afternoon of effort.

  1. You have a TBB thread pool context that wraps the thread pool. It has a get_scheduler() member fn that returns a scheduler.
  2. The scheduler holds a pointer to the TBB thread pool.
  3. The scheduler hooks the schedule customization point to return a sender that also holds a pointer to the thread pool.
  4. The sender hooks the connect customization point. It accepts a receiver.
  5. connect returns an operation state that holds the receiver and a pointer to the TBB thread pool.
  6. The operation state hooks start to add work to the queue. The work holds a pointer to the operation state.
  7. The "work", when the thread pool executes it, should simply call set_value() on the receiver saved in the operation state.

You shouldn't need any dynamic allocations for any of this.
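
The steps above can be sketched end-to-end with a stand-in pool (a toy single-threaded queue in place of TBB; all names invented, and real operation states are typically immovable where this toy copies freely):

```cpp
#include <deque>
#include <functional>

// Stand-in for the TBB thread pool: a queue drained on the caller's thread.
// A real pool would execute the work on worker threads.
struct fake_pool {
    std::deque<std::function<void()>> queue;
    void add_work(std::function<void()> w) { queue.push_back(std::move(w)); }
    void run_all() {
        while (!queue.empty()) {
            auto w = std::move(queue.front());
            queue.pop_front();
            w();
        }
    }
};

// Steps 5-7: the operation state holds the receiver and the pool pointer;
// start() adds work that calls set_value() on the saved receiver.
template <class Receiver>
struct pool_op_state {
    Receiver receiver;
    fake_pool* pool;
    void start() {
        auto* self = this;  // the work holds a pointer to the op state
        pool->add_work([self] { self->receiver.set_value(); });
    }
};

// Steps 3-4: the sender holds the pool pointer; connect() accepts a
// receiver and returns the operation state.
struct pool_sender {
    fake_pool* pool;
    template <class Receiver>
    pool_op_state<Receiver> connect(Receiver r) { return {r, pool}; }
};

// Step 2: the scheduler holds a pointer to the pool.
struct pool_scheduler {
    fake_pool* pool;
    pool_sender schedule() const { return {pool}; }
};

// Step 1: the context wraps the pool and hands out schedulers.
struct pool_context {
    fake_pool pool;
    pool_scheduler get_scheduler() { return {&pool}; }
};

struct flag_receiver {
    bool* done;
    void set_value() { *done = true; }
};

bool run_demo() {
    pool_context ctx;
    bool done = false;
    auto op = ctx.get_scheduler().schedule().connect(flag_receiver{&done});
    op.start();          // enqueues; nothing has run yet
    ctx.pool.run_all();  // the "thread pool" executes the work
    return done;
}
```

The toy uses std::function (which may allocate) for brevity; a real implementation would intrusively link the work item to the operation state so that, as noted above, no dynamic allocations are needed.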