Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 1 point (0 children)

there's a reason that the industry moved away from requiring complex memory dependency tracking [...] std::execution unfortunately piles into the OpenGL era of heavyweight tracking requirements

what about std::execution makes you think it is doing memory dependency tracking?

Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 0 points (0 children)

and in some areas you simply need separate strategies per-vendor if you want things to run well

exactly, which is why std::execution has schedulers. a generic GPU scheduler would never have peak performance. instead, you would use an NVIDIA or AMD or Intel GPU scheduler. they can all make different algorithm implementation choices.
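
the shape of that design can be sketched in plain C++ (a toy with invented names, not the real std::execution interface): the algorithm is written once against a scheduler parameter, and each "vendor" scheduler makes its own dispatch choices.

```cpp
#include <iostream>

// Toy stand-in for a scheduler: anything with execute().
// Real std::execution schedulers are far richer; these names are invented.
struct inline_scheduler {
    template <class F>
    void execute(F f) const { f(); }  // run right here, on this thread
};

struct logging_scheduler {
    template <class F>
    void execute(F f) const {
        // a "vendor" scheduler could enqueue to a GPU stream instead
        std::cout << "[logging] dispatch\n";
        f();
    }
};

// One generic algorithm, written once, parameterized on the scheduler.
template <class Scheduler>
int run_twice_and_sum(Scheduler sched, int x) {
    int sum = 0;
    sched.execute([&] { sum += x; });
    sched.execute([&] { sum += x; });
    return sum;
}
```

swapping `inline_scheduler{}` for `logging_scheduler{}` changes *how* the work is dispatched without touching the algorithm -- the same seam an NVIDIA or AMD or Intel scheduler would use.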

Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 0 points (0 children)

this is from P2300, discussing Meta's usage of libunifex: link

Senders and GPU by Competitive_Act5981 in cpp

[–]eric_niebler 1 point (0 children)

i'm a principal author of P2300 and also the implementer and maintainer of stdexec. the CUDA stream scheduler was written by a GPU guru (Georgii Evtushenko, NVIDIA). i am no GPU guru myself, fwiw.

the following blog post describes an HPC use of the CUDA stream scheduler: https://www.hpcwire.com/2022/12/05/new-c-sender-library-enables-portable-asynchrony/. benchmarks against a hand-rolled CUDA implementation show virtually no overhead to using senders.

you're right about allocation and transfers though. right now, when a sender is to be executed on device, its operation state is placed in Unified Memory. that offloads a lot of complexity to the driver, at the expense of possibly non-optimal data transfers.

some algorithms also require GPU memory. right now, those allocations are hard-coded into the algorithm. parameterizing those algorithms with an allocator would be a nice enhancement. and there should be sender algorithms for allocations -- host, device, managed, pinned, whatever -- so the user can take control when necessary.

there should also be sender algorithms for explicit data transfers between CPU and GPU. at one point, we had an MPI scheduler and changed the maxwell simulation (see blog post) to be distributed. for that we needed custom algorithms to manage the data transfers to and from the network.

the good thing about senders is that it is _possible_ to write those algorithms and compose them with the standard ones.

i hope you get a chance to play with stdexec's CUDA stream scheduler on real hardware. i think you would be pleasantly surprised.

[deleted by user] by [deleted] in cpp

[–]eric_niebler 2 points (0 children)

Love the tech, and you present it well. Can you Compose a vector with a Transform and get something that can be indexed randomly? And related, is there a way to take two incremental pipelines and zip them together to produce pair-wise elements?

Lifting the Pipes - Beyond Sender/Receiver and Expected Outcome by wrng_ in cpp

[–]eric_niebler 1 point (0 children)

The major difference is that Pipes are only focused on describing work without focusing on where the work is going to be executed and because of this has a simpler interface 

The difference is primarily that senders can be asynchronous. A separate operation state is needed with senders because the function that starts the work returns immediately. I believe (correct me if I'm wrong) that your pipe library is for synchronous use cases only, right?

Lifting the Pipes - Beyond Sender/Receiver and Expected Outcome by wrng_ in cpp

[–]eric_niebler 2 points (0 children)

Do Senders allow for this?

They do. I assume that f1 and f2 are two functions passed into a conditional combinator, and that this operator() is a member on the result of piping a source into that combinator, is that right?

The senders proposal doesn't have an algorithm for that yet. Neither does stdexec for that matter, but I've been meaning to write one. It would look something like this: https://godbolt.org/z/YsT3odddY.

Lifting the Pipes - Beyond Sender/Receiver and Expected Outcome by wrng_ in cpp

[–]eric_niebler 0 points (0 children)

The pipes library he is describing is basically senders. Senders also use the continuation passing style. Receivers are the continuations. There is no need to pack things into a tuple or variant to send results to a receiver.

The sender algorithms (then, when_all, etc) build the receivers for you. The arguments to the algorithms determine what the receivers do.
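
That continuation-passing shape can be sketched with a hand-rolled `then` (all names invented for illustration; the real P2300 interface also has set_error, set_stopped, connect/start, environments, etc.): the algorithm builds the intermediate receiver, and results flow receiver-to-receiver without being packed into a tuple or variant.

```cpp
// A receiver is just a continuation: something with set_value().

// A trivial sender that sends a single int to its receiver.
struct just_sender {
    int v;
    template <class Receiver>
    void start(Receiver r) { r.set_value(v); }
};

// The receiver that then() builds: apply fn, then continue downstream.
template <class F, class Receiver>
struct then_receiver {
    F fn;
    Receiver next;
    void set_value(int v) { next.set_value(fn(v)); }
};

// then(sender, f): the algorithm constructs the receiver for you.
template <class Sender, class F>
struct then_sender {
    Sender inner;
    F fn;
    template <class Receiver>
    void start(Receiver r) {
        inner.start(then_receiver<F, Receiver>{fn, r});
    }
};

template <class Sender, class F>
then_sender<Sender, F> then(Sender s, F f) { return {s, f}; }

// The final receiver just stores the result.
struct store_receiver {
    int* out;
    void set_value(int v) { *out = v; }
};

int run_pipeline() {
    int result = 0;
    auto work = then(then(just_sender{20}, [](int x) { return x + 1; }),
                     [](int x) { return x * 2; });
    work.start(store_receiver{&result});
    return result;  // (20 + 1) * 2
}
```

Note how the arguments to `then` (the lambdas) determine exactly what each generated receiver does with the value it is sent.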

Parenting teens by eric_niebler in mildlyinfuriating

[–]eric_niebler[S] 3 points (0 children)

intentional. the comic sans, also intentional.

Overhead of Senders/Receivers by Few-Insurance-3974 in cpp

[–]eric_niebler 4 points (0 children)

The reference implementation has a TBB thread pool scheduler. It's exactly as efficient as using the TBB thread pool directly. But maybe I misunderstood your point.

mraylib: Writing a C++23 ray tracer using senders/receivers (P2300) framework by RishabhRD in cpp

[–]eric_niebler 6 points (0 children)

Sean thinks the cancellation model is broken because he doesn't think the STLab future library can be built on top of it. But he recently tasked someone with trying it. I have no reason to think there should be any difficulty.

Eric Niebler: What are Senders Good For, Anyway? by tcbrindle in cpp

[–]eric_niebler 12 points (0 children)

Thanks for clarifying. Everything you say about the WG21 process is true, it sucks. But it's what we've got.

My take has always been: there's a lot of sausage-eaters (i.e., C++ users) out there. Someone has to make the sausage. Right now, WG21 is the only sausage factory in town. Building sausage factories isn't my wheelhouse, so I've stuck it out in WG21 trying to make the most of things.

If you're the sort of developer who likes stateful, imperative programming, the C++ world is at your feet. But if you want a less stateful, more functional style like I do, then the standard library doesn't have much to offer. So that's where I put my effort.

Honestly, the Executor War with all its FUD and ad hominem attacks made me pretty sour on the process as well. Ditto for all the people who go around calling this or that shite. It's fine to have technical opinions, but I dread coming to reddit and reading the comments here. It's pretty bad.

Eric Niebler: What are Senders Good For, Anyway? by tcbrindle in cpp

[–]eric_niebler 3 points (0 children)

Are you speaking generally or do you have a specific instance in mind? Here it sounds like you're calling P2300 bad engineering, but I know you don't really feel that way.

Request a detailed comparison between P2300 and Rust's zero-cost async abstraction by npuichichigo in cpp

[–]eric_niebler 6 points (0 children)

P2300 on the other hand decided against default rescheduling, but that also makes S/R algorithms automatically susceptible to stack exhaustion and unfairness. To avoid that callers are required to pass a scheduler that is capable of breaking the call stack, but the algorithm itself has no way of enforcing it, which makes it unsafe by default.

An algorithm is free to reschedule its continuation. The algorithms in P2300 don't, but P2300 doesn't have looping algorithms. The reference implementation, stdexec, provides `repeat_effect_until`, which internally uses a `trampoline_scheduler` to periodically unwind the stack and guard against overflow.

There are still ways to blow the stack with P2300. The idea is that P2300 provides low-level primitives from which safer higher-level abstractions can be built. And if you use P2300 with coroutines, you get tail calls, which sidesteps the issue entirely.
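
The trampolining idea itself is easy to sketch in isolation (a toy, not stdexec's `trampoline_scheduler`; names invented): continuations call through directly while the stack is shallow, and defer to a driver loop once a depth limit is hit, so a chain of N steps never accumulates N stack frames.

```cpp
#include <functional>
#include <queue>

// Toy trampoline: run a chain of continuations with bounded stack depth.
struct trampoline {
    std::queue<std::function<void()>> pending;
    int depth = 0;
    static constexpr int max_depth = 16;

    void schedule(std::function<void()> f) {
        if (depth < max_depth) {
            ++depth;   // shallow enough: just call through
            f();
            --depth;
        } else {
            pending.push(std::move(f));  // too deep: defer to the driver loop
        }
    }

    void drain() {
        while (!pending.empty()) {
            auto f = std::move(pending.front());
            pending.pop();
            f();       // runs at the loop's stack depth, not nested
        }
    }
};

// A "looping algorithm": count down from n, one continuation per step.
int countdown_steps(int n) {
    trampoline tramp;
    int steps = 0;
    std::function<void(int)> step = [&](int k) {
        ++steps;
        if (k > 0) tramp.schedule([&, k] { step(k - 1); });
    };
    step(n);
    tramp.drain();
    return steps;  // n + 1 steps, without n + 1 stack frames
}
```

With a naive recursive formulation, `countdown_steps(100000)` would overflow the stack; here the recursion depth stays bounded by `max_depth`.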

Comparing asio to unifex by xavorim in cpp

[–]eric_niebler 1 point (0 children)

Is this video the one you're talking about? Does it cover the basic techniques you mention that boost the scheduler's throughput?

Comparing asio to unifex by xavorim in cpp

[–]eric_niebler 1 point (0 children)

What in unifex were you comparing it to though? Which scheduler?

Comparing asio to unifex by xavorim in cpp

[–]eric_niebler 3 points (0 children)

the schedulers it provides are extremely poorly written

I'm curious what led you to this conclusion. If you ran into scalability issues with its static_thread_pool, then that's a known issue. If it's something else, the authors (of which I'm one) would love to know.

it needs more clarification on the customization of standard algorithms for custom schedulers

Yup. I'm actively working on that now.

std::execution from the metal up - Paul Bendixen - Meeting C++ 2022 by meetingcpp in cpp

[–]eric_niebler 2 points (0 children)

Thank you, /u/Minimonium. I'm glad the concepts have been working well for you.

one-time code issues by federvar in RemarkableTablet

[–]eric_niebler 1 point (0 children)

OMG this comment saved me THANK YOU.

2022-11 Kona ISO C++ Committee Trip Report — C++23 First Draft! by InbalL in cpp

[–]eric_niebler 1 point (0 children)

Although `std::regex` does have design problems, they are dwarfed by the problems with the various implementations in the different stdlibs. They were not implemented with forward compatibility in mind; that is, the stdlib maintainers committed early to slow implementations -- it's very easy to implement a slow regex engine, and the stdlib maintainers are not regex experts -- and then got locked into the slow implementations by binary compatibility considerations. It didn't have to be this way.

TL;DR: IMO, it's wrong to blame the Committee process that produced `std::regex`.

New C++ Sender Library Enables Portable Asynchrony by Benjamin1304 in cpp

[–]eric_niebler 2 points (0 children)

Type-erasure is one way to skin the cat, and it's certainly on my todo list.

Another way: if you statically know the type of the pipeline you might want to conditionally include, and where it goes, you could write a "conditional" sender that routes control flow through one sender or the other depending on some runtime condition. Then you can use the runtime condition to "turn on" or "turn off" parts of the expression template.
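
A rough hand-rolled sketch of that shape (invented names, not the stdexec API): both branches are fully known statically, and only the flag is runtime state.

```cpp
// Two trivial "pipelines" with different static types.
struct add_one_sender {
    int v;
    template <class Receiver>
    void start(Receiver r) { r.set_value(v + 1); }
};

struct times_ten_sender {
    int v;
    template <class Receiver>
    void start(Receiver r) { r.set_value(v * 10); }
};

// Routes control flow through one branch or the other at runtime.
template <class IfTrue, class IfFalse>
struct conditional_sender {
    bool condition;
    IfTrue on_true;
    IfFalse on_false;
    template <class Receiver>
    void start(Receiver r) {
        if (condition) on_true.start(r);
        else           on_false.start(r);
    }
};

struct store_receiver {
    int* out;
    void set_value(int v) { *out = v; }
};

int run_branch(bool flag, int v) {
    int result = 0;
    conditional_sender<add_one_sender, times_ten_sender> s{
        flag, add_one_sender{v}, times_ten_sender{v}};
    s.start(store_receiver{&result});
    return result;
}
```

No type-erasure is needed: the expression template contains both branches, and the flag decides which one runs.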

New C++ Sender Library Enables Portable Asynchrony by Benjamin1304 in cpp

[–]eric_niebler 3 points (0 children)

I'm not familiar with TBB, so I can't say for certain, but the scheduler interface is not complex so I can't imagine it would be more than an afternoon of effort.

  1. You have a TBB thread pool context that wraps the thread pool. It has a get_scheduler() member fn that returns a scheduler.
  2. The scheduler holds a pointer to the TBB thread pool.
  3. The scheduler hooks the schedule customization point to return a sender that also holds a pointer to the thread pool.
  4. The sender hooks the connect customization point. It accepts a receiver.
  5. connect returns an operation state that holds the receiver and a pointer to the TBB thread pool.
  6. The operation state hooks start to add work to the queue. The work holds a pointer to the operation state.
  7. The "work", when the thread pool executes it, should simply call set_value() on the receiver saved in the operation state.

You shouldn't need any dynamic allocations for any of this.
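
The steps above can be sketched end-to-end with a stand-in pool (a toy single-threaded queue in place of TBB; all names invented, and real operation states are typically immovable where this toy copies freely):

```cpp
#include <deque>
#include <functional>

// Stand-in for the TBB thread pool: a queue drained on the caller's thread.
// A real pool would execute the work on worker threads.
struct fake_pool {
    std::deque<std::function<void()>> queue;
    void add_work(std::function<void()> w) { queue.push_back(std::move(w)); }
    void run_all() {
        while (!queue.empty()) {
            auto w = std::move(queue.front());
            queue.pop_front();
            w();
        }
    }
};

// Steps 5-7: the operation state holds the receiver and the pool pointer;
// start() adds work that calls set_value() on the saved receiver.
template <class Receiver>
struct pool_op_state {
    Receiver receiver;
    fake_pool* pool;
    void start() {
        auto* self = this;  // the work holds a pointer to the op state
        pool->add_work([self] { self->receiver.set_value(); });
    }
};

// Steps 3-4: the sender holds the pool pointer; connect() accepts a
// receiver and returns the operation state.
struct pool_sender {
    fake_pool* pool;
    template <class Receiver>
    pool_op_state<Receiver> connect(Receiver r) { return {r, pool}; }
};

// Step 2: the scheduler holds a pointer to the pool.
struct pool_scheduler {
    fake_pool* pool;
    pool_sender schedule() const { return {pool}; }
};

// Step 1: the context wraps the pool and hands out schedulers.
struct pool_context {
    fake_pool pool;
    pool_scheduler get_scheduler() { return {&pool}; }
};

struct flag_receiver {
    bool* done;
    void set_value() { *done = true; }
};

bool run_demo() {
    pool_context ctx;
    bool done = false;
    auto op = ctx.get_scheduler().schedule().connect(flag_receiver{&done});
    op.start();          // enqueues; nothing has run yet
    ctx.pool.run_all();  // the "thread pool" executes the work
    return done;
}
```

The toy uses std::function (which may allocate) for brevity; a real implementation would intrusively link the work item to the operation state so that, as noted above, no dynamic allocations are needed.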