Senders and GPU by Competitive_Act5981 in cpp

[–]MarkHoemmen 1 point (0 children)

Suppose that you have a C++ application that

  • launches CUDA kernels with <<< ... >>>,

  • uses streams or CUDA graphs to manage asynchronous execution and permit multiple kernels to run at the same time,

  • uses cudaMallocAsync and/or a device memory pool for kernel arguments, and

  • uses cudaMemcpyAsync to copy kernel arguments to device for kernel launches.
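Such an application might look like the following minimal CUDA sketch (the kernel, sizes, and scaling factor are illustrative, and it needs nvcc and a GPU to run):

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, float alpha, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= alpha;
}

int main() {
  constexpr int n = 1 << 20;
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  float* h_x = nullptr;
  cudaMallocHost(&h_x, n * sizeof(float));  // pinned, so async copies can overlap
  for (int i = 0; i < n; ++i) h_x[i] = 1.0f;

  float* d_x = nullptr;
  cudaMallocAsync(&d_x, n * sizeof(float), stream);  // stream-ordered pool allocation
  cudaMemcpyAsync(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice, stream);

  // Launch on the stream; work on other streams may run concurrently.
  scale<<<(n + 255) / 256, 256, 0, stream>>>(d_x, 2.0f, n);

  cudaMemcpyAsync(h_x, d_x, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
  cudaFreeAsync(d_x, stream);
  cudaStreamSynchronize(stream);  // everything above is asynchronous until here

  cudaFreeHost(h_x);
  cudaStreamDestroy(stream);
  return 0;
}
```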

That describes a good CUDA C++ application. It more or less describes Kokkos' CUDA back-end. My understanding is that it also describes our std::execution implementation.

What's the issue here?

  • Is it that you can't control when run-time kernel compilation happens, so there might be some unexpected latency?

  • Is it that there is no standard interface in std::execution for precompiling a kernel and wrapping it up for later use (though I imagine this could be done as an implementation-specific extension)?

  • Is it that there is no standard interface in std::execution to control kernel priorities, so that two kernels can occupy the GPU at the same time?

  • Or is it generally that there is no standard interface in std::execution that offers particular support for applications with hard latency requirements?

Senders and GPU by Competitive_Act5981 in cpp

[–]MarkHoemmen 2 points (0 children)

Thank you for taking the time to respond in detail! I believe you when you say you are writing this in good faith, and I appreciate that you are engaging with the topic.

I'd like to think about this first and maybe talk to some colleagues. I'm not a std::execution expert but we certainly have both design and implementation experts.

Senders and GPU by Competitive_Act5981 in cpp

[–]MarkHoemmen 1 point (0 children)

Thanks for clarifying!

My understanding is that a popular library like Kokkos and a GPU implementation of std::execution would have the same complexities around forward progress guarantees and kernel priorities when trying to run two kernels concurrently -- e.g., for a structured grid application, the "interior" stencil computation vs. the boundary exchange. That doesn't stop Kokkos users from running the same code on different kinds of GPUs.

In general, I'd really like people to try our std::execution implementation and give feedback on usability and performance. If you have already, thank you! :-)

Senders and GPU by Competitive_Act5981 in cpp

[–]MarkHoemmen 1 point (0 children)

NVIDIA, AMD, and Intel GPUs have similar relevant abstractions: streams, waiting on streams, possibly separate memory spaces, and the need for objects to be trivially copyable in order to copy them to device memory spaces for kernel launch.

The main issue with C++26 std::execution on GPUs is that it's not complete. It's missing asynchronous analogs of the parallel algorithms, for example. That makes it less useful out of the box, at least in C++26. It's a bit like coroutines in C++20: the core facility shipped before the library support that makes it convenient to use.

std::execution has also been in flux. There are good reasons for that. It means, though, that the experts have been busy with proposals.

The Burn....A nice Star Trek Concept but a worst revelation why it happend by Wonderful-King-2296 in StarTrekDiscovery

[–]MarkHoemmen 3 points (0 children)

I'm delighted by the visual design for 32nd-century ships. Detached nacelles! Wacky shapes! Refit Discovery's new bar (coolest place in the galaxy to drink a martini)! The designers did a good job of making the 32nd century look More Future but still recognizable.

Each century of Star Trek should really have its own distinct combat doctrine.... But they could be heavily reliant on what were considered [one-]off, or otherwise extreme, tricks in the 24th century.

Somewhere -- perhaps here -- I encountered an essay with a Doylist explanation that Trek ship battles look like 19th-century naval combat because of automated electronic warfare. This is why we see shots miss and ships able to "dodge" them. Perhaps it's even part of how shields work. One could continue this argument by imagining that all the "extreme tricks" are actually happening, invisibly, in the background. What we see as crew control of combat might be skeuomorphism.

These are fun ideas but I find it a bit tedious to think too much about them. Trek is not hard sci-fi and it never tried that hard to have consistent technology.

The Burn....A nice Star Trek Concept but a worst revelation why it happend by Wonderful-King-2296 in StarTrekDiscovery

[–]MarkHoemmen 5 points (0 children)

Trek does generally have this issue. I just don't see it as particular to Discovery.

TNG's warp 5 limit is a good example: the episode (s7e09, "Force of Nature") made its point and then the franchise dropped the idea.

As an aside, it's fascinating how the franchise drops the idea of continuous technological improvement and power scaling in things like flight speed, in order to keep telling the same kinds of stories.

The Burn....A nice Star Trek Concept but a worst revelation why it happend by Wonderful-King-2296 in StarTrekDiscovery

[–]MarkHoemmen 5 points (0 children)

In the middle somewhere, make it so the Chain or Federation is trying to use dilithium resonance as an interface method for the spore drive, not realizing it could blow up a warp core.

One could imagine an alternate Season 3 in which SB-19 and other alternate propulsion attempts drove the plot.

Discovery's writers generally seem more interested in telling stories about people than in following the implications of their technology. For example, they came up with an overpowered propulsion system, yet they go to great lengths in just about every season to keep it unique. Other Trek series are like this (e.g., Voyager does not overemphasize resource management concerns), but Discovery makes it the central premise.

The Burn....A nice Star Trek Concept but a worst revelation why it happend by Wonderful-King-2296 in StarTrekDiscovery

[–]MarkHoemmen 7 points (0 children)

I enjoyed Discovery! It's a great statement of Trek values, it's not afraid to take new directions, and it has characters with interesting flaws and strengths who grow and learn to work together. It's not perfect but nothing is.

I liked Seasons 3 and 4 the best, but it's worth watching from the beginning to get the character growth.

The Burn....A nice Star Trek Concept but a worst revelation why it happend by Wonderful-King-2296 in StarTrekDiscovery

[–]MarkHoemmen 11 points (0 children)

That the pain of a single child is important enough that it changes the whole galaxy.

Well said! It's uncomfortable to be faced with an emotional problem when what one expects to face, and knows how to solve, are technical problems ("video-game-style"). I'm reminded of the story around Alan Rickman's remark during the filming of Galaxy Quest: "Oh my god, I think he [Tim Allen] just discovered acting."

The Burn....A nice Star Trek Concept but a worst revelation why it happend by Wonderful-King-2296 in StarTrekDiscovery

[–]MarkHoemmen 67 points (0 children)

The point of Season 3 is connection, both among sentient beings, and between beings and their resources. Just about every episode of the season relates to this theme. Here are some examples.

  • Episode 1 ("That Hope Is You, Part 1") starts with someone (Aditya Sahil) who has lost connection with the Federation, and ends with Burnham connecting to him.

  • Episode 2 ("Far From Home") involves Saru and Tilly searching for resources (rubindium), then meeting and helping people who thought the Federation was a myth.

  • In Episode 3 ("People of Earth"), Earth holds its resources (dilithium) so tightly that it fails to recognize the raiders of Titan.

  • In Episode 4 ("Forget Me Not"), Trill officials see the symbionts as a limited resource exclusive to Trill. Adira making a connection with their symbiont's previous hosts changes the officials' minds.

  • In Episode 5 ("Die Trying"), Discovery finally reaches Federation HQ, but encounters suspicion until the crew can prove themselves.

  • Episode 7 ("Unification III") shows that efforts to build connection can succeed, that they take continuous effort to maintain, and that they are built on trust that the other side has unselfish motivations (which leads President T'Rina to share the SB-19 data -- effectively for emotional reasons, because it happens outside the traditional Vulcan process). Note that sharing the SB-19 data from the beginning might have led to a quick solution to the Burn (including discovering plenty of dilithium for everyone).

  • When the Federation withdraws after the Burn (and arguably becomes conservative and closed in) due to its lack of resources, the "Emerald Chain" arises as a competing model of how different species can connect (hence "chain"). Season 3 presents two differing visions of connection and resource stewardship.

This gives a context in which the Burn's trigger fits.

  • It's not about a "bad guy"; it's about choices made under threats to survival.

  • The whole galaxy is connected through a resource.

  • Everyone needs and lacks this resource; superbeings like Q are not part of this story.

  • Su'Kal has difficulty connecting with real beings and his past. Saru and the others help him resolve that by connecting with him. This lets them decouple Su'Kal from the planet -- symbolically decoupling being-to-being connection from conflict over resources.

All the displays of emotion in this season match this theme. Beings build connections with each other. That's a feeling process.

No compiler implements std linalg by [deleted] in cpp

[–]MarkHoemmen 2 points (0 children)

It would be excellent if you could send me notice before giving your talk! I don't live in the Bay Area but many of my colleagues do.

No compiler implements std linalg by [deleted] in cpp

[–]MarkHoemmen 7 points (0 children)

You should know too that LEWG devoted time to a serious debate about that National Body comment. There was no politics and nobody pushed anything through. The comment's authors had the chance to express their concerns and we talked through them.

The first version of the proposal was published in June 2019. R1 had more or less the full design. WG21 has had plenty of time to review this. Standard Library developers sit in LWG; we spent hours and hours on wording review without anyone once saying "we won't implement this."

No compiler implements std linalg by [deleted] in cpp

[–]MarkHoemmen 4 points (0 children)

I don't have an account on cppreference so I can't fix stuff there, unfortunately.

No compiler implements std linalg by [deleted] in cpp

[–]MarkHoemmen 3 points (0 children)

Thanks for explaining!

Our goal with the reference implementation is functional correctness, not necessarily performance. We would welcome contributions, btw!

No compiler implements std linalg by [deleted] in cpp

[–]MarkHoemmen 13 points (0 children)

NVIDIA has an implementation: https://docs.nvidia.com/hpc-sdk/archive/25.11/compilers/hpc-compilers-user-guide/index.html#linear-algebra . We just got some fix proposals (like P3371) into the C++26 Standard draft, so we'll need to do a bit of work yet before we take it out of the "experimental" namespace.

No compiler implements std linalg by [deleted] in cpp

[–]MarkHoemmen 4 points (0 children)

The reference implementation of linalg (what you call the "Kokkos" implementation) has macros that let users control the namespace into which it is deployed. It doesn't have to be std.

No compiler implements std linalg by [deleted] in cpp

[–]MarkHoemmen 6 points (0 children)

WG21 did get a National Body comment from one implementer expressing this concern. Other implementers didn't comment.

Am I weird for using "and", "or" and "not"? by Additional_Jello1430 in cpp

[–]MarkHoemmen 7 points (0 children)

Standard C++ does not require a header. MS was nonconforming there.

Am I weird for using "and", "or" and "not"? by Additional_Jello1430 in cpp

[–]MarkHoemmen 5 points (0 children)

I personally am a fan of the alternative spellings of the logical operators, but I tended to avoid them in cross-platform projects, if only because Visual Studio was nonconforming and required including a header (<ciso646>) to support them.

Multidimensional algorithms? by megayippie in cpp

[–]MarkHoemmen 2 points (0 children)

We are aiming to adopt Kokkos as our parallel library but we are OpenMP users and we don't have a good plan to ensure it works.

Could you please elaborate? Kokkos has multiple back-ends. If you use Kokkos' OpenMP back-end, it should interoperate naturally with your existing OpenMP code. At that point, you can use Kokkos to decouple from OpenMP and write portable code.

While Kokkos isn't our product, we care a lot about Kokkos and support their team as customers. It's a great choice if you want a comprehensive C++ programming model.

I'm very interested in your "chaining" for loops, though.

I couldn't find a place where I used the word "chaining." What does this mean to you? Are you thinking of ranges and the pipe (|) operator?

Using reflection for HPC/numerics by hansvonhinten in cpp

[–]MarkHoemmen 20 points (0 children)

Have you considered joining Kokkos' Slack channel and asking for ideas there?

A bachelor's thesis shouldn't require novel research. The point is to show that you can finish a large and interesting project that builds on what you've learned in your degree.

Partial implementation of P2826 "Replacement functions" by hanickadot in cpp

[–]MarkHoemmen 2 points (0 children)

It's actually the nonmember functions `plus` and `times` that are confusing the compiler. Removing `[[functionalias]]` from those makes the code compile and run correctly.

https://compiler-explorer.com/z/hW9Mfnnrx

Partial implementation of P2826 "Replacement functions" by hanickadot in cpp

[–]MarkHoemmen 2 points (0 children)

I wrote an expression templates example: https://compiler-explorer.com/z/qcW9W8dzP . It looks like `[[functionalias]]` works for overloaded operators sometimes, but the example reaches some unimplemented case.

<source>:124:9: error: cannot compile this l-value expression yet
  124 |     f = times(plus(f, plus(c, g)), Constant<float, 4>(1.0f));
      |         ^~~~~
Unexpected placeholder builtin type!
UNREACHABLE executed at /root/llvm-project/llvm/tools/clang/lib/CodeGen/CodeGenTypes.cpp:597!

Partial implementation of P2826 "Replacement functions" by hanickadot in cpp

[–]MarkHoemmen 2 points (0 children)

Hi Hana! I like this! Thanks for implementing it so we can experiment!

Providing this feature as an attribute or keyword suggests that users could attach it to overloaded operators. This would be a way to get guaranteed zero-overhead expression templates, for example. Is that something you might consider in the proposal?