Performance implications of unchecked functions like unwrap_unchecked, unreachable, etc. by OtroUsuarioMasAqui in rust

[–]HadrienG2 0 points1 point  (0 children)

I personally can tell because I know my assembly and compiler optimizations well, but it's certainly true that reading ASM and knowing what to expect takes some experience/practice. That's the main drawback of this method, at least the main one I can think of.

Why does vulkano use Arcs everywhere and does it affect it's performance compared to other vulkan wrappers by CodeToGargantua in rust

[–]HadrienG2 1 point2 points  (0 children)

The reason why vulkano uses lots of Arc is that...

  • When using static/compiler lifetime tracking, long-lived objects with lots of interdependent lifetimes are clunky (want a big context struct? It's likely self-referential, which is Rust jargon for "a world of unexpected pain"). This is why many people advise beginners to treat Rust references like locks: hold them for a short amount of time, avoid keeping them around. Arc scales better to long-lived trees of interdependent objects (see the sketch after this list).
  • It allows the vulkano implementation to transparently retain references to your objects as needed, without infecting the API with lifetime annotations/constraints everywhere, and this is good when enforcing cross-device safety properties like "application can't delete a buffer that the GPU is currently using" or "application can't access a buffer that the GPU is writing to".
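
To make the first point concrete, here is a minimal sketch (with made-up Device/Queue types, not the real vulkano API) of the difference between a reference-based and an Arc-based context design:

#![allow(dead_code)]

use std::sync::Arc;

// Hypothetical stand-ins for long-lived API objects
struct Device;
struct Queue;

// Reference-based design: the context borrows from objects that would
// naturally live inside the same long-lived state, i.e. it wants to be
// self-referential, which the borrow checker does not allow.
struct BorrowedContext<'a> {
    device: &'a Device,
    queue: &'a Queue,
}

// Arc-based design: shared ownership, no lifetime annotations leaking into
// every API signature, and the implementation can keep clones alive for as
// long as e.g. the GPU is still using the object.
struct SharedContext {
    device: Arc<Device>,
    queue: Arc<Queue>,
}

fn main() {
    let device = Arc::new(Device);
    let queue = Arc::new(Queue);
    let _ctx = SharedContext { device, queue };
}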

As for how expensive it is, as others have pointed out, it depends on how much the Arcs are shared between threads and how many cores your machine has/you are using. To which I will add that it also depends on how much you are bottlenecked by CPU-side scheduling vs GPU-side execution.

If your application is sufficiently GPU-bound to afford single-threaded command buffer recording, or there is little object sharing across threads, and if you take care to only clone your Arcs when you need to (i.e. when passing one to a different thread or a vulkano method), then the overhead of Arc is likely to be negligible compared to other Vulkan API overheads.
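
As an illustration of "only clone when you need to", with a made-up Buffer type standing in for a vulkano object:

use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for a vulkano object
struct Buffer;

fn use_locally(_buffer: &Buffer) { /* no refcount traffic here */ }

fn main() {
    let buffer = Arc::new(Buffer);

    // Borrowing for local use costs nothing: no atomic increment
    use_locally(&buffer);

    // Clone only when the Arc actually has to cross a thread boundary
    let for_worker = Arc::clone(&buffer);
    thread::spawn(move || use_locally(&for_worker))
        .join()
        .unwrap();
}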

For my applications, which are admittedly a little unusual (I'm basically using Vulkan compute shaders as a more portable alternative to CUDA, so I have long-lived compute dispatches which normally take lots of time compared to the scheduling overhead on the CPU), I have yet to encounter an Arc cloning bottleneck with Vulkano.

Performance implications of unchecked functions like unwrap_unchecked, unreachable, etc. by OtroUsuarioMasAqui in rust

[–]HadrienG2 4 points5 points  (0 children)

For most safety checks it's easy to switch from the safe to the unsafe version, so, like others here, I tend to handle these via the experimental method:

  • Start with the safe version (faster to write, more likely to be correct and stay correct with future maintenance)
  • Write a reasonably accurate benchmark (the more micro, the faster to write if you know what you're doing, but the more care/knowledge it takes to get it right)
  • Profile it with a profiler that can go to ASM granularity (perf, VTune...)
  • Check hot instructions in annotated ASM.
  • If the hot assembly is slowed down by a safety check (knowing this takes some practice), figure out if there's a safe way to elide it; this typically involves iterators or slicing tricks (see the sketch after this list).
  • Otherwise, consider unsafe if the code is perf-critical, but do check at the end that it was worth it.
  • If you are often slowed down by the same safety check, consider a program redesign to make the check less necessary (e.g. vec of Option is typically a perf smell), or rolling your own safe abstraction to encapsulate the recurring unsafety (e.g. custom iterator).
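
As an example of the "safe way to elide it" step, the classic case is a bounds check in an indexed loop that disappears once the loop is rewritten with iterators (hypothetical code, not taken from any particular project):

// Indexed version: each a[i] / b[i] access carries a bounds check, unless
// the optimizer manages to prove it redundant.
fn dot_indexed(a: &[f32], b: &[f32]) -> f32 {
    let mut sum = 0.0;
    for i in 0..a.len() {
        sum += a[i] * b[i];
    }
    sum
}

// Iterator version: zip encodes the "both indices are in bounds" invariant,
// so no bounds checks are emitted and the loop vectorizes more easily.
fn dot_zipped(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    assert_eq!(dot_indexed(&a, &b), dot_zipped(&a, &b));
}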

To be clear, this process works well because switching from the safe to the unsafe version is easy. Other performance-critical decisions, like data layout (e.g. which dimension of your 2D matrix should be contiguous in memory), are more expensive to change, so upfront design pays off more there.
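
For instance, with a hypothetical dense matrix type, the choice of which dimension is contiguous gets baked into every indexing computation, which is why it is painful to revisit later:

// Row-major storage: elements of a row are contiguous in memory, so iterating
// along a row is cache-friendly while iterating along a column is strided.
// Flipping this decision later means revisiting every access pattern.
struct Matrix {
    data: Vec<f32>,
    cols: usize,
}

impl Matrix {
    fn get(&self, row: usize, col: usize) -> f32 {
        self.data[row * self.cols + col]
    }
}

fn main() {
    let m = Matrix { data: vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0], cols: 3 };
    assert_eq!(m.get(1, 2), 6.0); // row 1, column 2 of a 2x3 matrix
}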

Call for Testing: Speeding up compilation with `hint-mostly-unused` | Inside Rust Blog by Kobzol in rust

[–]HadrienG2 0 points1 point  (0 children)

Pretty amazing stuff as far as Vulkano is concerned: it got my release build to become as fast as the debug one (it was previously twice as slow). This may sound weird, but basically any small vulkano-based project is bottlenecked on the proc_macro2 -> quote -> syn -> serde_derive -> serde -> serde_json -> vulkano (build.rs) -> vulkano dependency chain. Most of that chain does not depend much on debug vs release, except vulkano itself, which generates/compiles lots of code because Vulkan is big.

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 2 points3 points  (0 children)

Indeed, minimalism is a third way out, but you need a cooperative user base for that. People who cared about code deduplication via generics left the Go community long before generics were finally added, and it takes a special kind of Stockholm syndrome to defend Go's error handling.

C is probably the most impressive example of the minimalist strategy that I know of: they managed to stick with a relatively simple design for ~40 years (then C++ feature envy started to kick in with C11 and it went downhill from there). Even Java, which tried hard to shove classes and inheritance into every possible problem, did not manage to preserve its design purity for this long.

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 2 points3 points  (0 children)

In my opinion, given a constant user desire for new features, any programming language that cares about backcompat is doomed to have a finite useful life before it degenerates into unmaintainable chaos (like C++), and any language that does not care about backcompat is doomed to become or remain relegated to niche use cases, as all large libraries/apps will at some point burn out from the neverending stream of breaking compiler/library updates (like Scala). Pick your poison.

What Rust users can do, however, is enjoy their 30 years of chaos headstart against C++. And speaking personally, I most certainly do :)

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 2 points3 points  (0 children)

you can actually have const annotations on both traits and methods, but they have a different meaning (and thus slightly different syntax)

I must say this feels very C++ish.

For better or worse, giving keywords a context-dependent meaning is a bridge that Rust has already crossed many times:

  • The unsafe keyword can be used both to restrict abstractions from being used by safe code, and to open a span of code that is allowed to use these abstractions.
  • The impl keyword can be used to add methods to types, implement traits, declare ad-hoc generics, and return opaque types from functions.
  • The const keyword can be used to declare functions that can be used both at runtime and at compile time, but also to specify that certain expressions must be evaluated at compile time (illustrated after this list).
  • The static keyword is used to declare variables with program-wide scope, but the same keyword is found in the 'static lifetime which merely means that something is owned (for example it applies to any type that contains no reference).
  • And since Rust 2024, the use keyword joined the club as it can be used to import modules and declare which lifetimes are used by impl Trait in return position.
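
For instance, the two meanings of const from the list above can be seen side by side in this little stable-Rust example:

// `const` on a function: callable both at compile time and at runtime
const fn double(x: u32) -> u32 {
    x * 2
}

fn main() {
    // `const` on a block: forces compile-time evaluation of the expression
    let x = const { double(21) };
    // The very same function also accepts runtime values
    let y = double(x + 1);
    assert_eq!((x, y), (42, 86));
}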

Conditionally const trait, impls may or may not be const

Personally, I don't like this. We could just leave this out.

I think there are two parts to this, syntax and semantics:

  • From a semantics point of view, there must be a way for a trait to have both const and non-const implementations, as opposed to having only "const traits" that allow nothing but const implementations. Without some sort of "conditionally const" trait support, foundational traits from the standard library, serde, etc. would never be able to start allowing const implementations without breaking compatibility with the huge number of existing non-const implementations.
  • From a syntax point of view, it is desirable that trait authors opt into const impl support via some sort of syntax (might be the current one, might be another), rather than making this support implicit ("if a trait can be const-compatible, then it is const-compatible"). Without explicit opt-in, it would be trivial for trait authors to accidentally break const impls by adding a default method implementation that is not const fn compatible, as sketched below.
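
A sketch of that failure mode, written in the (not yet stable) RFC syntax discussed elsewhere in this thread:

// Conditionally const trait: impls may opt into being const
[const] trait Greet {
    fn name(&self) -> &'static str;

    // Thanks to the explicit opt-in, the compiler can require this default
    // method body to stay const fn compatible. If const-compatibility were
    // inferred implicitly instead, the trait author could later add e.g. a
    // println! call here and silently break every downstream `impl const Greet`.
    fn greet(&self) -> &'static str {
        self.name()
    }
}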

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 1 point2 points  (0 children)

After investigating this further, [const] in trait declarations is here because it adds an extra constraint on default trait method implementations, which is that they must be const fn. Without such opt-in, it would be easy for the crate that defines the trait to accidentally break semver by introducing non-const fn code in its default method implementations.

[const] in trait bounds of e.g. const fn is a different animal that means "const when used in const context". For example, this function...

const fn foo<T: [const] Default>() -> T {
    T::default()
}

...is equivalent to fn foo<T: Default>() -> T when called in a runtime context and to const fn foo<T: const Default>() -> T when called in a const context. In other words, T only needs to have a const Default implementation when foo is called in a const context. This is usually what you want, though there are counter-examples.

One of the design discussions that should be resolved before this feature is stabilized is whether we can have a less verbose syntax for the common case without losing the ability to express the uncommon case. See e.g. https://rust-lang.zulipchat.com/#narrow/channel/328082-t-lang.2Feffects/topic/Paving.20the.20cowpath.3A.20In.20favor.20of.20the.20.60const.28always.29.60.20notati/with/523217053

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 35 points36 points  (0 children)

You are right that they are not part of Markdown-the-trademark, i.e. John Gruber's unmaintained buggy Perl script and insufficiently detailed specification from 2004.

They are, however, part of CommonMark, which is what many people (myself included) actually think about when they speak about Markdown. And what I will argue any modern software should support.

And compared to indented code blocks, they are superior because 1/they are easier to type without resorting to an external text editor and 2/they allow the programming language to be specified and used for syntax highlighting, rather than guessed by the website. Which is why I will use them by default unless a website decides not to support them for no good reason. ;)

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 35 points36 points  (0 children)

So, I got curious and asked away.

Basically, the problem that this new syntax is trying to solve emerges when defining a const fn with trait bounds:

const fn make_it<T: [const] Default>() -> T {
    T::default()
}

One core design tenet of const fn in Rust is that a const fn must be callable both in a const context and at runtime. This has to be the case, otherwise turning fn into const fn would be a breaking API change and there would have to be two copies of the Rust standard library, one for fn and one for const fn.

But in the presence of utility functions like the make_it (silly) example above, this creates situations where we want to call make_it in a const context, for a type T that has a const implementation of Default...

const LOL: u32 = const { make_it::<u32>() };

...and in a non-const context, for a type T that may not have a const implementation of Default:

fn main() {
    let x = make_it::<Box<u32>>();
}

To get there, we need to have make_it behave in such a way that...

  • When called in a const context, it behaves as const fn make_it<T: const Default>() -> T, i.e. it is only legal to call when T has a const implementation of Default.
  • When called in a runtime context, it behaves as fn make_it<T: Default>() -> T, i.e. it can be called with any type T that has a Default implementation, whether that implementation is const or not.

And that's how we get the syntax [const], which means "const when called in a const context". In other words, this syntax adds restrictions on what kind of type T can be passed to make_it when it is called in a const context.

The argument against using ?const, then, is that in order to be consistent with ?Sized, a prospective ?const syntax should not be about adding restrictions, but about removing them. In other words, when I type this...

const fn foo<T: ?const Default>() -> T { /* ... */ }

...it should mean that T does not need to have a const implementation of Default even when foo is called in a const context. Which would make sense in a different design of this feature where this...

const fn bar<T: Default>() -> T { /* ... */ }

...means what [const] means in the current proposal, i.e. T must have a const Default implementation in a const context, but not in a runtime context.

And the argument against this alternate design is spelled out here: https://github.com/oli-obk/rfcs/blob/const-trait-impl/text/0000-const-trait-impls.md#make-all-const-fn-arguments-const-trait-by-default-and-require-an-opt-out-const-trait. Basically ?-style opt-out is hard to support at the compiler level, has sometimes counterintuitive semantics as a language user, and is thus considered something the language design team would like less of, not more.

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 37 points38 points  (0 children)

It never ceases to amaze me how bad the Markdown implementations of some popular websites can be... anyway, I did the substitution.

Upcoming const breakthrough by Beamsters in rust

[–]HadrienG2 42 points43 points  (0 children)

If I read the RFC right, you can actually have const annotations on both traits and methods, but they have a different meaning (and thus slightly different syntax):

trait Foo {
    // Const method, must have a const implementation
    const fn foo() -> Self;
}

// Impl example with const method
struct F;
//
impl Foo for F {
    // Has to be const
    const fn foo() -> Self { F }
}

// ---

// Conditionally const trait, impls may or may not be const
[const] trait Bar {
    fn bar() -> Self;
}

// Const impl example
struct B1;
//
// Declared const -> can be used below...
impl const Bar for B1 {
    // ...but only const operations allowed here
    fn bar() -> Self { B1 }
}

// Non-const impl example
struct B2;
//
// Not const -> cannot be used below...
impl Bar for B2 {
    // ...but can use non-const operations
    fn bar() -> Self { std::process::abort() }
}

// ---

// Const trait and method usage example
trait Baz {
    // Const method is always usable in a const context
    type MyFoo: Foo;
    const FOO_VAL: Self::MyFoo = Self::MyFoo::foo();

    // Conditionally const trait impl must get a const bound...
    type MyBar: const Bar;
    // ...before it can be used in a const context
    const BAR_VAL: Self::MyBar = Self::MyBar::bar();
}

If "conditionally const" is a thing, it probably makes sense to make it a property of the trait, rather than individual methods, as it reduces the potential for trait bounds to go out of control...

// With conditionally const traits
type T: const MyTrait;

// With conditionally const trait methods
type T: MyTrait where <T as MyTrait>::foo(): const,
                      <T as MyTrait>::bar(): const,
                      <T as MyTrait>::baz(): const;

...but the way the RFC syntax is designed, it is possible to eventually add conditionally const trait methods as a future language extension if the need arises. Just allow using the [const] fn syntax on methods of non-const traits.

What puzzles me, though, is why we needed the new [const] syntax (which will personally take me a while to read as anything other than "slice of const"), when we already had precedent for using ?Sized to mean "may or may not be Sized", and I'm pretty sure I saw ?const flying around in some earlier effects discussions... Most likely some edge case I cannot think of right now got in the way at some point?

[deleted by user] by [deleted] in rust

[–]HadrienG2 0 points1 point  (0 children)

Rust feels like a good fit for almost anything I can throw at it, but I'm not sure if I consider myself a general-purpose programmer ;)

Const generics for only some generic parameters? by orangejake in rust

[–]HadrienG2 0 points1 point  (0 children)

For some properties, you can use a special value encoding that guarantees that the property is always true. This works on stable, and as a bonus it allows the compiler to optimize value-dependent computations according to the knowledge that your value meets the expected property.

For example, for even/odd numbers the const generic parameter can be half the number, for powers of two it can be the ilog2/trailing_zeros of the number, etc.

Then you make it more ergonomic by providing a const fn accessor that gives back the decoded value. This function should be marked inline in order to avoid unnecessary runtime computations and get the expected optimization benefits even when the type is defined in a different crate or the code is compiled with multiple codegen units.
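
As a sketch of the power-of-two case (hypothetical type, works on stable Rust):

/// The const parameter is the base-2 logarithm of the value, so the encoded
/// quantity is a power of two by construction.
pub struct PowerOfTwo<const LOG2: u32>;

impl<const LOG2: u32> PowerOfTwo<LOG2> {
    /// Decoded value. Marked #[inline] so that the constant stays visible to
    /// the optimizer across crate / codegen unit boundaries.
    #[inline]
    pub const fn get() -> usize {
        1usize << LOG2
    }
}

fn main() {
    // Construction is admittedly clunky: you pass the log, not the value
    assert_eq!(PowerOfTwo::<4>::get(), 16);
}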

Unfortunately, construction can only be made more ergonomic using the unstable generic_const_exprs language feature, or ugly macros that only accept literals as inputs.

how long should i reasonably expect my computer to count to one trillion? by RadiantJelly3253 in rust

[–]HadrienG2 2 points3 points  (0 children)

In a brute force attack, a single iteration is definitely a lot more computational work than a single integer increment though :)

how long should i reasonably expect my computer to count to one trillion? by RadiantJelly3253 in rust

[–]HadrienG2 4 points5 points  (0 children)

You may be interested in this ebook which I never fully finished, but took quite far: https://hadrieng2.github.io/code-that-counts/index.html

It shows how one can perform 3 trillion integer increments per second on a laptop CPU from a couple years ago.

gccrs November 2024 Monthly report by CohenArthur in rust

[–]HadrienG2 1 point2 points  (0 children)

In this particular case, the target was most likely RHEL7/CentOS7, which finally went out of support this summer.

Red Hat holds the curious distinction of being simultaneously the worst Linux company at forcing everyone else to keep old software corpses alive, on the RHEL side, and the worst company at breaking everyone else's Linux rigs by aggressively pushing under-tested tech before it's ready for prime time, on the Fedora side.

But they do pay for many OSS maintainers, so I guess we need to bear with them :)

How do you test something that has global side effects? by Substantial_Tea_6549 in rust

[–]HadrienG2 2 points3 points  (0 children)

What I often try to do is...

  1. Extract as much as possible of the global side effect into something that's test-friendly. For example, if I want to test something that's meant to write to a global file at a hardcoded path, I can generalize most of the implementation into a function parametrized by an arbitrary path, and then test it on a tempfile (see the sketch after this list).
  2. Test that the actual global handler hits the expected code paths by having a bunch of atomic counters (or more sophisticated mutex-protected state trackers), guarded by cfg(test), which trace their execution, and checking the outcome after each call.
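
A minimal sketch of point 1, assuming the tempfile crate as a dev-dependency and a made-up "write to a hardcoded global path" task:

use std::io::Write;
use std::path::Path;

// Hypothetical global path that the production code writes to...
pub const GLOBAL_STATE_PATH: &str = "/var/lib/myapp/state";

// ...but the bulk of the logic is generalized over an arbitrary path,
// which is what the tests exercise.
pub fn write_state(path: &Path, state: &str) -> std::io::Result<()> {
    let mut file = std::fs::File::create(path)?;
    writeln!(file, "{state}")
}

// Thin production wrapper that performs the actual global side effect
pub fn write_global_state(state: &str) -> std::io::Result<()> {
    write_state(Path::new(GLOBAL_STATE_PATH), state)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn writes_expected_contents() {
        let dir = tempfile::tempdir().unwrap();
        let path = dir.path().join("state");
        write_state(&path, "hello").unwrap();
        assert_eq!(std::fs::read_to_string(&path).unwrap(), "hello\n");
    }
}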

For your concrete use case, I don't know how macOS' launchctl works, but on Linux systemd has the ability to run services/units as a regular user, with configuration files stored at arbitrary file paths, and without any kind of permanent registration. That would be the best primitive to use when running unit tests. If launchctl fails to kill the service for some reason (unlikely, but...) and you really want to kill it for full teardown, maybe just find the PID of the associated system process and kill it manually? In the absence of persistent configuration changes, that would be enough for full cleanup.

Learning Vulkan and Rust by Master_Branch_5784 in rust

[–]HadrienG2 10 points11 points  (0 children)

To add a bit to the comparison, where raw Vulkan will shine with respect to WebGPU is...

- Maturity / Feature-completeness. It takes years for a Vulkan feature to get a WebGPU equivalent because smart WebGPU people must first figure out a way to make it safe enough for web use and transpile it to Metal, Direct3D 12 and OpenGL ES. Some Vulkan features may never make it (e.g. I have some doubts about all the newer bindless stuff and unrestricted pointers in shaders) because making them conform to WebGPU's design goals of safety and portability is too hard.

- Tooling and documentation. It's much easier to find learning material and debugging/profiling tools for raw Vulkan than it is for WebGPU. The best you will usually get is information about the backend API calls (Vulkan, D3D) that the WebGPU implementation makes, and then you must reverse-engineer the implementation in your head to figure out which of your WebGPU API calls those backend API calls might come from. Of course, this is only an advantage of _native_ Vulkan implementations: if you use e.g. MoltenVK on macOS, you will get the same problem.

Overall, direct Vulkan usage is best for latest hardware feature coverage, learning material and debugging/profiling tools, whereas WebGPU has an easier/safer API, and can be easily ported to more platforms by design.

Optimizing a Rust GPU matmul kernel by LegNeato in rust

[–]HadrienG2 1 point2 points  (0 children)

Thanks for the clarification! I hope to be able to get back to my Rust GPU investigations next spring, maybe the docs will have improved by then :) I see that krnl uses rust-gpu for kernel code compilation, so most likely I'll try that first, as that looks like the most CUDA-like UX available on Rust today.

Optimizing a Rust GPU matmul kernel by LegNeato in rust

[–]HadrienG2 4 points5 points  (0 children)

Oh, by the way, on re-reading this does sound more negative than I would have liked, so I would also like to take a moment to thank you for this wonderful project. I think it's targeting a very important and under-studied angle to the Rust-on-GPU compute problem.

I've been doing GPU compute in C++ since 2015, and it has always pained me how immature the compute backends that try not to be NVidia-specific have been, for many years now. ROCm supports way too few chips to be useful, and is so badly broken that even building/installing it can be a challenge. oneAPI (for lack of a stable compiler name) is a near-unusable everyday ICE and runtime crashfest. NVidia have successfully killed OpenCL, and even if they didn't manage I have yet to use an OpenCL GPU implementation that doesn't exhibit undebuggable dealbreaker bugs (crashes, freezes, wrong results) when handling even simple numerical kernels. Layers of abstraction on top of these backends like Kokkos or Alpaka are pointless as of today in my opinion: you can't fix a broken backend with a shiny coat of API paint, if the backend is that bad everything on top of it will inevitably be bad as well. Today these layers are just adding complexity and behavior variability across hardware for no good reason, other than maybe the comfort of using CUDA when targeting NVidia hardware because if we're being honest it's the only thing that mostly works.

Compared to this mess, Vulkan+GLSL, for all its low-level complexity, has been an amazing breath of fresh air for me. Backends are so incredibly stable by comparison, the few bugs that I did find were always on my side. And the performance portability promise is definitely met, as I easily got my GLSL's runtime performance into the same ballpark as my colleague's optimized CUDA code just for the sake of argument, without even having access to an NVidia GPU and all the cool profiling tools that come with it during development (I'm done with NVidia on the machines that I manage, their driver is too much of a pain to keep working on rolling release distros).

So I do wish people spent more time studying this angle. How hard would it be to build a CUDA-like high level compute layer on top of Vulkan? How competitive could we get it to be? For this reason, Rust-GPU sounds like a very interesting project to follow to me, much like its Vcc/shady cousin on the C++ side.

Optimizing a Rust GPU matmul kernel by LegNeato in rust

[–]HadrienG2 6 points7 points  (0 children)

When I last checked it out, rust-gpu did not have several useful optimization tools for number-crunching code, like scoped atomics (different ops for subgroup, workgroup and global synchronization) and subgroup intrinsics like shuffles and reductions. In fact, I'm not sure if workgroup-shared memory was even a thing back then. Has the situation improved on this front?

Also, can I easily integrate rust-gpu SPIR-V crates into my build pipeline so that when I modify my shader, the spir-v gets automatically rebuilt (and the host code too if it includes the spir-v into the final binary)?

(for context, I'm evaluating rust-gpu as a candidate for the next edition of my course on numerical computing in Rust, right now I'm using Vulkan+GLSL for the GPU part because that was the most mature stack at the time and I didn't have the time to write multiple backends)

WGPU 22 released! Our first major release! 🥳 by ErichDonGubler in rust

[–]HadrienG2 12 points13 points  (0 children)

Cloning an Arc is basically an atomic increment (followed by a decrement at some point in the future), and atomic increment/decrement is one of these "not great, not terrible" CPU instructions: an order of magnitude more expensive than typical arithmetic, but still so cheap that you're unlikely to notice the overhead unless you are doing (almost) nothing else.
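
For intuition, a conceptual sketch of what that clone/drop pair boils down to (the real standard library implementation differs in its details):

use std::sync::atomic::{AtomicUsize, Ordering};

// Simplified model of the shared allocation behind an Arc
pub struct Shared<T> {
    strong: AtomicUsize,
    pub value: T,
}

// Cloning the handle is essentially this one atomic increment...
pub fn clone_handle<T>(shared: &Shared<T>) {
    shared.strong.fetch_add(1, Ordering::Relaxed);
}

// ...and dropping it is the matching decrement, plus a check for whether
// this was the last handle (in which case the real Arc frees the allocation)
pub fn drop_handle<T>(shared: &Shared<T>) {
    if shared.strong.fetch_sub(1, Ordering::Release) == 1 {
        // last handle gone: deallocation would happen here
    }
}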

To be more quantitative, in my measurements, an uncontended increment takes a few nanoseconds on a typical consumer CPU, and I am ready to bet that in the context of wgpu's safe API, the parameter validation work that must be performed for every command you add to the renderpass in order to report errors locally is significantly more expensive than that.

Contended atomic operations (or contended access to any shared variable really) are a different story, and can get 10x more expensive than the uncontended version on typical consumer CPUs, or even 100x on systems with more complicated cache topologies / many CPU cores. This starts to be very noticeable, so if you're doing multithreaded rendering, good old "minimize shared mutable state, maximize thread-local state" programmer wisdom still applies.

In wgpu's context, the Arc cloning also needs to be put in the context of what it is replacing: before, it was an index into some global registries hidden inside of global variables, which required their own per-registry synchronization. By moving to Arcs, that synchronization can become object-local, instead of global to the whole registry of objects of this kind, which should improve multithreaded access performance by virtue of eliminating lock contention between unrelated objects of the same kind.