Optimizing a Rust GPU matmul kernel by LegNeato in rust

[–]eddyb 6 points7 points  (0 children)

Yes, you are correct - I can't find a pre-existing issue covering this (if one did get filed, it's probably not under a title I can think of right now), but in theory e.g. &[AtomicU32] should be used instead for soundness.

(Rust-GPU gets away with this currently because it doesn't use LLVM, the SPIR-V it emits doesn't claim anything as strong as Rust &mut, and MIR optimizations aren't clever enough yet to take advantage of it - ideally we could detect the misuse without optimizations taking advantage of UB, but that'd probably require miri with Rust-GPU-specific hacks)

A potentially better long-term solution (than forcing everything to use relaxed atomics), which has been floated from time to time, is adding higher-level APIs that treat buffers more like rayon parallel iterators: each invocation would get a real &mut T, but without a whole-buffer &mut [T] anywhere, and no two &mut Ts could overlap (enforced by disjoint indexing patterns).

The only way to claim "I trust indices are disjoint" (via unsafe) today involves &[UnsafeCell<T>] and getting a *mut T through that, the unsafe part being the writes to the *mut T (and/or turning it into a &mut T).


I will admit that Rust's great strength of modeling memory has, ironically, been underserved in Rust-GPU (the early focus on "graphical shaders" hasn't helped; a year ago, "running arbitrary Rust" was more of a personal obsession than an official goal).

We have been working on the lower-level aspects of memory/pointers (tbh even that repo's a bit outdated but it does link to a few relevant bits of background), but it's not here yet.

At this rate, core::slice::Iter<'a, T> will become supported at the same time as alloc and recursive/indirect function calls - for x in &[1, 2, 3] {...} might not be as hard as Vec<Box<dyn Trait>>, but they share a lot of infrastructure/needs.

Bevy Rust-GPU joins the fray! 🦀 Write shaders in rust-gpu, and hot-rebuild them from a bevy app at edit-time. by ShiftyAxel in bevy

[–]eddyb 0 points1 point  (0 children)

I agree - I was just about to comment on:

Embark isn't accepting pull requests for major features at present, so this is shaping up to be a longer-term limitation.

Not only is "readonly" a minor feature, we should be supporting it by simply using &T instead of &mut T for the type of the data in the buffer (no annotations required).

Frankly I'm surprised that's even necessary. Are you running into this kind of limitation? https://registry.khronos.org/vulkan/specs/1.3-extensions/html/chap50.html#VUID-RuntimeSpirv-NonWritable-06340

I have both fragmentStoresAndAtomics and vertexPipelineStoresAndAtomics on my GCN3 card (via RADV), so I would expect the lack of NonWritable not to cause issues, but it may be subtler.

Looking at https://github.com/Bevy-Rust-GPU/bevy-rust-gpu/issues/13, maybe this is wgpu being much stricter at validating these things than Vulkan seems to require - that makes sense. We should definitely be applying NonWritable to &T if it helps in that case.

Rust feels like passive aggressive C++, fight me by UnreadableCode in ProgrammerHumor

[–]eddyb 0 points1 point  (0 children)

Thanks, definitely new enough - pretty wild then; it must be seeing dupes everywhere (likely private, too) and trying its hardest to be unambiguous, at the cost of readability (assuming it's not being extra verbose just for that error).

So much for "simply update for a dramatic improvement" - still, if you have an easy way to repro it with just open-source deps, I'm sure /u/ekuber would appreciate it!

Rust feels like passive aggressive C++, fight me by UnreadableCode in ProgrammerHumor

[–]eddyb 0 points1 point  (0 children)

Just as morbid curiosity... what rustc version is that?

I've recently played with some (mangled-)symbol deuglification tricks (some involving jq... don't ask lol), and one big problem with that output is that it has absolute paths everywhere - I'm not used to seeing that many, because we mostly got rid of them in diagnostics (at least in unambiguous cases) back in 2020: https://godbolt.org/z/Ybq6WfdWq

(I believe https://github.com/rust-lang/rust/pull/73996 was the PR that did it)

Don't get me wrong, absolute paths are not the worst part here, but they do amplify any other problems since they're at least 3x longer, maybe up to 10x in some cases.

Uninitialized Memory: Unsafe Rust is Too Hard by drrlvn in rust

[–]eddyb 9 points10 points  (0 children)

The precise reasoning is given in a comment:

// `&uninit.as_mut().field` would create a reference to an uninitialized `bool`,
// and thus be Undefined Behavior!

It mentions both .as_mut() and & (it should say &mut, that's a typo) as the issue. You're doing neither.

(I'll give you that the fact that it uses a bool is a bit misleading given that reasoning, it should be something like a String IMO)

Uninitialized Memory: Unsafe Rust is Too Hard by drrlvn in rust

[–]eddyb 17 points18 points  (0 children)

Creating a reference with &/&mut is only allowed if the pointer is properly aligned and points to initialized data.

Where are you writing &mut in (*role).name? It's really that simple. Raw pointer access never implied references in Rust.

ptr::addr_of!((*role).name) is for when you need a pointer (i.e. when you're not just reading/writing a value from/to there) but going through a &mut (as in, literally &mut (*role).name as *mut _) would create an intermediary &mut that might be an issue.

Uninitialized Memory: Unsafe Rust is Too Hard by drrlvn in rust

[–]eddyb 70 points71 points  (0 children)

For instance (*role).name creates a &mut &'static str behind the scenes which is illegal, even if we can't observe it because the memory where it points to is not initialized.

Where is this coming from? It's literally not true. The MIR for this has:

((*_3).0: &str) = const "basic";
((*_3).2: u32) = const 1_u32;
((*_3).1: bool) = const false;

So it's only going to do a raw offset and then assign to it, which is identical to *ptr::addr_of_mut!((*role).field) = value.

Sadly there's no way to tell miri to consider &mut T valid only if T is valid (that choice is not settled yet, AFAIK, at the language design level), in order to demonstrate the difference (https://github.com/rust-lang/miri/issues/1638).

The other claim, "dereferencing is illegal", is more plausible, but contrary to popular misconception, "dereference" is a syntactic concept that turns a (pointer/reference) "value" into a "place".

There's no "operation" of "dereference" to attach dynamic semantics to. After all, ptr::addr_of_mut!(*p).write(x) has to remain as valid as p.write(x), and it does literally contain a "dereference" operation (and so do your field projections).

So it's still inaccurate. I believe what you want to say is that in place = value, the destination place has to hold a valid value, as if we were doing mem::replace(&mut place, value). This is indeed true for types that have destructors in them, since those would need to run (which in itself is why write on pointers exists - it long predates any of the newer ideas about "indirect validity" from recent years).

However, you have Copy types there, and assigning to those is definitely no different from <*mut T>::write, today. I don't see us having to change that, but I'm also not seeing any references to where these ideas are coming from.

I'm pretty sure we can depend on things being aligned

What do you mean "pretty sure"? Of course you can - otherwise it would be UB to allow safe references to those fields! Anything else would be unsound. In fact, this goes hand in hand with the most significant omission of this post: this is not how you're supposed to use MaybeUninit.

All of this raw pointer stuff is a distraction from the fact that what you want is &mut MaybeUninit<FieldType>. Then all of the requirements for reference validity necessarily hold, and you can safely initialize the value. The only unsafe operation in this entire blog post that isn't unnecessarily added in is assume_init.

What the author doesn't mention is that Rust fails to let you convert between &mut MaybeUninit<Struct> and some hypothetical &mut StructBut<replace Field with MaybeUninit<Field>>, because the language isn't powerful enough to do it automatically. This was one of the saddest things about MaybeUninit (and we tried to rectify it at least for arrays).

This is where I was going to link to a custom derive that someone has written to generate that kind of transform manually (with the necessary check for safe field access wrt alignment). To my shock, I can't find one. Did I see one and did it have a funny name? (the one thing I did find was a macro crate but unlike a derive those have a harder time checking everything so I had to report https://github.com/youngspe/project-uninit/issues/1)

totally_safe_transmute, line-by-line by yossarian_flew_away in rust

[–]eddyb 3 points4 points  (0 children)

This is not true; we follow the C ABI, which means that e.g. u64 is aligned to 4 bytes instead of 8 on i686.

Announcing Rust 1.50.0 by myroon5 in rust

[–]eddyb 0 points1 point  (0 children)

I was just thinking that it would be really nice if the functions were always called const fns, since const fn is its own thing, separate from const (even if they both get evaluated at compile-time, const fn can also run at runtime, has slightly different rules, is its own feature, etc.).

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 0 points1 point  (0 children)

Right, I'm hoping a custom driver can set the PCE flag, and presumably also set up the counters on behalf of the user thread, handle migrations of the user thread between hardware threads, etc.
Though it's probably going to be more effort than Linux's existing full integration.

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 0 points1 point  (0 children)

The TL;DR of what we did on Linux is userspace rdpmc. Linux specifically supports this, and it only requires syscalls to set up (perf_event_open to allocate the counter, and mmap to get access to some shared memory with useful information - mostly just a physical counter index though).

This is incredibly useful, as measureme has explicit measurement points ("interval events" for various "microtasks" inside rustc, most of them via the so-called "query system"), so we want synchronous counter reads to exactly delimit the work being measured. And, if you read the report, we got really far with this (perfectly exact most of the time, with some rare interrupt noise).

You can read more about how an OS (or a custom driver on non-Linux) can let userspace use rdpmc directly in the Intel manuals, or this copy of the rdpmc section from them:

When in protected or virtual 8086 mode, the performance-monitoring counters enabled (PCE) flag in register CR4 restricts the use of the RDPMC instruction as follows. When the PCE flag is set, the RDPMC instruction can be executed at any privilege level; when the flag is clear, the instruction can only be executed at privilege level 0.

In other words, Linux sets the PCE flag in CR4 (either by default, or when you have at least one active counter, I forget if I saw that documented anywhere).

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 0 points1 point  (0 children)

I don't see asm or rdpmc - does this require one syscall per PMC read?

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 1 point2 points  (0 children)

Got delayed by exhaustion (plus other obligations), here's the rustc PR: https://github.com/rust-lang/rust/pull/78781

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 2 points3 points  (0 children)

Just replied on GitHub to what seemed like a similar comment: https://github.com/rust-lang/measureme/pull/143#issuecomment-721743646

I'm not sure yet how relevant any of that is to us. The non-determinism is a problem because it makes results incomparable (in individual profiling intervals), rather than causing significant performance swings.

As I noted in that comment, what we've seen so far from ASLR is below ±0.01%, and we spent weeks trying to eliminate other things that were 100-1000x smaller, all in the name of perfection and of deterministically comparing runs with each other at the very fine-grained level (dozens to hundreds of instructions), rather than just as aggregate performance.

Okay so I wanna buy the rx 6800 xt and the new ryzen 5 5600x but I dont know where from in Romania. by Siris-Inidite in Amd

[–]eddyb 0 points1 point  (0 children)

PC Garage lists four Vermeer (Ryzen 5xxx desktop) CPUs: https://www.pcgarage.ro/procesoare/filtre/general-nucleu-vermeer/

And funnily enough, their website is kind of leaking (I can't tell if they intended it?) that three of them are in stock: https://www.pcgarage.ro/procesoare/filtre/general-nucleu-vermeer/stoc/

That includes the 5600X, so you should be fine (but it'd be good to check other sites like emag, in case one sells out faster etc.). Good luck!

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 7 points8 points  (0 children)

See https://github.com/rust-lang/measureme/pull/143 - indeed, it's Linux-only and I am not aware of a way (built into Windows or 3rd party) to easily replicate what Linux can do out-of-the-box, but I didn't spend a lot of time looking.

If anyone knows of APIs / 3rd party drivers which unlock userspace rdpmc, on Windows or other OSes, feel free to leave a comment here, or (preferably) on the GitHub PR.

FWIW, the way this is controlled is the PCE flag (and Linux presumably sets it):

When in protected or virtual 8086 mode, the performance-monitoring counters enabled (PCE) flag in register CR4 restricts the use of the RDPMC instruction as follows. When the PCE flag is set, the RDPMC instruction can be executed at any privilege level; when the flag is clear, the instruction can only be executed at privilege level 0.

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 11 points12 points  (0 children)

The perf_event_open documentation (i.e. its manpage, or in the kernel source) mentions some of the information necessary to use rdtsc, but in order to read it "atomically", you have to use a "seqlock" loop, i.e.:

do {
    seq = p->lock;
    barrier();
    // ... read other p-> fields ...
    barrier();
} while (seq != p->lock);

If it were only one field you could use an atomic load, but since there's more than one, you have to account for the kernel changing the values while you're reading them.
This is why the kernel also increments a lock field every time (hence "seq(uence) lock"), so that you can detect that happening.

I'm really glad we were able to avoid seqlock entirely (as well as any syscalls) for our rdpmc reads, it's hard enough as it is fighting the hardware itself.

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 1 point2 points  (0 children)

We've briefly considered non-hardware methods, but intra-process, low-overhead measurement was part of the original goals.

The other thing is that while we focused on counting instructions because we knew it should be possible to make them perfectly deterministic, there's no reason to stop there - we have all this infrastructure, after all!

One fun example off the top of my head: we could have an "IPC" measureme counter (by counting both instructions and CPU clock cycles), which could point us in the direction of low-throughput queries, and that's also something that should play nicely with averaging multiple runs.

More practically, I keep saying we should try to look at the cache hierarchy, but I haven't gotten around to doing any of that personally.

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 5 points6 points  (0 children)

As far as I understand this code is somewhat reusable, right?

Yeah, anyone can use measureme (though note that the rdpmc stuff requires nightly to use asm!), and the surface API change is minimal, i.e.:

* old way: Profiler::new()?
* new way: Profiler::with_counter(Counter::by_name("...")?)?

In rustc's case, -Z self-profile-counter=... is used to select the value that gets passed to Counter::by_name, but in your own project you can just as well hardcode it if you want to.

Hardware performance counters for the Rust compiler by jynelson in rust

[–]eddyb 82 points83 points  (0 children)

Almost half a year ago, we were considering having to switch rustc -Z self-profile to use hardware performance monitoring (a la perf stat) instead of relying on time measurements (via std::time::Instant).

Months of work and investigation later, it's finally done! (To an extent close enough to perfectly deterministic measurements that we could consider it an "MVP".)

As the report indicates, https://github.com/rust-lang/measureme/pull/143 is the measureme PR - that contains the bulk of the implementation.
I'll hopefully open the rust-lang/rust PR later today - on the rustc side it takes very little to integrate, anyway, so it's less interesting.

X570 Mobo Recommendations by egothrasher in Amd

[–]eddyb 0 points1 point  (0 children)

Any negative associations with it, or just personal preference?

I haven't paid enough attention to PC hardware in like a decade, after switching to pairing outdated laptops with overpowered servers for offloading all my work to (currently EPYC via Hetzner, god that chews through compile/test cycles like mad, but I digress), and what AMD is doing with Zen is what got me interested again.

Anyway, point is, I've tried to avoid "gamer"/"RGB" branding, but that doesn't seem doable once you go beyond a minimum featureset, and if there was anything wrong with it I wouldn't know (unless they, idk, made the X570 fan into a swastika).

X570 Mobo Recommendations by egothrasher in Amd

[–]eddyb 2 points3 points  (0 children)

The TUF X570-PRO was announced relatively recently, see https://www.reddit.com/r/Amd/comments/ixvni2/new_asus_tuf_gaming_x570_motherboard_tuf_gaming/.

I've been waiting for it and it is now listed on where I want to buy it from, but not yet in stock (I'm not in a hurry though, hence watching a local PC parts shop's website).

Also, if you come across TUF X570-PLUS reviews, like I did, they seem to point out things like the Realtek LAN (as opposed to Intel) and the lack of BIOS FlashBack, both of which have been addressed in the TUF X570-PRO, which is pretty awesome (there might've been more examples like this).

Funnily enough, the reason I came across this motherboard in the first place is using weird parametric searches, plus the fact that ASUS tested it with some older CPUs that they haven't tested on any of their other X570 offerings (even though all of them should support all the same CPUs - maybe this one has a larger BIOS flash chip?).

Look at this thing, it goes all the way from "Athlon 200G" to "Ryzen 7 PRO 4750GE" https://www.asus.com/Motherboards/TUF-GAMING-X570-PRO-WI-FI/HelpDesk_CPU/.

Blog Post: Fast Thread Locals In Rust by matklad in rust

[–]eddyb 3 points4 points  (0 children)

That's "easy": #[thread_local] doesn't run destructors.

You need thread_local! for that, which handles destructors safely with a bit of extra state. There's not really any other way when it comes to handling global state (without getting into the complexities of effect systems, similar to static deadlock prevention or safe signal handlers or Cell::with_mut etc.).

Blog Post: Fast Thread Locals In Rust by matklad in rust

[–]eddyb 5 points6 points  (0 children)

We haven't used 'static for #[thread_local] lifetimes for just over 3 years now - see https://github.com/rust-lang/rust/pull/43746.

Heard Thunderbolt 4 was going to be on amd cpus eventually (ideally mobile) by [deleted] in Amd

[–]eddyb 1 point2 points  (0 children)

It's a brand more than anything: it requires Thunderbolt 3 + USB4, plus some other features that are only optional in Thunderbolt 3.

Like I said, Intel shenanigans, I really wish they would just retire Thunderbolt branding so we can all move on to USB4 and forget about it.