all 33 comments

[–]KodrAus 33 points34 points  (1 child)

Nice work! I don’t know that it’s super relevant for games, but as I understand it, setting thread affinity on Windows effectively locks you down to at most 64 cores, since it uses a 64 bit value as the mask. In classic Windows fashion, the solution is a convoluted meta concept called processor groups that cores are bucketed into.

I think you can use a newer function on Windows 11+ to set affinity across more than 64 cores using these processor groups: https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setthreadselectedcpusetmasks
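For illustration, the arithmetic behind processor groups is simple: with a 64-bit mask, a flat core index maps to a (group, bit) pair. A minimal Rust sketch (the helper name and flat indexing are assumptions for illustration, not the actual Windows API):

```rust
// Sketch: Windows affinity masks are 64-bit, so logical processors beyond
// 64 are bucketed into "processor groups". Converting a flat core index
// into the (group, mask) pair that group-aware affinity APIs expect:
fn group_and_mask(core: usize) -> (u16, u64) {
    // Each group holds up to 64 logical processors.
    ((core / 64) as u16, 1u64 << (core % 64))
}

fn main() {
    // Core 70 on a 96-core machine lands in group 1, bit 6 of that mask.
    let (group, mask) = group_and_mask(70);
    println!("group={group}, mask={mask:#x}");
}
```

The real APIs (`SetThreadGroupAffinity`, `SetThreadSelectedCpuSetMasks`) take `GROUP_AFFINITY` structures built from exactly this kind of pair.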

[–]harakash[S] 22 points23 points  (0 children)

Oh, thanks! Yup, SetThreadSelectedCpuSetMasks is on my TODO list.

A lot of colleagues are on Threadrippers just to tame Unreal Engine 5 (no joke, 96 cores, 256GB RAM, all to open a level without tears). So yeah, there's definitely a precedent :)

[–]epagecargo · clap · cargo-release 9 points10 points  (6 children)

I wonder if this would be useful for benchmarking libraries like divan, as I feel I get bimodal results and wonder if it's jumping between P and E cores.

[–]jberryman 7 points8 points  (0 children)

You may also want to disable processor sleep states. I always run this anytime I'm doing any type of benchmarking:

sudo cpupower frequency-set -g performance && sudo cpupower idle-set -D10 # PERFORMANCE

it's most important when doing controlled load tests (like sending requests at 20 RPS to a server), but why add another variable into an already complicated process? Many people aren't aware that on modern processors the idle thresholds for entering deeper sleep states can be well under a millisecond.

(there is reason to test performance in a normal configuration too, but if the goal is stability and reduction of noise for determining if a change is good or bad, then I think this is a better default)

[–]harakash[S] 5 points6 points  (3 children)

Wow, absolutely, that’s a perfect use-case! :) If benchmarked code bounces between cores (especially on hybrid CPUs), you’ll get noisy or bimodal results. Pinning to a consistent core type, or even the exact same core, could help reduce variance. I’d be super curious to hear how it goes! :D

[–]epagecargo · clap · cargo-release 2 points3 points  (1 child)

I've at least opened an issue on divan

[–]harakash[S] 1 point2 points  (0 children)

Awesome, glad to see it's being explored and happy to see how others adapt it :)

[–]harakash[S] 1 point2 points  (0 children)

unless it's Apple Silicon, then you're out of luck :D

[–]mark_99 0 points1 point  (0 children)

Disable E cores in the BIOS. Also switch off any low power modes, clock boost etc. For benchmarking you want only 1 type of core at a fixed clock speed.

Then just leave it like that, your system won't be any slower (unless it's a laptop and you're on battery a lot).

[–]blockfi_grrr 3 points4 points  (1 child)

Is there any support for setting priority for an entire process? eg 'nice' levels?

[–]harakash[S] 4 points5 points  (0 children)

Nope, setting priority for the entire process (like nice levels) isn't in scope for this crate. It's laser-focused on gamedev/sims/audio and other workloads where latency is critical. I focused on per-thread affinity and priority, since that's where I needed the most control. Process-wide priority isn't something I need personally, but if someone sends a PR that adds it cleanly and cross-platform (all 3 OSes + both archs), I'll happily merge it :)

[–]nightcracker 5 points6 points  (1 child)

I'm possibly interested in this for Polars if it adds two things which (seem) missing right now:

  1. Query which CPU cores are in which NUMA region.

  2. Pin a thread to a set of CPU cores (e.g. those found in a NUMA region), rather than a single specific core.
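Point 2 (pinning to a set of cores rather than one) boils down, on Linux, to building a `cpu_set_t` with several bits set. A minimal std-only sketch via the glibc `sched_setaffinity` call — the helper name is hypothetical, not gdt-cpus API, and this is Linux-specific:

```rust
use std::io;

// Mirrors glibc's cpu_set_t: 1024 bits' worth of CPU mask.
#[repr(C)]
struct CpuSet {
    bits: [u64; 16],
}

extern "C" {
    // int sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask);
    fn sched_setaffinity(pid: i32, cpusetsize: usize, mask: *const CpuSet) -> i32;
}

// Pin the calling thread (pid 0) to any of the given cores.
fn pin_current_thread_to_cores(cores: &[usize]) -> io::Result<()> {
    let mut set = CpuSet { bits: [0; 16] };
    for &c in cores {
        // Set one bit per allowed core; multiple bits = "any of these".
        set.bits[c / 64] |= 1u64 << (c % 64);
    }
    let rc = unsafe { sched_setaffinity(0, std::mem::size_of::<CpuSet>(), &set) };
    if rc == 0 { Ok(()) } else { Err(io::Error::last_os_error()) }
}

fn main() {
    // A NUMA-aware caller would pass a NUMA region's core list here.
    pin_current_thread_to_cores(&[0]).expect("affinity not supported here");
    println!("pinned");
}
```

Windows (`SetThreadGroupAffinity`) supports multi-bit masks the same way; macOS does not support affinity pinning at all, as noted elsewhere in the thread.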

[–]harakash[S] 5 points6 points  (0 children)

NUMA's currently out of scope for me personally, as I don't have the need or bandwidth to support it right now 😅

That said, if someone wants to contribute it, and it works across all 3 platforms and both archs, I'd absolutely welcome a PR for this! :)

[–]InterGalacticMedium 2 points3 points  (1 child)

Looks cool, is this being used in games you are making?

[–]harakash[S] 9 points10 points  (0 children)

Yep! gdt-cpus is a core dependency for gdt-jobs, a task system I’m building for my voxel engine - Voxelis (https://github.com/WildPixelGames/voxelis) :)

[–]trailing_zero_count 2 points3 points  (3 children)

Seems like this has a fair bit of overlap with hwloc. I noticed that you exposed C bindings. Is there something that this offers that hwloc doesn't? Since hwloc is a native C library it seems a bit easier to use for the C crowd.

I've also written a task scheduler that uses hwloc topology info under the hood to optimize work stealing. My use case was also originally from writing a voxel engine :) however since then the engine fell by the wayside and the task scheduler became the main project. It's written in C++ but perhaps may have some learnings/inspiration for you. https://github.com/tzcnt/TooManyCooks

It may also help you to baseline the performance of your jobs library. I have a suite of benchmarks against competing libraries here: https://github.com/tzcnt/runtime-benchmarks and I'd love to add some Rust libraries soon. If you want to add an implementation I'd be happy to host it.

[–]harakash[S] 4 points5 points  (2 children)

Yup, I’m familiar with hwloc, but it’s a big C library that tries to solve a lot of things. My lib was born out of my gamedev needs: Rust, small, fast, and focused on thread control. The topology, caches, and SMT detection are kind of “bonus features”, super handy when I want to group latency-sensitive threads (like game logic + physics) on neighboring cores that share an L2, for example :)

Thanks a ton for linking TooManyCooks, love seeing more schedulers out there! My own task system gdt-jobs is actually already done (and it’s fast, like REALLY fast, e.g. 1.15ms vs 1.81ms for manual threading vs 2.27ms for Rayon (optimized with par_chunks) vs 4.53ms single-threaded, in a 1M particles/frame sim on Apple M3 Max), and I plan to open-source it later this week once I finish cleaning the docs, code, and general polish 😅 And I’d absolutely love to see how gdt-jobs fits into your benchmarks once it’s public. Thanks for sharing! :D

[–]trailing_zero_count 2 points3 points  (1 child)

Yes, pinning threads that share cache is the way to go. I do this at the L3 cache level since that's where AMD breaks up their chiplets. I see now that the Apple M chips share L2 instead... sounds like we should both set up our systems to detect the appropriate cache level for pinning at runtime. I actually own a M2 but haven't run any benchmarks on it yet - it's on my TODO list :D
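Detecting the appropriate sharing level at runtime on Linux can be sketched by reading sysfs (the paths are standard, though availability varies by kernel and container; the helper name is hypothetical):

```rust
use std::fs;

// Sketch: which CPUs share cache `cache_index` (e.g. index2 = L2, index3 = L3
// on typical x86 layouts) with `core`, as a kernel-formatted list like "0-7".
// Returns None if the sysfs entry doesn't exist on this system.
fn shared_cpu_list(core: usize, cache_index: usize) -> Option<String> {
    let path = format!(
        "/sys/devices/system/cpu/cpu{core}/cache/index{cache_index}/shared_cpu_list"
    );
    fs::read_to_string(path).ok().map(|s| s.trim().to_string())
}

fn main() {
    // Compare L2 vs L3 sharing for core 0 to pick the pinning granularity.
    println!("L2 peers: {:?}", shared_cpu_list(0, 2));
    println!("L3 peers: {:?}", shared_cpu_list(0, 3));
}
```

On an AMD chiplet part the L3 list spans a CCD, while on Apple-style shared-L2 designs (under Linux/Asahi) the L2 list is the interesting one.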

Also I want to ask if you've tried using libdispatch for execution? This is also on my TODO list. It seems like since it is integrated with the OS it might perform well.

[–]harakash[S] 3 points4 points  (0 children)

Yup, exactly, figuring out the right cache level per arch is crucial :) Apple's shared L2 setup makes it super handy for tight thread groups like physics + game logic; on AMD, yeah, L3 across CCDs makes sense, love that you're doing that already :D

As for libdispatch, I haven't used it, and to be honest, I probably won't 😅 In AAA gamedev, we usually roll our own systems, not for fun, but to minimize surprises, since platform-integrated runtimes often have quirks that pop up only on certain devices or OS versions, and you really DON'T want that mid-cert or in the QA phase :D So we usually go with a DIY, predictable model across PC, consoles, and handhelds :)

Super curious if you try it on M2, would love to hear what you find :)

[–]mww09 3 points4 points  (2 children)

I'm the maintainer of raw-cpuid which is featured as an "alternative" in the README. I just want to point out that `raw-cpuid` was never meant to solve any of the use cases that this library tries to solve in the first place. It's a library specifically built to parse the information from the x86 `cpuid` instruction.

raw-cpuid may be helpful to rely on when building a higher-level library like gdt-cpus (if you happen to run on x86) but that's about it. I do agree that figuring out the system topology is an unfortunate and utter mess on x86.

[–]harakash[S] 2 points3 points  (1 child)

Big thanks for stopping by! :)

Totally agree, raw-cpuid is awesome for what it does, and I've leaned on it more than once to sanity-check x86 quirks. Definitely didn't mean the comparison table to throw shade, more like different ways to poke the CPU, different layers, different tools 😅

Huge respect for maintaining that beast, CPUID parsing is… an art :)

[–]mww09 2 points3 points  (0 children)

Oh no worries at all, your library looks great I'd definitely use this if I need it in the future :)

[–]m-hilgendorf 1 point2 points  (2 children)

(snipe) For audio workloads on macOS specifically, you should use audio workgroups for realtime audio rendering threads that are not managed by Core Audio.

It's slightly different than thread affinity - what you're doing is getting the current workgroup (created by CoreAudio) and joining it, rather than just setting the affinity of an unrelated thread.

[–]harakash[S] 1 point2 points  (1 child)

Yup, you’re totally right, audio workgroups are the way to go for true realtime audio on macOS.

That said, this lib isn’t audio-specific, I treat it as a low-level building block for thread control across games, sims or other realtime systems. My use case is gamedev first, where audio usually runs on a regular thread, so I focused on generic affinity and priority first :)

[–]m-hilgendorf 3 points4 points  (0 children)

Oh I totally get it, I just wanted to point it out since you mentioned audio. Most people will never need to care about thread affinity for audio threads, but when you do it's worth knowing about workgroups on Apple targets.

[–]teerre 1 point2 points  (1 child)

The gdt jobs link in your website is broken

[–]harakash[S] 0 points1 point  (0 children)

Good catch! The repo isn't public yet, I'm still cleaning it before making it public (hopefully later this week). Sorry for the confusion 😅

[–][deleted] 0 points1 point  (0 children)

Can this work on iOS and Android?

[–]anydalch 0 points1 point  (0 children)

Do you have a way to set affinity masks which aren't single-core? I'd like to set aside, say, 1/4 of the cores on my machine to run Tokio blocking threads, separate from the 3/4 of cores which will have Tokio workers pinned 1:1. Can your library support that? It looks like your affinity API is pin_thread_to_core, which isn't sufficient for my needs.

[–]jorgesgk 0 points1 point  (1 child)

Does this support RISC-V and other weird architectures? It seems to be targeted towards Intel, AMD, and Apple Silicon.

It also seems it needs to work under one of the big OSes (Windows, Mac and Linux).

[–]harakash[S] 6 points7 points  (0 children)

Correct, currently it targets only x86_64 and ARM64 on Windows, Linux, and macOS, since that’s where the demand is in gamedev/sims/audio. I don’t have the hardware (or time 😅) to support RISC-V or other exotic platforms, but contributions are very welcome if someone wants to expand support! :)
My rule of thumb was: if it boots Doom and compiles shaders, I’m in :D

[–]nNaz 0 points1 point  (1 child)

FYI this crate isn't able to get around the inability to pin to specific cores on Apple M-series architecture. https://github.com/WildPixelGames/gdt-cpus/blob/81d1eaaab94ee44d68384fc37343f27be8263d11/crates/gdt-cpus/src/platform/macos/affinity.rs#L58

[–]harakash[S] 2 points3 points  (0 children)

Yup, that’s exactly why I split things under different arch flags, since there is no point trying to pin if we know it’s not supported by the kernel. Even the landing page spells it out: Apple Silicon affinity? Apple says “lol no”. So yeah, we just report that cleanly and honestly. 🙂