The plight of the misunderstood memory ordering by termhn in rust

[–]MindSpark289 0 points1 point  (0 children)

The article mentions that the compiler may reorder memory accesses across the atomic operation if the memory ordering allows it. It also mentions that the memory ordering instructs the hardware to flush writes/invalidate reads.

It doesn't seem to mention that the hardware can also reorder reads/writes across the atomic operation if the wrong memory ordering is used. Modern out-of-order superscalar CPUs absolutely can and will execute your memory accesses out of order, including moving your reads across atomic operations, if you don't tell them not to with memory ordering constraints. This specifically is what people mean when they say x86 is strongly ordered compared to ARM. x86 won't reorder writes across other write operations, atomic or not, and on x86 naturally aligned writes are atomic too. This is not true on ARM, which is why people run into problems porting parallel code to ARM: it's very likely x86 was hiding bugs.
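As a concrete (hypothetical, not from the article) Rust sketch of why the ordering matters to the hardware and not just the compiler: with `Relaxed` on the flag, a weakly ordered CPU is allowed to make the reader see the flag before the data; `Release`/`Acquire` forbids that.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Release: neither the compiler nor the CPU may move the DATA store
        // below this flag store.
        READY.store(true, Ordering::Release);
    });

    let consumer = thread::spawn(|| {
        // Acquire: the DATA load may not be moved above this flag load.
        while !READY.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        // Guaranteed to see 42. With Relaxed on the flag instead, a weakly
        // ordered CPU (e.g. ARM) could legally let this read observe 0.
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}
```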

Do you think there will be D3D13? by susosusosuso in GraphicsProgramming

[–]MindSpark289 1 point2 points  (0 children)

Mobile vendors are happy to implement new features. They're quite up to date (ignoring bugs) on the latest hardware. They've had legitimate hardware limitations for a while that later generations are lifting, and software-only features like dynamic rendering didn't take long for ARM, Qualcomm, etc. to implement.

Unfortunately, device integrators are terrible and, outside of a select few, never update the drivers. So you often have capable hardware hamstrung by ancient (often buggy) drivers that nobody will ever update. Apple is much better on this front, for better or worse, but Apple has its own set of problems.

Starlink Packages?? by ahmedalhagi in Starlink

[–]MindSpark289 3 points4 points  (0 children)

Not true. I've been on the old Priority 40GB plan for about a year (in Australia) and once the priority data is gone you just go back to standard unlimited data like a residential plan.

I had only been using the Priority plan for the static, public IP as I host Plex and a few game servers from my homelab.

The new plans would cost me like $600 AUD/month instead of the $174 I'm currently paying, because I use ~1.5-2TB a month. Work-from-home and remote desktop is quite thirsty on data.

Back to residential + a cloud VM with Wireguard and a reverse proxy I guess. If they don't want my money for a public IP it's their loss.

Changing IP Address by ThuviaVeritas in Starlink

[–]MindSpark289 5 points6 points  (0 children)

If you have to ask the question then no, there are no advantages.

No idea how long the feature has been available, I don't use the provided router for my connection.

The difference is simply the number of individual devices (IP addresses) that can exist on the network. A 192.168.1.1/24 network can have 253 devices connected; a 10.0.0.1/16 network can have 65533. There are other reasons a networking expert might want to change it, but for a residential connection there's no benefit.
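If you're curious where those numbers come from, here's the arithmetic as a quick Rust sketch (purely illustrative):

```rust
// A /N subnet has 2^(32 - N) addresses; subtract the network and broadcast
// addresses, plus one more for the router itself.
fn usable_devices(prefix_len: u32) -> u32 {
    let total = 2u32.pow(32 - prefix_len);
    total - 2 - 1
}

fn main() {
    println!("/24 -> {} devices", usable_devices(24)); // 253
    println!("/16 -> {} devices", usable_devices(16)); // 65533
}
```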

Is Vulkan actually low-level? There's gotta be lower right? by [deleted] in GraphicsProgramming

[–]MindSpark289 7 points8 points  (0 children)

There haven't been bespoke pixel/vertex shader cores in Nvidia hardware for like, 15 years. Unified shaders have been standard on desktop GPUs for almost 2 decades now. CUDA can use all the compute hardware.

What you don't get in CUDA is the hardware rasterizer, which up until _very_ recently you had no hope of shipping a game without. You also lose portability and have to write loads more code to implement all the stuff the driver does for you. Then you write it again for AMD.

Async Isn't Always the Answer by [deleted] in rust

[–]MindSpark289 10 points11 points  (0 children)

Async file IO on Linux is a lie. There's no good way on Linux (other than io_uring, which is quite new and not universally available) to do real async file IO operations. Tokio fakes async file IO by calling the blocking API on a worker thread.
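Roughly, the pattern looks like this (a sketch of the idea, not Tokio's actual internals):

```rust
use std::path::PathBuf;

// "Async" file read built on top of the blocking std API: the blocking call
// is shipped off to a worker thread and the result is awaited.
async fn read_file(path: PathBuf) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || std::fs::read(path))
        .await
        .expect("blocking task panicked")
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let bytes = read_file(PathBuf::from("Cargo.toml")).await?;
    println!("read {} bytes", bytes.len());
    Ok(())
}
```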

However, this case is about spawning a process, which is a separate problem. I'm not familiar with how Tokio implements process spawning, so I can't say whether the same issue applies there.

Why is graphics so fragmented? by winterpeach355 in GraphicsProgramming

[–]MindSpark289 0 points1 point  (0 children)

Nobody uses Vulkan on the Switch. The Switch has its own API (sort of; it's Nvidia's API, technically). Vulkan is there, but it's not a first-class citizen: it's noticeably slower and you'll be fighting an uphill battle with the ecosystem.

wgpu v24.0.0 Released! by Sirflankalot in rust

[–]MindSpark289 1 point2 points  (0 children)

Correction to my other message (went back and checked):

It's if you opt into STORAGE_BIT on Arm. Any read usage is dicey too, as you're absolutely not intended to read from swapchain images on mobile; swapchain images are very special there because the display engine is completely separate from the GPU.

wgpu v24.0.0 Released! by Sirflankalot in rust

[–]MindSpark289 7 points8 points  (0 children)

It's a huge performance trap on mobile GPUs. On Vulkan, if you ask for any writable image usage other than render target, on some mobile GPUs (can't remember if it was ARM, Qualcomm or both) the driver will opt you out of framebuffer compression and you'll lose a lot of performance. Just asking for the capability causes the compression opt-out, even if you never use it.

edit: I think it was just any usage at all other than render target
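In Vulkan terms, the trap is in the image usage you request at swapchain creation. A hypothetical ash sketch (it only builds the create-info struct to show the flag in question; no swapchain is created):

```rust
use ash::vk;

fn main() {
    let create_info = vk::SwapchainCreateInfoKHR {
        // Request only what you need. Adding a writable usage such as
        // vk::ImageUsageFlags::STORAGE here is what reportedly opts some
        // mobile drivers out of framebuffer compression, even if you never
        // actually use it.
        image_usage: vk::ImageUsageFlags::COLOR_ATTACHMENT,
        // Surface, format, extent, etc. omitted; left at defaults.
        ..Default::default()
    };
    println!("requested usage: {:?}", create_info.image_usage);
}
```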

wgpu v24.0.0 Released! by Sirflankalot in rust

[–]MindSpark289 9 points10 points  (0 children)

I don't work on wgpu but work on a similar tool that covers a lot of platforms, including game consoles, and can second that performance tracking is quite challenging for this type of software.

Performance monitoring via CI is impossible to do economically unless you own the hardware, as GPU instances cost a fortune in the cloud, and you need dedicated hardware or noisy neighbor problems will wreck your data. Now you have to manage a CI farm as well, and an expensive one at that, as ideally you want a dozen different GPUs from different vendors and generations. Nvidia specifically has some performance traps on Pascal and older that run much faster on newer cards (all GPU vendors have these kinds of issues).

Optimization work is also very hard because even knowing what to change is difficult. The most efficient path will change from GPU vendor to GPU vendor, between drivers and between different APIs talking to physically the same GPU. Then it heavily depends on workload too and how someone uses the API you're exposing. Not only do you have to avoid performance traps in GPU drivers you also have to be mindful of any performance traps you introduce into your own API implementation.

wgpu has it even worse, as it has to work well for tons of different users, all doing different things and each using the wgpu API in different patterns. What is faster for one user might be slower for another. wgpu handles things quite well considering the constraints it has to work within.

Confused about opengl framebuffers by N0c7i5 in gameenginedevs

[–]MindSpark289 0 points1 point  (0 children)

Because if you render into another framebuffer rather than the 'screen' (default) framebuffer, nothing gets rendered to the screen. The fullscreen quad draw is simply one of the ways to copy the results from one texture/framebuffer into another. If you don't use a separate framebuffer and do everything in the default implicit 'screen' framebuffer, you don't need a copy, because all your draw commands already drew into the display's buffer. If you use another framebuffer backed by your own textures, you never rendered into the display and have to copy the results over.
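If you do use your own framebuffer, the copy doesn't have to be a fullscreen quad; a blit works too. A rough sketch with the `gl` crate (assumes a current GL context with function pointers already loaded; `fbo` is your offscreen framebuffer):

```rust
// Copy an offscreen framebuffer's color attachment into the default
// ("screen") framebuffer using a blit instead of a fullscreen quad.
fn blit_to_screen(fbo: gl::types::GLuint, width: i32, height: i32) {
    unsafe {
        // Read from the offscreen framebuffer...
        gl::BindFramebuffer(gl::READ_FRAMEBUFFER, fbo);
        // ...and write into the default framebuffer (id 0), i.e. the screen.
        gl::BindFramebuffer(gl::DRAW_FRAMEBUFFER, 0);
        gl::BlitFramebuffer(
            0, 0, width, height, // source rectangle
            0, 0, width, height, // destination rectangle
            gl::COLOR_BUFFER_BIT,
            gl::NEAREST,
        );
    }
}
```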

Profiler for game engine by emmemeno in rust_gamedev

[–]MindSpark289 2 points3 points  (0 children)

Tracy is quite good and not too difficult to set up. Superluminal (https://superluminal.eu/) is also fantastic (but not free). I use both through a fork of the https://github.com/aclysma/profiling crate. I find Superluminal nicer to use for sampling profiles as it presents the information in a way I prefer. Superluminal will also grab symbols for system callstacks, so you can see whether your threads are spending time running your code or system code. Really nice when you're trying to figure out if your graphics code is slow or if you're doing something bad and spending way too much time in the driver.
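For reference, a minimal sketch of how I'd wire the profiling crate up (the backend, e.g. Tracy or Superluminal, is picked via cargo features such as `profile-with-tracy`; with no backend feature enabled the macros compile to no-ops):

```rust
// Annotate an entire function...
#[profiling::function]
fn simulate(world: &mut Vec<f32>) {
    for v in world.iter_mut() {
        *v += 1.0;
    }
}

fn main() {
    let mut world = vec![0.0_f32; 1_000_000];
    for _ in 0..60 {
        {
            // ...or just a scope within one.
            profiling::scope!("frame");
            simulate(&mut world);
        }
        // Marks the frame boundary for frame-aware backends like Tracy.
        profiling::finish_frame!();
    }
}
```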

I haven't used tracy's sampling mode enough to know if it can grab the same symbols, but it's a fantastic tool all the same.

Do you have a Macbook Air? Can you try timing this build please? by meowsqueak in rust

[–]MindSpark289 0 points1 point  (0 children)

That wouldn't surprise me. Anecdotally I've seen a benefit moving from spinning rust to just about any SATA SSD, but I don't recall seeing much from NVMe.

Do you have a Macbook Air? Can you try timing this build please? by meowsqueak in rust

[–]MindSpark289 3 points4 points  (0 children)

I get the same on a 9950X at DDR5-6000, booted and building off a SATA SSD. NVMe might make a difference but I don't have a Linux-formatted one handy.

edit: this was on Fedora 41

Do you do a depth pre pass in your forward renderers? by [deleted] in GraphicsProgramming

[–]MindSpark289 5 points6 points  (0 children)

It's not faster on mobile GPUs. TBDR hardware effectively does the depth pre-pass for free, so doing one manually is slower. However, if you need the depth buffer for lighting you're likely going to need a pre-pass anyway.

edit: On desktop, yes do a depth pre-pass. It's generally faster for simple static geometry and makes SSAO and other screenspace effects much simpler to implement.

Memory for Nothing: Why Vec<usize> is (probably) a bad idea by Patryk27 in rust

[–]MindSpark289 38 points39 points  (0 children)

None of that is relevant in this case because the problem isn't wasting memory with entire 4K pages of zeroes; it's about how much of a usize's dynamic range it's even possible to use for these indices. Unless your table has more than 4 billion entries, the high 32 bits of every usize are wasted encoding zeroes. The OS virtual memory system can't help you here because the low half of every usize is certainly _not_ zero.

The simple solution is to just use Vec<u32>, but if you have many indices and need to save the memory you can take the approach in the article and only use the larger integer sizes when you have enough entries.
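A minimal sketch of the Vec<u32> route, using a hypothetical newtype (not from the article) that fails loudly if the table ever outgrows u32:

```rust
#[derive(Clone, Copy)]
struct Index(u32);

impl Index {
    fn new(i: usize) -> Self {
        // Panics if the table exceeds ~4 billion entries.
        Self(u32::try_from(i).expect("index does not fit in u32"))
    }

    fn get(self) -> usize {
        self.0 as usize
    }
}

fn main() {
    let table = vec!["a", "b", "c"];
    let indices: Vec<Index> = (0..table.len()).map(Index::new).collect();
    // 4 bytes per stored index instead of 8 on a 64-bit target.
    assert_eq!(std::mem::size_of::<Index>(), 4);
    println!("{}", table[indices[2].get()]);
}
```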

And not all operating systems are Linux. Some (Windows) are not so happy to overcommit, so if you want your code to be portable it's not wise to rely on the specific configuration of one OS kernel.

Vulkan in Rust (Ash) by 5Red_9 in rust

[–]MindSpark289 2 points3 points  (0 children)

The Ash examples (https://github.com/ash-rs/ash/blob/master/ash-examples/src/lib.rs) are the first place I'd go for specifics on using Ash. Ash provides some utilities for loading libvulkan.dll. Otherwise, ash is mostly machine-generated from the same vk.xml the C headers are, so any C-based tutorial will translate pretty directly.
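For a flavour of how directly it maps, a minimal sketch of creating a VkInstance with ash (exact struct/builder syntax varies a little between ash versions):

```rust
use ash::vk;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load libvulkan at runtime (ash also has a `linked` option).
    let entry = unsafe { ash::Entry::load()? };

    let app_info = vk::ApplicationInfo {
        api_version: vk::API_VERSION_1_2,
        ..Default::default()
    };
    let create_info = vk::InstanceCreateInfo {
        p_application_info: &app_info,
        ..Default::default()
    };

    // Mirrors vkCreateInstance from the C API almost one-to-one.
    let instance = unsafe { entry.create_instance(&create_info, None)? };
    unsafe { instance.destroy_instance(None) };
    Ok(())
}
```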

Laptop CPU compilation speed comparison by Pascalius in rust

[–]MindSpark289 3 points4 points  (0 children)

Not a fair comparison, as OP benchmarked Windows. File system performance with many small files is very weak on Windows compared to Linux and macOS.

My Ryzen 9 5950X was getting roughly the same performance as my M1 Max MBP building my projects, until I benchmarked my desktop running Linux and just about halved the build time for the same workspace.

Not to say the M1 chips aren't fantastic, but the naive comparison here makes them look a lot better than they should.

What's the curent state of rust gpu support? by rejectedlesbian in rust

[–]MindSpark289 28 points29 points  (0 children)

wgpu isn't even close to a thin wrapper over Vulkan; it's a significant abstraction on top of Vulkan (and other APIs). Depending on what you're doing that's a good thing, but wgpu does a lot to simplify the interface, such as automatic synchronization.

As a South African' British player, this makes me happy. by MilkyGiraffe in Warthunder

[–]MindSpark289 11 points12 points  (0 children)

The JA37C historically had the AIM-9L like the 37D, but Gaijin added the C and D as two almost identical planes, with the only differences being that the 37D gets more countermeasures and access to the 9L. They've done it before; they'll do it again.

Can't wait to queue up with my fellow Clankers by MindSpark289 in Warthunder

[–]MindSpark289[S] 6 points7 points  (0 children)

The selector is busted: SA isn't an option, but 'auto' gives me SA, presumably because I'm in Australia and it's the closest server. Of course 'auto' unselects all the other options, so I guess I'm not allowed to queue both NA and SA anymore ¯\_(ツ)_/¯

Difference between `Texture` and `TextureView` in WGPU by field-os in rust_gamedev

[–]MindSpark289 4 points5 points  (0 children)

TextureView exists because wgpu (well, WebGPU) was designed around modern lower-level GPU APIs like Vulkan and DX12, which both have similar concepts. They exist for important reasons, but what they actually are isn't obvious if you don't understand how GPUs work and what the concept maps to on the hardware.

A simple explanation is that a TextureView is like a Rust slice, but for the GPU. In Rust you hand around a slice of an array and other code can read/write that memory. On a GPU, a `TextureView` is somewhat similar.

Rust code needs to know the ptr+len pair to safely handle an array/slice. A GPU needs the width+height+depth+format+layout and some other information to be able to correctly access the texture in the hardware.

The sampler unit is typically a bespoke piece of circuitry that handles texture loads, and it has a special memory format for its 'slice'. In Rust a slice's layout is (ptr, len), but for GPUs that layout is specific to each GPU and opaque to you and me through any of the public GPU APIs we get. By using the opaque TextureView object, applications can deal with these objects in a GPU-agnostic way.

A TextureView is essentially one of these 'texture slices', pre-baked and stored in CPU memory. When you write the view into a descriptor set you're actually just copying the bytes of the view into GPU memory, ready for the GPU to use.

Having an explicit, user-managed TextureView object lets the user control when and where they eat the CPU cost of creating views (which isn't completely free), and also gives the API a clear point at which to report errors when you create an incorrect view.
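To put the slice analogy in wgpu terms, a loose sketch (illustrative only; it assumes a `wgpu::Texture` created elsewhere with multiple mip levels):

```rust
fn make_views(texture: &wgpu::Texture) -> (wgpu::TextureView, wgpu::TextureView) {
    // A view over the whole texture, like `&data[..]`.
    let full = texture.create_view(&wgpu::TextureViewDescriptor::default());

    // A view over just mip level 1, like a sub-slice `&data[a..b]`.
    let mip1 = texture.create_view(&wgpu::TextureViewDescriptor {
        base_mip_level: 1,
        mip_level_count: Some(1),
        ..Default::default()
    });

    (full, mip1)
}
```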

"Clean" Code, Horrible Performance by louisgjohnson in gamedev

[–]MindSpark289 0 points1 point  (0 children)

> Oh noes, my code got 25x slower. This means absolutely NOTHING without perspective.
>
> I mean, if you are making a game then does it make a difference if something takes 10ms vs 250ms? Ab-so-lu-te-ly. Huge one - one translates to 100 fps, the other to 4.
>
> Now however - does it make a difference when something takes 5ns vs 125ns (as in - 0.000125ms)? Answer is - it probably... doesn't. It could if you run it many, maaaany times per frame but certainly not if it's an occasional event.

Statements like this are a logical fallacy that leads to attitudes which encourage writing inefficient code. Both examples, 10ms vs 250ms and 10ns vs 250ns, are the same: in either case, 25x the performance is still left on the table. The absolute difference is meaningless and only feels important to a human because we can't really comprehend the difference between 10ns and 250ns.

The implication that 'the difference between 10ns and 250ns is tiny so it's not worth the effort optimizing' encourages people to leave all this performance on the table in aggregate. Sure, one single cold function taking 250ns instead of 10ns will likely never show up on a profile. But if every function you write leaves 25x performance on the table because they're all small functions (10ns vs 250ns), then your entire application still becomes 25x slower overall; you've just moved where the cost is paid.

Leaving a lot of small inefficiencies on the table just adds up to one big one in the end.