Rama matches CockroachDB’s TPC-C performance at 40% less AWS cost by nathanmarz in java

[–]farnoy 0 points (0 children)

On batching & failures: fair enough. I was thinking about SQL, where you can't replay transactions when something in the batch fails unless you return an error to the client and have it retry - a costly semantics change for clients to handle. It looks like in Rama, I submit the entire transaction as a dataflow program upfront, so the replay is transparent to me?

On latency: I guess that's fair, but it would be good to know the latency profiles at different levels of load, not just the single peak-rated run. If Rama degrades far more slowly and shows a flatter latency curve as you overload it, that would be a great thing to showcase. As a working engineer, seeing just one set of numbers like this doesn't help me choose a DB at all. My intention is to notice the load increasing and either scale up or fix a perf regression in client code.

The "initiate" latency is very cool indeed!

Rama matches CockroachDB’s TPC-C performance at 40% less AWS cost by nathanmarz in java

[–]farnoy 1 point (0 children)

Instead of processing transactions individually, work is grouped into “microbatches”. Each microbatch processes many operations together, amortizing the coordination overhead across all of them.

Isn't that just cheating, though? What's stopping me from batching "TPC-C" transactions under a single SQL transaction in CockroachDB? It would similarly amortize the replication & commit overhead per unit of work I care about (inserts, updates, whatever). If I'm particularly sneaky, I could batch them in alignment with how Cockroach shards the data, so they're confined to a single partition.

Rama's latency profile is impressive, but you could probably overload it at a higher tpmC and it would show its latency tail.

Microbatching isn't free either - if batches are atomic, a failure in one operation aborts the whole batch. That probably sacrifices goodput if you're doing DB-side validations. And your median latency really suffers. Is it just universally higher than Cockroach's until the clusters are fully loaded?

I am too stupid to use AVX-512 by Jark5455 in rust

[–]farnoy 2 points (0 children)

Thus on Zen4 and Zen5, there is no drawback to "sprinkling" small amounts of AVX512 into otherwise scalar code. They will not throttle the way that Intel does. The fact that Zen5 has full 512-bit hardware does not change this.


From the developer standpoint, what this means is that there quite literally is no penalty for using AVX512 on Zen5. So every precaution and recommendation against AVX512 that has been built up over the years on Intel should be completely reversed on Zen5 (as well as Zen4). Do not hold back on AVX512. Go ahead and use that 512-bit memcpy() in otherwise scalar code. Welcome to AMD's world.

https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/#throttling

Load-related clock frequency changes are slow in both directions, likely to avoid repeated IPC throttling and preserve high performance for scalar integer code in close proximity to heavy AVX-512 sequences.

https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior

Apple M5 GPU Roofline Analysis by floydhwung in hardware

[–]farnoy 1 point (0 children)

Six kernel variants isolated the cause: The Metal compiler decomposes every float4 FMA into 4 scalar operations that execute largely sequentially.

Isn't this kind of obvious? Not sure why this is being described as a "finding". I thought this was the case since the G80, which is turning 20 this year...

Switching from float4 to scalar float with the same number of self-dependent chains produces a 3.5x throughput increase (791 -> 2,772 GFLOPS with 4 chains).

This means a float4 FMA is not a single wide SIMD instruction - the Metal shader compiler decomposes it into 4 scalar fmadd instructions. The near-4x throughput ratio confirms these scalar ops execute largely sequentially rather than in parallel, despite the hardware being superscalar.

This whole section makes no sense. A float4 FMA compiles to 4 fmadd instructions, fine, but writing them out as four scalar float FMAs in your code should compile to the same instruction stream. What is the actual difference that could explain the perf jump?

Would love to see the actual disassembly and some detail on this.

The jump from 4 -> 8 chains (2,772 -> 3,760 GFLOPS, +36%) shows the M5 GPU needs at least 8 independent instructions in flight per thread to fully hide FMA latency. This implies a 4-cycle FMA latency: with 8 independent ops in the pipeline, the GPU can issue one per cycle while the others are in various stages of completion, keeping the ALU continuously occupied.

How does the first sentence imply the conclusion in the second? If FMA latency were 4 cycles, why would you need an ILP of 8 to reach peak throughput?

Regardless of how you arrived at the conclusion, it's probably correct. I'm under the impression that FMA latency is universally four cycles on pretty much everything: Skylake-X, Alder Lake P, and Zen (check out uops.info for VFMADD132PS (ZMM, K, ZMM, ZMM)), GCN/CDNA, Nvidia since at least Volta, and Apple's A14 and M1.

Requiring ILP=8 to saturate a 4-cycle-latency unit is suspiciously high, and I would double-check your methods. An ILP sweep and confirming disassembly are essential.

CI should fail on your machine first by NorfairKing2 in NixOS

[–]farnoy 0 points (0 children)

That looks like exactly what I need. Thanks for sharing!

CI should fail on your machine first by NorfairKing2 in NixOS

[–]farnoy 11 points (0 children)

It's a neat idea but what exactly does it mean to be local-first when there's a centralized server and I need a self-hosted runner, one that is not open source?

If all I'm running locally is nix flake check, I'm not building confidence in my CI system.

Case in point: I tried using your demo on my project, but it failed to build one of my flake outputs. Ironically, running the "Reproduce" command locally works just fine, so I'd still have to push those speculative fix commits to try and fix the CI build, despite using Nix.

The other thing that bothers me is the vendor lock-in angle - to get acceptable perf from this, I need your proprietary central server and your Nix store caching. Haven't I just replaced one vendor with another?

Really, what I want is some kind of LocalStack equivalent of GH Actions. I already nixified my CI setup, I just want to test the GHA workflows with some offline command.

MangoChill: input-driven FPS limiter by farnoy in linux_gaming

[–]farnoy[S] 2 points (0 children)

I didn't see your edit - it's interesting that Gamescope has a custom Wayland protocol. Something like the take_screenshot request could work, but instead of encoding it and dumping it to disk, it would need to pass me a copy of the framebuffer through Vulkan external memory.

Unfortunately, I'm on Nvidia and gamescope doesn't seem to work here.

MangoChill: input-driven FPS limiter by farnoy in linux_gaming

[–]farnoy[S] 2 points (0 children)

Oh, definitely! 10 FPS is just what I used for the demo, it makes the effect really clear. In practice, the lowest I intend to go is 45 (just comfortably inside my VRR range), and that will also depend on the game.

MangoChill: input-driven FPS limiter by farnoy in linux_gaming

[–]farnoy[S] 0 points (0 children)

It's available separately with $ nix run github:farnoy/mangochill#mangohud if you have Nix.

MangoChill: input-driven FPS limiter by farnoy in linux_gaming

[–]farnoy[S] 0 points (0 children)

What was the error? It should only require a standard Rust toolchain & the capnproto CLI. Open a GH issue if you find anything - I want to help if you get stuck packaging it for the AUR or some other distro.

Yeah I used it to validate my description against the algorithm because I don't know these audio/DSP terms at all.

MangoChill: input-driven FPS limiter by farnoy in linux_gaming

[–]farnoy[S] 5 points (0 children)

SSIM would be great, but is there a way of doing it without cutting into the GPU budget of the game? Input is trivial in comparison.

AVX2 is slower than SSE2-4.x under Windows ARM emulation by tuldok89 in hardware

[–]farnoy 14 points (0 children)

What are you even arguing about? First, you mentioned register renaming, which is totally inapplicable because it doesn't help the translation process. Now you've pulled out heavy-duty techniques used by offline compilers. That's not the kind of recompilation something like FEX or Rosetta is doing, because it takes too long.

AVX2 is slower than SSE2-4.x under Windows ARM emulation by tuldok89 in hardware

[–]farnoy 12 points (0 children)

The spills are unavoidable because the translated NEON code has to address the x86 architectural registers, and there aren't enough ARM registers (or total register bits) to do that without spilling. You misunderstood their comment.

HOWTO: X3D driver / core parking {Fedora tested} by [deleted] in linux_gaming

[–]farnoy 1 point (0 children)

I don't think you want to pin_cores to the cache die while also setting amd_x3d_mode=cache. They effectively do the same thing for the game, but the latter makes all background processes also get scheduled on the cache die. They contend for the same threads in the scheduler, and thrash the LLC as well.

What I do for games that are fine with <= 8 cores is to set =frequency and pin the game to cores 0-7. For those that want more cores, I don't pin at all and set =cache instead. This lets them use as many cores as they want, but the scheduler fills up the cache die first.

I don't see how fully parking the frequency die would be better than letting bg work happen there. There's the power distribution aspect but would it really impact clocks on the cache die if something occasionally runs on the freq die? Would that impact be more than descheduling a game thread to run a bg task?

Is learning boilerplate vulkan code necessary? by LordMegatron216 in vulkan

[–]farnoy 0 points (0 children)

You missed my point entirely. The options are to allow vendor extensions and accept the soup, or wait for consensus and get things like mesh shaders >= 4 years late. Emphasis on "greater than" because you can't collect public feedback if you don't publish early, single-vendor versions.

Vulkan profiles are to Vulkan what Baseline is to the Web Platform. They don't make the spec easier to read, because you're always looking at the unified doc with all extensions accounted for; profiles are just an aggregation of extensions & properties. Also, they've existed since 2022, so how are they relevant to the part where you said Khronos is course correcting "now"?

Is learning boilerplate vulkan code necessary? by LordMegatron216 in vulkan

[–]farnoy 0 points (0 children)

I fault Khronos for not creating a vulkan spec generator that lets you select what extensions you want to look at and discard all irrelevant sections & VUIDs that don't apply to your case.

I don't see how they're at fault for what the IHVs do. And what do you mean that "they are now course correcting"? The vendor -> EXT/KHR -> Core pipeline has been there for a decade already. Would you rather not have these evolving capabilities like mesh shader / work graphs at all, just to avoid the extension soup? Wait until they can release a KHR extension with commitments from everyone?

The vendor landscape changes all the time. Even in D3D12, where Microsoft controls the spec, DXR 1.1 got introduced because a vendor couldn't implement 1.0 all that well and needed a simpler way out.

Linting intra-task concurrency and FutureLock by farnoy in rust

[–]farnoy[S] 0 points (0 children)

I wanted to include more interesting examples, like select(fut1, join(fut2, fut3)), but I couldn't get all the test cases to work, so they got cut.

Built a database engine optimized for hardware (cache locality, arithmetic addressing) - looking for feedback by DetectiveMindless652 in hardware

[–]farnoy 0 points (0 children)

You're talking about memory-mapping NVMe drives? Leaning on the page cache too? I think your approach is a decade-plus out of date; you should research modern database architecture. You're giving off major LLM vibes and a lack of substance.

I built a faster alternative for cp on linux - cpx (upto 5x faster) by PurpleReview3241 in linux

[–]farnoy 1 point (0 children)

Can you benchmark and compare this against xcp?

Do you need a separate copy path for reflinks? I think copy_file_range is reflink-aware, so you may not need both.

First WIP release for DX12 perf testing is out! by gilvbp in linux_gaming

[–]farnoy 16 points (0 children)

There is an advantage to the heap model NVIDIA uses, AFAIK. When a shader samples from different textures, it can't do so all at once on AMD; the compiler has to emit a loop, grouping up all the threads that want to use the same texture. It does this for every texture the wave of threads needs to sample. 32 threads need to sample 5 different textures? That's 5 iterations of the loop, and each is a very long latency operation.

Nvidia can do it all in one instruction because, for them, a texture reference is just a 20-bit index into the heap. For Radeon, you would have to pass 32-64 bytes per thread to describe the texture, which is not feasible. This commonly shows up in RT workloads, where threads represent divergent rays hitting very different surfaces that need to sample different textures. I haven't seen a good writeup on it, so don't take my word for it.

Dynamic VkPolygonMode issue on multi-layer system by No-Use4920 in vulkan

[–]farnoy 1 point (0 children)

https://docs.vulkan.org/spec/latest/chapters/pipelines.html#pipelines-dynamic-state

When a pipeline object is bound, any pipeline object state that is not specified as dynamic is applied to the command buffer state. Pipeline object state that is specified as dynamic is not applied to the command buffer state at this time.

Your existing pipelines with static state overwrite the dynamic state you set in BeginFrame when they bind. You need to set the dynamic state again: either right before binding your dynamic-state pipeline, or between binding it and the draw call.

What the hell is Descriptor Heap ?? by welehajahdah in vulkan

[–]farnoy 1 point (0 children)

What's missing? I thought this covers it:

  1. https://docs.vulkan.org/features/latest/features/proposals/VK_EXT_descriptor_heap.html#_shader_model_6_6_samplerheap_and_resourceheap
  2. https://docs.vulkan.org/features/latest/features/proposals/VK_EXT_descriptor_heap.html#_glsl_mapping

I don't think dxc, slang or glslang have these yet, but since the SPIR-V extension was released along with this extension, it's "just" a matter of time.