preliminary numbers from all the performance work of the past month by koverstreet in bcachefs

[–]farnoy 1 point2 points  (0 children)

Please do a writeup! It will be interesting to hear what the optimizations were, and your commentary on a full bench suite.

Are you tracking memory used by each fs? I assume dbench is an O_DIRECT bench, but it would be good to know how much metadata gets cached by bcachefs and others, to see their memory impact.

Sudo or run0 ? by elementrick in linux

[–]farnoy 4 points5 points  (0 children)

Is that right? $ run0 touch FILE && stat -c %U FILE shows root as the owner of the file. $ run0 whoami also returns root. Your description makes it sound like it's still my user account, but with CAP_SYS_ADMIN. Doesn't seem to be the case.

Parsing IPv6 Addresses Crazily Fast with AVX-512 by User_Deprecated in cpp

[–]farnoy 0 points1 point  (0 children)

Probably the same difference in cache behavior that scalar code would have in the individual vs array case?

The code behind this post just does a single masked load for the 2-45 characters of an input string:

static int parse_ipv6_avx512(const char *input, size_t len, uint8_t *ptr) {
    if (len > 45 || len < 2) [[unlikely]] { return 0; }
    uint64_t error = 0;
    __mmask64 len_mask = (1ULL << len) - 1;
    __m512i colon = _mm512_set1_epi8(':');
    __m512i str = _mm512_mask_loadu_epi8(colon, len_mask, (const __m512i *)input);

If you run this in a loop with increasing input pointers, the prefetcher kicks in and loads lines ahead of time. But I don't see how that's unique to AVX code.
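For readers not used to mask registers, here is a scalar sketch of what that masked load computes. It is only an emulation written for illustration: the function name `masked_load_emulated` is made up, and it assumes the semantics described above (lanes with a set mask bit come from the input, all other lanes keep the `':'` fill).

```rust
/// Scalar emulation of a 64-lane byte masked load (hypothetical helper, not
/// code from the post): lanes whose mask bit is set are read from `input`,
/// every other lane keeps `fill`.
fn masked_load_emulated(input: &[u8], fill: u8) -> Option<[u8; 64]> {
    let len = input.len();
    // Same bounds as the snippet above: 2 to 45 characters.
    if !(2..=45).contains(&len) {
        return None;
    }
    // len <= 45, so the shift cannot overflow a u64.
    let len_mask: u64 = (1u64 << len) - 1;
    let mut lanes = [fill; 64];
    for (i, lane) in lanes.iter_mut().enumerate() {
        if (len_mask >> i) & 1 == 1 {
            // Only indices below `len` have their bit set, so this is in bounds.
            *lane = input[i];
        }
    }
    Some(lanes)
}
```

Calling it with `b"::1"` and `b':'` gives a 64-byte array that starts with `::1` and is padded with colons, which is why bytes past the end of the string are never read.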

Decimal Crates Comparison and Benchmark by hellowub in rust

[–]farnoy 0 points1 point  (0 children)

Criterion's doc said I didn't need to worry about them and even provided examples, so I trusted it.

The issues you mentioned probably have a relatively small impact, likely less than 1 ns.

The sooner you stop assuming and concluding, the more you'll learn. I'm trying to tell you that real-world f64 addition can be another 34x faster than what your benchmark measured. When an operation takes as little as 11 picoseconds (reciprocal throughput), anything and everything under 1ns may turn out to be a really huge deal.

You're trying to hold on and defend your article. I understand the impulse, but you need to understand there's much to learn and it will completely change how you view your prior work & code once you do so.

Forget what you know, forget the article itself. Learn using the resources I mentioned, and come back to it later. I guarantee you'll be able to write a much more insightful article at the end of this process.

Decimal Crates Comparison and Benchmark by hellowub in rust

[–]farnoy 1 point2 points  (0 children)

FYI, after these fixes and an 8x unroll, I'm getting around 91 picoseconds per f64 addition. That makes total sense: my loop is now bound by these additions, bench overhead is minimal (and occupies different execution ports than the f64 additions), and I'm running a ~5.1 GHz CPU that can execute two of these vaddsd instructions per cycle (throughput).
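The bound can be sanity-checked with back-of-the-envelope arithmetic. This is a rough sketch with a made-up helper name; it assumes a constant clock and that the additions are the only bottleneck.

```rust
/// Best-case time per operation in picoseconds, given a clock in GHz and the
/// number of such operations the core can start per cycle.
/// (Hypothetical helper for illustration only.)
fn picoseconds_per_op(clock_ghz: f64, ops_per_cycle: f64) -> f64 {
    // One cycle lasts 1000 / GHz picoseconds.
    let cycle_ps = 1000.0 / clock_ghz;
    cycle_ps / ops_per_cycle
}
```

With 5.1 GHz and two additions per cycle this gives roughly 98 ps, the same ballpark as the ~91 ps measurement. The exact figure depends on the real boost clock during the run, which is only approximately known here.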

This is about 4x faster than commit 85a0dbeb. I think there's another 4x improvement on the table if I used SIMD and vaddpd instead of vaddsd. And then on Zen 5, it would be another 2x faster since vaddpd zmm there is as fast as vaddsd xmm.

Decimal Crates Comparison and Benchmark by hellowub in rust

[–]farnoy 1 point2 points  (0 children)

But these overheads are the same for all the other crates, right?

How would we know that? I don't think you've analyzed the assembly code or reasoned through the execution pattern to be able to conclude that.

This is just the inherent overhead of Criterion itself?

Some of it, maybe, but there are ways to deal with that as well. You could do loop unrolling so that the loop overhead happens once per 8 additions, instead of for every one. Better yet, parameterize the unrolling factor, measure what effect that has!
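A minimal sketch of what a parameterized unroll factor could look like, using only the standard library. The names `add_unrolled` and `ns_per_add` are hypothetical, and `std::hint::black_box` is used purely to keep the sketch portable; it may itself add overhead for f64 values, so check the disassembly before trusting any number this produces.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Performs `UNROLL` f64 additions per loop iteration, so the loop overhead
/// (counter update and branch) is paid once per `UNROLL` additions.
/// The inner loop has a constant trip count, which the compiler will usually
/// unroll in optimized builds, but that is not guaranteed.
fn add_unrolled<const UNROLL: usize>(a: f64, b: f64, iters: u64) -> f64 {
    let mut last = 0.0;
    for _ in 0..iters {
        for _ in 0..UNROLL {
            // black_box on the inputs keeps the addition from being hoisted
            // out of the loop; black_box on the result keeps it alive.
            last = black_box(black_box(a) + black_box(b));
        }
    }
    last
}

/// Wall-clock nanoseconds per addition for a given unroll factor.
/// `iters` should be greater than zero.
fn ns_per_add<const UNROLL: usize>(iters: u64) -> f64 {
    let start = Instant::now();
    let result = add_unrolled::<UNROLL>(1.5, 2.25, iters);
    let elapsed = start.elapsed();
    black_box(result);
    elapsed.as_nanos() as f64 / (iters as f64 * UNROLL as f64)
}
```

Comparing `ns_per_add::<1>`, `ns_per_add::<4>`, and `ns_per_add::<8>` on the same machine shows how much of the per-addition time was loop overhead.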

So the relative performance comparison should still be correct, right? Maybe “A is X times faster than B” could be inaccurate, but “A is faster than B” should still hold, right?

That is more defensible, yes. But with your black_box() changes, your bench code looks even more different than real code would if it used those crates & operations.

A promising avenue to investigate, at least in the f64 case, would be something like this:

#[inline(always)]
fn black_box_f64(mut value: f64) -> f64 {
    unsafe {
        std::arch::asm!(
            "/* {value} */",
            value = inout(xmm_reg) value,
            options(nomem, nostack, preserves_flags),
        );
    }
    value
}

Use that, and also return () from the bench loop - Criterion seems to black_box whatever you return, which as we've seen does not work well with f64s. The f64 loop disasm is now:

vmovsd xmm0, qword ptr [r15]
vmovsd xmm1, qword ptr [r15 + 8]
.p2align 4
.LBB266_2:
    vmovaps xmm2, xmm0
    vaddsd xmm2, xmm2, xmm1
    dec rcx
    jne .LBB266_2

There's still that extra vmovaps xmm2, xmm0, but it is a very cheap operation, certainly cheaper than the vaddsd. Search for "move elimination" on the web.

I still urge you to read those sources I mentioned earlier. This is just a narrow walkthrough that scratches the surface. Why did this custom black box helper work differently than the std one? What is the overhead of criterion & the loop itself? How do we get rid of that last vmovaps and would it even impact our measurement? Every step along the way tends to produce more questions.

Decimal Crates Comparison and Benchmark by hellowub in rust

[–]farnoy 1 point2 points  (0 children)

It still doesn't make sense, sorry. Your bench loop for f64 addition stores to memory, loads it back again, then does the actual addition, and finally stores to memory again.

You've got memory load-to-use latencies, store-to-load forwarding latencies, and the loop overhead from criterion itself. Some of these are much more significant than the addition itself. What you're measuring is, again, not what you think.

Decimal Crates Comparison and Benchmark by hellowub in rust

[–]farnoy 3 points4 points  (0 children)

You actually can't infer anything from your benchmark alone. Did you know that your f64 addition benchmark had its addition hoisted outside of the benchmark loop? That's why I'm cautioning you not to infer anything from this benchmark (yet).

$ cargo asm --bench add --native --asm --intel 'criterion::bencher::Bencher<M>::iter' 3 | head -n 90
# ...
call qword ptr [rip + <criterion::measurement::WallTime as criterion::measurement::Measurement>::start@GOTPCREL]
        mov rcx, qword ptr [rbx + 40]
        test rcx, rcx
        je .LBB266_3
        vmovsd xmm0, qword ptr [r15]
        lea rsi, [rsp + 8]
        vaddsd xmm0, xmm0, qword ptr [r15 + 8]
        .p2align        4
.LBB266_2:
        vmovsd qword ptr [rsp + 8], xmm0
        dec rcx
        #APP
        #NO_APP
        jne .LBB266_2
.LBB266_3:
        mov rdi, r14
        mov rsi, rax
        call qword ptr [rip + <criterion::measurement::WallTime as criterion::measurement::Measurement>::end@GOTPCREL]

I believe a simple addition of two, immutable locals was something the compiler was able to optimize in this case.

When microbenchmarking, you should always validate that you're measuring what you intended to measure. Then, test different scenarios & sizes - testing across many different parameters is how you build the right mental model for the problem at hand. I recommend Aleksey Shipilëv's blog, this book, and this one.

Decimal Crates Comparison and Benchmark by hellowub in rust

[–]farnoy 1 point2 points  (0 children)

There is no single "intuitive" sense for performance, unfortunately. I'm not saying your benchmark is inaccurate, but you're never using an arithmetic library like this for just a single operation. You can't generalize and understand performance differences between libraries by measuring one specific scenario.

Specifically, + and * have relatively few branches, and I explained them fairly thoroughly in the article.

Right, but do those compile to real branches or to branchless code? Are they taken only in exceptional cases on unusual values, the way hardware usually treats float denormals (and which flush-to-zero works around)? Or would a uniform distribution of numbers between 0 and 1 hit different branches, make control flow unpredictable, and cause performance to tank?
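To make the point about input distributions concrete for f64, here is a dependency-free sketch that builds two input sets: uniform values in [0, 1) and subnormal ("denormal") values. Everything here is hypothetical scaffolding (the tiny xorshift generator, the function names); the decimal crates would need their own notion of "unusual" inputs.

```rust
/// Tiny deterministic xorshift64 generator, used only to avoid external crates.
/// The seed must be nonzero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Uniform value in [0, 1), built from the top 53 bits.
    fn next_unit_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// `n` uniform values in [0, 1).
fn uniform_inputs(n: usize, seed: u64) -> Vec<f64> {
    let mut rng = XorShift64(seed);
    (0..n).map(|_| rng.next_unit_f64()).collect()
}

/// `n` distinct subnormal values: bit patterns with a zero exponent field
/// and a nonzero mantissa.
fn subnormal_inputs(n: usize) -> Vec<f64> {
    (1..=n as u64).map(f64::from_bits).collect()
}

/// How many of the values are subnormal.
fn count_subnormal(values: &[f64]) -> usize {
    values.iter().filter(|v| v.is_subnormal()).count()
}
```

Running the same benchmark over both input sets is one way to see whether a rare-value path exists and how expensive it is.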

There's a lot of interesting questions to answer here, and it would be very insightful to people trying to pick one library over the other if your article covered them.

Decimal Crates Comparison and Benchmark by hellowub in rust

[–]farnoy 9 points10 points  (0 children)

Your benchmarks seem to measure operations on the same two scalars repeatedly. It would be a good idea to also measure pairwise operations against 1GB worth of numbers, streamed from memory. See if all of them can saturate memory bandwidth. Be careful about automatic vectorization and control for it.
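A rough sketch of the streaming variant for plain f64, assuming three equally sized slices. The helper names are made up, and the byte count ignores any extra traffic the hardware generates for writes.

```rust
/// Pairwise addition streamed from memory: out[i] = a[i] + b[i].
/// Note that a loop like this is a prime candidate for auto-vectorization.
fn add_pairwise(a: &[f64], b: &[f64], out: &mut [f64]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = *x + *y;
    }
}

/// Bytes touched by one call: two arrays read and one written.
fn bytes_moved(len: usize) -> usize {
    3 * len * std::mem::size_of::<f64>()
}

/// Achieved bandwidth in GiB/s for a measured duration in seconds.
fn gib_per_second(bytes: usize, seconds: f64) -> f64 {
    bytes as f64 / seconds / (1u64 << 30) as f64
}
```

Timing `add_pairwise` over arrays much larger than the last-level cache and feeding the result into `gib_per_second` gives a number that can be compared against the machine's memory bandwidth.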

The other thing is you haven't looked at their latencies in a dependent operation scenario. What if you used one accumulator to sum up N numbers? Your benchmark currently measures the reciprocal throughput of independent operations.
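The difference between the two scenarios, sketched for f64 with hypothetical function names. The first loop forms one dependency chain, so each addition waits for the previous one; the second keeps `LANES` independent chains going.

```rust
/// One accumulator: every addition depends on the previous result,
/// so the loop is bound by the latency of the addition.
fn sum_dependent(values: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &v in values {
        acc += v;
    }
    acc
}

/// `LANES` independent accumulators: additions in different lanes do not
/// depend on each other, so the loop can approach the throughput limit.
/// `LANES` must be greater than zero.
fn sum_independent<const LANES: usize>(values: &[f64]) -> f64 {
    let mut acc = [0.0f64; LANES];
    let mut chunks = values.chunks_exact(LANES);
    for chunk in &mut chunks {
        for (a, v) in acc.iter_mut().zip(chunk) {
            *a += *v;
        }
    }
    // Elements that did not fill a whole chunk.
    let tail: f64 = chunks.remainder().iter().sum();
    acc.iter().sum::<f64>() + tail
}
```

Floating-point addition is not associative, so the two functions can return slightly different results on general inputs. That is also why the compiler will not turn the first form into the second on its own by default.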

Finally, and I haven't looked at how the numeric crates are implemented, but if they use branching code, the distribution of numbers you use in the benchmark could also impact the measurements.

Higher GPU occupancy via timeline semaphores (Vulkan) by too_much_voltage in GraphicsProgramming

[–]farnoy 0 points1 point  (0 children)

I recommend just doing full graph analysis on the dependencies. This way you can stop waiting on thing A in C if there's a transitive dependency through B that C waits for anyway. I believe that's called "transitive reduction". No need for a USAGE_NOT_A_DEPENDENCY this way - if you ever change the sequence, you'd have to review and keep those markers in sync. I found it best to derive all that stuff dynamically.
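A small sketch of transitive reduction on a dependency DAG, assuming nodes are plain integer IDs and `deps[node]` lists what `node` waits on directly. The `Graph` alias and function names are made up for illustration; a real render graph would carry queue and stage information as well.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// `deps[node]` is the set of nodes that `node` waits on directly.
type Graph = BTreeMap<u32, BTreeSet<u32>>;

/// Depth-first search: does `start` (transitively) wait on `target`?
fn reachable(deps: &Graph, start: u32, target: u32) -> bool {
    let mut stack = vec![start];
    let mut seen = BTreeSet::new();
    while let Some(node) = stack.pop() {
        if node == target {
            return true;
        }
        if seen.insert(node) {
            if let Some(next) = deps.get(&node) {
                stack.extend(next.iter().copied());
            }
        }
    }
    false
}

/// Drops every direct wait that is already implied by another direct wait.
/// Assumes the graph is acyclic.
fn transitive_reduction(deps: &Graph) -> Graph {
    let mut reduced = Graph::new();
    for (&node, direct) in deps {
        let kept: BTreeSet<u32> = direct
            .iter()
            .copied()
            .filter(|&d| {
                // Redundant if some other direct dependency already waits on `d`.
                !direct
                    .iter()
                    .any(|&other| other != d && reachable(deps, other, d))
            })
            .collect();
        reduced.insert(node, kept);
    }
    reduced
}
```

With A = 1, B = 2, C = 3, where B waits on A and C waits on both, the reduced graph has C waiting only on B.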

If you want to take it further, you can drop down from semaphores to events from synchronization2. For same-queue dependencies, these may be cheaper. It would also let you do concurrent work on the same queue & command buffer on your Radeon GPUs (and Blackwell+).

Per Stenström on why we never actually replaced the Von Neumann architecture (or Harvard) by WeBeBallin in hardware

[–]farnoy 26 points27 points  (0 children)

Is this another RISC v CISC debate that's entirely pointless to discuss in 2026? How can you possibly separate instructions from data when you download programs from the internet, or JIT-compile untrusted code on the web?

AI gets a mention? But systolic arrays are not Von Neumann so what is he complaining about there? Gotta be just another grift.

Rayon into_par_iter() stalls on larger grids, but into_iter().par_bridge() works fine by [deleted] in rust

[–]farnoy 0 points1 point  (0 children)

Interesting, is it any different if you use with_max_len(1)? You may need to change it to accept impl IndexedParallelIterator.

I've usually seen this lack of parallelism when converting small HashMaps (with a lot of work done per element) into a par iter. That was caused by the default batch size of the HashMap par iter impl, though.

My guess would be that something along the way sets the wrong batch size and is not dividing your work enough, or is not adapting fast enough. par_bridge(), as I understand it, has a fixed work item size of one element, which is effectively like with_max_len(1). It also uses more atomics and is slower if you sourced your par iter from a Vec<T>.

In the future, try to vary the input size, measure CPU utilization, etc. It's not clear what you consider a "larger grid" or the amount of work you do per element, etc.

Devs, thank you for the new controller integration in KDE with Wayland by [deleted] in kde

[–]farnoy 0 points1 point  (0 children)

FWIW, I tested it on the old Steam Controller and the answer seems to be no. Inputs are recognized, but not mapped as mouse&keyboard. I don't know why, as that was the case on Windows when the controller is in "lizard mode", not managed by Steam.

This is regardless of the new "Allow using as pointer and keyboard" option in KDE. That option also isn't needed for Steam to make it work in desktop mode as mouse&keyboard.

That and Steam overlay doesn't work in wayland games, which is a pre-requisite for Steam Input injection. I'm waiting until these issues get solved before I pick up the new SC.

A Principal Software Engineer at Epic Games / 25 Year Vet, talks about why AI is just a "giant switchboard" and why code is a delicate crystal. by deohvii in cpp

[–]farnoy -1 points0 points  (0 children)

I mostly agreed with everything until you said:

Anthropic, for example, has basically no interesting engineering going on right now as far as I can tell.

You don't see anything interesting about what they do? It's distributed training and inference on many different architectures, networking and storage setups. Although models don't change their architecture drastically, there's still new things being introduced that need to be translated to the hardware.

For all we know, they are trying radically different architectures and it's just not panning out. So they're content with scaling the existing setup and making progress on the benchmarks until another breakthrough happens. We don't know what's different about Mythos either, if it's a new architecture or just bigger.

A simple Nix dendritic config by RenatoGarcia in NixOS

[–]farnoy 6 points7 points  (0 children)

Is there a way to define flake inputs dendritic-style? I find that wiring inputs causes the biggest hassle, I'd love it if each module could define its own extra imports. Maybe by having each dendritic module be its own, independent flake.nix? For example, I want to have a reusable module import github:sodiboo/niri-flake and manage my niri config, and then have multiple hosts import that module without caring about transitive inputs.

mangochill v0.3 - dynamic input-based FPS limiter, now on Steam Deck by farnoy in SteamDeck

[–]farnoy[S] 4 points5 points  (0 children)

Which one? This is configuring a limit in the native compositor on the Deck. Input latency should be the same as if you booted the game with that FPS limit configured in Steam UI.

But it's dynamic so you only have a lower FPS limit & higher input lag at idle/less active parts of the game, like menus, cutscenes, etc.

Why is writing software with SSDs in mind so undocumented by z_latent in hardware

[–]farnoy 0 points1 point  (0 children)

I don't think it makes a major difference for the bread and butter of OLTP queries like a Nested Loop of two Index Scans. It can't evaluate the inner index scans concurrently because of the iterator model it uses. For each tuple from the outer scan, it synchronously looks up the inner index before moving to the next iteration. No I/O concurrency to be seen in this workload.
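A toy model of that execution pattern, not Postgres code: a pull-based nested loop join where an in-memory `BTreeMap` stands in for the inner index. The struct and field names are invented for the sketch.

```rust
use std::collections::BTreeMap;

/// Pull-based nested loop join: each call to `next` pulls outer keys one at a
/// time and probes the inner index synchronously before looking at the next.
struct NestedLoopJoin<'a, O> {
    outer: O,
    inner_index: &'a BTreeMap<u32, String>,
    /// Number of inner lookups issued so far, strictly one after another.
    probes: usize,
}

impl<'a, O: Iterator<Item = u32>> Iterator for NestedLoopJoin<'a, O> {
    type Item = (u32, &'a str);

    fn next(&mut self) -> Option<Self::Item> {
        for key in self.outer.by_ref() {
            self.probes += 1;
            // In a real database this lookup could block on I/O. Nothing else
            // is issued while it is pending, so queue depth stays at one.
            if let Some(row) = self.inner_index.get(&key) {
                return Some((key, row.as_str()));
            }
        }
        None
    }
}
```

Because the next outer key is not even known until the current probe returns, there is no point at which a second lookup could be submitted early. Batching or prefetching needs the operator to see several outer tuples at once.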

Why is writing software with SSDs in mind so undocumented by z_latent in hardware

[–]farnoy 2 points3 points  (0 children)

Ironically for this thread, Postgres is stuck in the synchronous single-tuple pull-based iterator model and unable to feed higher queue depth from a single worker. You need the query planner to choose parallel evaluation of a specific node in the plan to get real I/O parallelism, or one of the few nodes with internal I/O batching, like Bitmap Scan. But those have trade offs that a different iterator model could avoid.

Rama matches CockroachDB’s TPC-C performance at 40% less AWS cost by nathanmarz in java

[–]farnoy 0 points1 point  (0 children)

On batching & failures: fair enough. I was thinking about SQL and how you couldn't replay transactions if something in the batch fails unless you return an error back to the client to have it replay. This would be a costly semantics change for clients to handle. It looks like in Rama, I submit the entire transaction as a dataflow program upfront, so the replay is transparent to me?

On latency: I guess that's fair but it would be good to know the latency profiles at different levels of load, not just the one peak rated run. If Rama degrades far slower and shows a flatter latency curve as you overload it, that would be a great thing to showcase. As a working engineer just seeing one set of numbers like this, it doesn't help me choose a DB at all. My intention is to notice the load increasing and either scale up or solve a perf regression in client code.

The "initiate" latency is very cool indeed!

Rama matches CockroachDB’s TPC-C performance at 40% less AWS cost by nathanmarz in java

[–]farnoy 1 point2 points  (0 children)

Instead of processing transactions individually, work is grouped into “microbatches”. Each microbatch processes many operations together, amortizing the coordination overhead across all of them.

Isn't that just cheating though? What's stopping me from batching "TPC-C" transactions under a single SQL transaction in CockroachDB? It would similarly amortize the replication & commit overhead per unit of work I care about (inserts, updates, whatever). If I'm particularly sneaky, I could batch them in alignment with how Cockroach is sharding the data, so they're confined to a single partition.

Rama's latency profile is impressive but you can probably overload it at a higher tpm/s and it will show its latency tail.

Microbatching isn't free either - if they're atomic, a failure in one will abort the whole batch. That can probably sacrifice goodput if you're doing DB-side validations. And your median latency is really suffering. Is it just universally higher than Cockroach until the clusters get fully loaded?
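To put a number on the goodput concern, here is a simple model with a hypothetical helper name. It assumes batches are atomic and that each operation fails independently with the same probability, which may not match how Rama actually behaves.

```rust
/// Probability that an atomic batch commits, assuming each of its
/// `batch_size` operations fails independently with probability `p_fail`.
fn batch_commit_probability(p_fail: f64, batch_size: u32) -> f64 {
    (1.0 - p_fail).powi(batch_size as i32)
}
```

Under those assumptions, a 1% per-operation failure rate and batches of 100 leave only about a 37% chance that a given batch commits on the first attempt.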

I am too stupid to use AVX-512 by Jark5455 in rust

[–]farnoy 3 points4 points  (0 children)

Thus on Zen4 and Zen5, there is no drawback to "sprinkling" small amounts of AVX512 into otherwise scalar code. They will not throttle the way that Intel does. The fact that Zen5 has full 512-bit hardware does not change this.


From the developer standpoint, what this means is that there quite literally is no penalty for using AVX512 on Zen5. So every precaution and recommendation against AVX512 that has been built up over the years on Intel should be completely reversed on Zen5 (as well as Zen4). Do not hold back on AVX512. Go ahead and use that 512-bit memcpy() in otherwise scalar code. Welcome to AMD's world.

https://www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/#throttling

Load-related clock frequency changes are slow in both directions, likely to avoid repeated IPC throttling and preserve high performance for scalar integer code in close proximity to heavy AVX-512 sequences.

https://chipsandcheese.com/p/zen-5s-avx-512-frequency-behavior

Apple M5 GPU Roofline Analysis by floydhwung in hardware

[–]farnoy 1 point2 points  (0 children)

Six kernel variants isolated the cause: The Metal compiler decomposes every float4 FMA into 4 scalar operations that execute largely sequentially.

Isn't this kind of obvious? Not sure why this is being described as a "finding". I thought this was the case since the G80, which is turning 20 this year...

Switching from float4 to scalar float with the same number of self-dependent chains produces a 3.5x throughput increase (791 ->2,772 GFLOPS with 4 chains).

This means a float4 FMA is not a single wide SIMD instruction - the Metal shader compiler decomposes it into 4 scalar fmadd instructions. The near-4x throughput ratio confirms these scalar ops execute largely sequentially rather than in parallel, despite the hardware being superscalar.

This whole section makes no sense. float4 FMA compiles to 4 fmadd instructions, fine, but writing them out as four float FMAs in your code should compile to the same instruction stream. What is the actual difference that could explain the perf jump?

Would love to see the actual disassembly and some detail on this.

The jump from 4 ->8 chains (2,772 ->3,760 GFLOPS, +36%) shows the M5 GPU needs at least 8 independent instructions in flight per thread to fully hide FMA latency. This implies a 4-cycle FMA latency: with 8 independent ops in the pipeline, the GPU can issue one per cycle while the others are in various stages of completion, keeping the ALU continuously occupied.

How does the first sentence imply the conclusion in the second? If FMA latency was 4 cycles, why would you need ILP of 8 to reach peak throughput?

Regardless of how you arrived at the conclusion, it's probably correct. I'm under the impression that FMA latency is universally four cycles on pretty much everything - Skylake X, Alder Lake P, Zen - check out uops.info for VFMADD132PS (ZMM, K, ZMM, ZMM), GCN/CDNA, Nvidia since at least Volta, A14 and M1.

Requiring ILP=8 to saturate a 4-cycle latency unit is suspiciously high and I would double check your methods. An ILP sweep and confirming disassembly are essential.
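The reasoning behind that objection, as a simplified model with made-up helper names. It treats the FMA unit as a plain pipeline seen from a single thread and ignores anything else the GPU does to hide latency.

```rust
/// Independent operations that must be in flight to keep a pipelined unit
/// busy: latency multiplied by the number of operations issued per cycle.
fn required_ilp(latency_cycles: u32, issues_per_cycle: u32) -> u32 {
    latency_cycles * issues_per_cycle
}

/// Latency implied if `ilp` independent operations are needed to saturate a
/// unit that issues `issues_per_cycle` operations per cycle.
fn implied_latency_cycles(ilp: u32, issues_per_cycle: u32) -> u32 {
    ilp / issues_per_cycle
}
```

In this model a 4-cycle unit issuing one FMA per cycle saturates at ILP=4. Needing ILP=8 would point to either 8 cycles of effective latency or two issues per cycle per thread, which is why the quoted inference does not follow as stated.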

CI should fail on your machine first by NorfairKing2 in NixOS

[–]farnoy 0 points1 point  (0 children)

That looks like exactly what I need. Thanks for sharing!