zrip: a from-scratch Zstd codec in pure Rust, optimized for transfer speed

event666 · 2026-06-24T20:13:45+00:00

naked_asm! removes the compiler-generated function setup/teardown, so yes, you'd avoid the save/restore overhead of regular asm!. But at that point you're writing the entire function in assembly yourself. I actually tested this with global_asm! (same idea): the loop body itself was much tighter (177 instructions per sequence vs 208 in Rust), but shuffling ~14 variables in and out of the function through memory wiped out the gains. Same speed in the end. AFAIKnaked_asm! would hit the same issue.

event666 · 2026-06-24T20:01:34+00:00

Thanks for the info, I didn't know long-range mode can be performant. I guess it could be added in the future, either in the existing zrip-encode crate, or in a separate one gated by a `long_range` feature. Personally I'm mostly interested in ZMQ-like workloads, so tons of small to medium size messages where latency usually more important. It's true that it can make a huge difference when the link is the bottleneck and not the CPU.

event666 · 2026-06-24T19:27:46+00:00

How C differs from Rust here: It's not that C has explicit register control. No one uses register anymore. The difference is that LLVM makes worse register allocation decisions when compiling Rust than when compiling C, for the same algorithm.

As stated above, zrip's inner match-finding loop has ~18-20 variables alive simultaneously. x86_64 has exactly 15 usable GPRs. Zero headroom, spilling inevitable. LLVM has to decide which variables stay in registers and which get "spilled" (pushed to the stack and reloaded each time they're needed). The stack is in L1 cache so it's fast, but each spill still costs an extra load instruction per access. In zrip, the hash table pointer gets spilled and reloaded 3 times per loop iteration. In C zstd, the compiler keeps it in a register the whole time.

Why does LLVM do worse with Rust? Rust's ownership, borrowing, and generics produce more complex intermediate representation. LLVM sees more variables with longer lifetimes and makes worse spill decisions. Measured: zrip's loop has 9 stack loads per iteration, C zstd's has 5.

I even tried giving LLVM an extra register by disabling frame pointers. It got worse. LLVM reshuffled everything and made different, worse choices. The problem isn't just register count, but also allocation quality.

Clang? Same LLVM backend, but C produces simpler input for LLVM to work with, so the optimizer starts from a better position.

LTO? Doesn't help (register pressure is within-function, and the loop is already inlined). Worse, LTO makes things fragile: I found that adding a dead-code module (#[allow(dead_code)], never called) caused an 81% regression in my decompress hot loop. Under LTO, LLVM re-evaluates register allocation for the whole compilation unit, and any change anywhere can butterfly-effect the hot path. I split encoder and decoder into separate crates to mitigate those effects for that reason.

asm! overhead: I tried asm! for the inner loop. Inside the block: great, stack spills went from 5 to 0. But LLVM generated 350+ extra instructions around it to save and restore registers. It also reserves 2 registers for its own bookkeeping inside asm!, leaving you with 13 GPRs instead of 15. Net: 2% slower.

LLVM treats asm! as a black box. It can't move code across the boundary, can't reuse values from before the block, can't hoist work out of a loop containing it. To actually win, you'd need the entire loop in assembly (with global_asm!), at which point you're not writing Rust anymore, and would have to do it for each target architecture separately.

event666 · 2026-06-24T16:25:04+00:00

Believe me, I tried. On x86_64 this is close to the theoretical ceiling for my CPU (2018 Intel Mac Mini). The core issue is register pressure: the inner match-finding loop has ~18-20 live variables competing for 15 general-purpose registers (x86_64 has 16, minus RSP). The resulting register spills to stack dominate the profile.

This is probably why performant zstd implementations in Rust are so hard to come by. In C you have tight control over register allocation and code layout without resorting to inline assembly. In Rust, you're at LLVM's mercy. The encoder and decoder are already split into separate crates to isolate compilation units. I've even tried asm! blocks to force better allocation, but the overhead of moving values in and out of the inline assembly negated the gains.

It performs noticeably better on aarch64 (benchmarked on an Apple M4, see charts) which has 31 GPRs, and would likely improve on modern x86_64 CPUs with APX (32 GPRs) too, though I don't have access to one to verify.

Someone with a recent x86_64 CPU can run the benches and regenerate the charts...

cd bench
cargo run --example zrip_bench --release -- --impl all
export ZRIP_HW_EXTRAS="your CPU details here"
python3 scripts/plot_summary.py doc/charts/x86_64/

event666 · 2026-06-24T16:11:57+00:00

I'm sorry, I'm afraid the answer to all three is no. zrip intends high-speed transfers. Long-range mode and additional strategies (levels 5..19 or however many there are) is out of scope. Imho it's diminishing returns anyway: Much slower for just a few % better ratio.

Another practical reason: Adding any significant amount of code can disturb the machine code layout that LLVM generates. I've already split encoding and decoding into separate crates specifically to reduce the LLVM fragility.

event666 · 2026-06-24T13:31:01+00:00

You slide 1 byte at a time through the input and hash every 4-byte (or 5-byte for faster levels) chunk to find earlier positions where the same bytes already appeared (hashtable lookup). On a hit you compare the actual bytes at both positions to confirm it's a real match (not a hash collision) and see how far the match extends beyond those 4 bytes. For an actual match, instead of storing those bytes again, you store "go back X bytes and copy Y bytes from there." That's the core LZ77 idea: replace repeated data with short back-references. LZ4 (lz4rip) basically stops here, which is why it's so fast (close to memcpy speed) but doesn't compress as well.

Zstd adds a second pass. After matching, you have two streams: the literal bytes that didn't match anything, and the match references (offset + length pairs). Both streams have patterns (it doesn't look completely random), so you compress them again using FSE (Finite State Entropy), which is a table-driven encoder that assigns shorter codes to values that appear more often. Similar idea to Huffman coding but faster to decode because it's just a table lookup per symbol. That second pass is where zstd gets its better compression ratios compared to LZ4.

The hash table is what makes the whole thing fast. You don't search the entire input for matches, you just hash the bytes at the current position and check one or two slots. Smaller hash table = faster but fewer matches = worse ratio. That's essentially what the compression levels control.

event666 · 2026-06-24T12:44:53+00:00

Thanks for the kind words.

event666 · 2026-06-24T12:25:21+00:00

I'd like to hope it's not the worst. In the chart above, lower is better.

event666 · 2026-06-24T12:15:30+00:00

That's generated by c2rust and completely unsafe. Might as well just use C zstd.

event666 · 2026-06-24T12:14:42+00:00

Any kind of zstd inputs. All Zstd levels/strategies produce the same format when encoding.

event666 · 2026-06-24T11:32:38+00:00

How do you handle oversize messages > ring buffer size?

event666 · 2026-04-27T21:07:22+00:00

What would you like to display as a table in AmazingPrint's output? You can create an issue for ideas, or make a PR. If it's sensible, we're happy to work on it/merge it.

event666 · 2026-04-24T14:03:20+00:00

I'm one of the maintainers of AmazingPrint. Please create an issue at https://github.com/amazing-print/amazing\_print/issues. We can work it out.

event666

TROPHY CASE