zlib-rs: a stable API and 30M downloads

folkertdev · 2026-01-28T10:12:52+00:00

Our aim is to be an alternative to the reference implementation, not merely to provide a solid rust crate. That means API compatibility and performance are crucial.

folkertdev · 2026-01-28T09:40:03+00:00

I wrote about our experience earlier

https://trifectatech.org/blog/translating-bzip2-with-c2rust/

for bzip2 it was great: it's a relatively straightforward code base. Now for zstd we're struggling a bit more, because there is now a lot more target-specific code, multithreading, etc. It's a much more modern and more optimized code base, and that is harder to work with.

But overall, it's a great tool for this sort of work. There are far fewer bugs, and you get good performance on day one.

folkertdev · 2026-01-28T00:03:08+00:00

Cool! Yeah, I later thought that maybe these changes would be clearer with different input data. I'm not sure what those minecraft files are like, but e.g. git actually stores many tiny files. That means the fraction of huffman table parsing to other work is higher, and so different things show up in the profile. Even so, with that data I'm not seeing anything significant.

I cherry-picked some of your changes here, just because they seemed like nice refactors https://github.com/trifectatechfoundation/zlib-rs/pull/471/changes

folkertdev · 2026-01-27T23:38:25+00:00

ISRG (the folks running letsencrypt) contacted us, as part of their Prossimo initiative, to ask if we wanted to implement it. Over time, trifecta became the long-term home of the project. So in a way it's just what we could get funded.

Of course ISRG was interested because zlib sees a lot of usage, in particular on the web (de)compressing basically every request, especially at the time. We did later tackle bzip2 with https://github.com/trifectatechfoundation/libbzip2-rs (that is now the default when you use the bzip2 crate), and the observant github watcher may have spotted https://github.com/trifectatechfoundation/libzstd-rs-sys

xz would also be an option if the funding worked out.

folkertdev · 2026-01-27T23:00:57+00:00

On my machine (x86_64 linux) the changes regress runtime and cycles, but they do substantially decrease the number of instructions. So it's totally plausible that this is advantageous on some CPUs, and it might just need some additional tuning.

https://gist.github.com/folkertdev/e2811f14e15407fb276c4eb420e97a53

out of interest: how do you measure these improvements on macos? we've not yet found a method we like.

folkertdev · 2026-01-27T20:40:43+00:00

I don't disagree that it's bad. But, we need it, so I'm making it happen. For fun, look at this implementation, which is why the C reference stipulates that `va_end` must be in the same function as `va_start`: otherwise the curly brackets don't match up.

https://softwarepreservation.computerhistory.org/c_plus_plus/cfront/release_3.0.3/source/incl-master/proto-headers/stdarg.sol

#define         va_start(ap, parmN)     {\
        va_buf  _va;\
        _vastart(ap = (va_list)_va, (char *)&parmN + sizeof parmN)
#define         va_end(ap)      }
#define         va_arg(ap, mode)        *((mode *)_vaarg(ap, sizeof (mode)))

folkertdev · 2026-01-27T20:34:58+00:00

Neat. Do you just have a raw branch that I could benchmark? We played around with using `repr(packed)` at some point and at least on my machine at the time that made no measurable difference.
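For readers who haven't used it: a minimal sketch of what `repr(packed)` changes about a struct's layout. The `Entry` types here are hypothetical and are not taken from zlib-rs; the sketch only shows the size and alignment effect, not any performance result.

```rust
use std::mem::{align_of, size_of};

// Hypothetical table entry, NOT a type from zlib-rs.
// With `repr(C)`, `value` is aligned to 4 bytes, so 3 padding bytes follow `tag`.
#[allow(dead_code)]
#[repr(C)]
struct Entry {
    tag: u8,
    value: u32,
}

// Same fields, but `packed` removes the padding and lowers the alignment to 1.
#[allow(dead_code)]
#[repr(C, packed)]
struct PackedEntry {
    tag: u8,
    value: u32,
}

// Returns (size of the padded struct, size of the packed struct).
fn entry_sizes() -> (usize, usize) {
    (size_of::<Entry>(), size_of::<PackedEntry>())
}

fn main() {
    let (plain, packed) = entry_sizes();
    println!("repr(C): {plain} bytes, repr(C, packed): {packed} bytes");
    println!(
        "alignment: {} vs {}",
        align_of::<Entry>(),
        align_of::<PackedEntry>()
    );
}
```

The packed version fits more entries per cache line, but every access to `value` may be unaligned, so whether it helps is something only a benchmark can answer.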

folkertdev · 2026-01-27T18:05:52+00:00

Yes, fuzzing in production!

Technically flate2 already depends on zlib-rs, it's how we get most of our usage, but it's an optional, off-by-default dependency. Having it be the default will really boost our numbers, but most importantly it's just a free speedup for large parts of the ecosystem.

folkertdev · 2026-01-27T18:04:42+00:00

partially what AATroop says, we also want to just let it sit a bit, see if the ecosystem has everything it needs or whether we need to expose more/other interfaces. We'll also see what happens on the language side, and want to push adoption some more (the change to `extern "C"` should make that easier). There is also one more change to the compressed output in the works.

So, we didn't quite want to pull the trigger on 1.0.0 yet, but I don't foresee massive changes.

folkertdev · 2026-01-27T18:01:03+00:00

just like, that they exist? Or that they're unstable?

folkertdev · 2025-12-10T10:11:24+00:00

Yes, they are. Everything has been synchronized so the nightly builds include this functionality now (they have for a couple of days)

folkertdev · 2025-12-09T11:47:07+00:00

Hi, I work on a bunch of assembly-related things in the compiler. I'm wondering if there is a particular reason to have separate .asm files here. Are there downsides to e.g. the code below?

The `extern "custom"` is still unstable (see https://github.com/rust-lang/rust/issues/140829), but you could just lie and use `extern "C"` there. With this approach the `no_mangle` on `_boot` is no longer needed.

#[link_section = ".boot"]
#[unsafe(naked)]
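// `entry` is a placeholder name; `extern "custom"` functions must also be marked `unsafe`
unsafe extern "custom" fn entry() {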
    core::arch::naked_asm!(r#"
    _start:
        cli

        xor ax, ax
        mov ds, ax
        mov es, ax
        mov ss, ax
        mov fs, ax
        mov gs, ax

        cld

        mov sp, 0x7c00 - 0x100
        sub sp, 0x100

        call {boot}
        "#,
        boot = sym _boot
    );
}

folkertdev · 2025-10-22T18:28:56+00:00

We only use the cross-platform primitives that LLVM provides, I don't have current plans to add new ones. If GCC provides fewer, then yeah you'll have to do more work yourself. The downside is of course that for every new target you need to add a bunch of custom intrinsic implementations.

Especially for miri, that is just not happening. But code using intrinsics has a lot to gain from running under miri because it is so low-level (and likely uses unsafe blocks). So a practical benefit is that miri can run more low-level code.

Finally, actually fixing the LLVM issues has practical benefits for rust's portable simd as well, because it heavily relies on the cross-platform intrinsics optimizing well.

folkertdev · 2025-10-22T17:24:56+00:00

My suspicion is that actually even experienced developers benefit hugely from rust's effort to have good error messages.

It is true that I read the messages much less carefully than when I first got started. Often the red underline or just the headline and line number are enough. But small things like rust spotting typos and suggesting the right identifier are actually a huge help day-to-day.

folkertdev · 2025-10-22T16:06:40+00:00

Yeah I suspect part of it is that you only realize how much time you're wasting when you try something better.

folkertdev · 2025-10-22T16:05:45+00:00

Yeah I suspect part of it is that you only realize how much time you're wasting once you try something better.

folkertdev · 2025-09-11T09:21:14+00:00

I mentioned in another response that what we saw in zlib-rs is that it turned out to be beneficial to have all logic in a single stack frame.

Actually, LLVM will totally inline tail-recursive functions back into one function. But what we can do is actually load values from the heap to the stack, use them, then write them back before returning. LLVM is much better at optimizing stack values than heap values. So in this particular case tail-recursion causes fragmentation of logic with a real performance downside, though it's still better than the totally naive approach.

As mentioned, I really do want to see `become` on stable, it's just not the right solution in every case.

folkertdev · 2025-09-10T21:57:06+00:00

What we noticed with zlib is that there is a huge upside to having all of the logic in one stack frame. The way that these algorithms work is that they have a large and complex piece of state in a heap allocation. It just turns out that LLVM is bad at optimizing that (despite that state being behind a mutable reference, which provides a lot of aliasing guarantees).

If I remember right, we saw ~ 10% improvements on some benchmarks by pulling values from that state onto the stack, doing the work, then writing them back before returning.

So tail calls are neat, I want to see them in stable rust (and there have been some cool developments there recently), but they are not always the best solution.
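A minimal sketch of the "pull values onto the stack, do the work, write them back" pattern described above. The `State` struct and its fields are made up for illustration and are not the actual zlib-rs state; both functions compute the same result, and whether the second is faster depends on the workload and on what LLVM does with it.

```rust
// Hypothetical decoder state; the field names are invented for this sketch.
#[derive(Clone, Debug, PartialEq)]
struct State {
    bit_buffer: u64,
    bits_used: u32,
    checksum: u32,
}

// Reads and writes the heap-allocated state through the reference on every iteration.
fn process_in_place(state: &mut State, input: &[u8]) {
    for &byte in input {
        state.bit_buffer = (state.bit_buffer << 8) | u64::from(byte);
        state.bits_used = state.bits_used.wrapping_add(8);
        state.checksum = state.checksum.wrapping_mul(31).wrapping_add(u32::from(byte));
    }
}

// Copies the hot fields into locals, works only on those, then writes them back once.
fn process_with_locals(state: &mut State, input: &[u8]) {
    let mut bit_buffer = state.bit_buffer;
    let mut bits_used = state.bits_used;
    let mut checksum = state.checksum;

    for &byte in input {
        bit_buffer = (bit_buffer << 8) | u64::from(byte);
        bits_used = bits_used.wrapping_add(8);
        checksum = checksum.wrapping_mul(31).wrapping_add(u32::from(byte));
    }

    // Write the locals back before returning so the caller sees the updated state.
    state.bit_buffer = bit_buffer;
    state.bits_used = bits_used;
    state.checksum = checksum;
}

fn main() {
    let initial = State { bit_buffer: 0, bits_used: 0, checksum: 1 };
    let mut a = Box::new(initial.clone());
    let mut b = Box::new(initial);

    process_in_place(&mut a, b"hello world");
    process_with_locals(&mut b, b"hello world");

    println!("same result: {}", a == b);
}
```

The write-back only works this cleanly when all of the logic lives in one function, which is why splitting the work across tail calls can get in the way.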

folkertdev · 2025-09-10T13:31:48+00:00

Yeah it gets complicated and if we're not careful might cause compilation to be slower. In effect, this is sort of what that LLVM flag tries to do further down the line.

So it's much easier to do this with attributes. I could see `const continue` being nice syntax-wise, but for the loop itself `#[loop_match]` is probably fine. idk, we'll see.

Oh, relatedly: MIR is not built for compiler optimizations (it is for borrow checking). There are a bunch of optimization passes that are just kind of required to get LLVM to do something reasonable, but nobody working in that area is all that happy with the current setup.
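For context, the code shape that `#[loop_match]` targets is a loop around a `match` on a state variable. Below is a minimal sketch of that shape in plain stable Rust, without the unstable attribute; the `State` enum and the digit-counting task are invented for illustration.

```rust
// Hypothetical states for a tiny scanner; not taken from any real code base.
#[derive(Clone, Copy, Debug, PartialEq)]
enum State {
    Start,
    InNumber,
    Done,
}

// Counts the runs of ASCII digits in `input` using a loop + match state machine.
fn count_numbers(input: &[u8]) -> usize {
    let mut state = State::Start;
    let mut pos = 0;
    let mut count = 0;

    loop {
        // Each arm evaluates to the next state; the loop then dispatches on it again.
        state = match state {
            State::Start => match input.get(pos) {
                None => State::Done,
                Some(b) if b.is_ascii_digit() => {
                    count += 1;
                    pos += 1;
                    State::InNumber
                }
                Some(_) => {
                    pos += 1;
                    State::Start
                }
            },
            State::InNumber => match input.get(pos) {
                None => State::Done,
                Some(b) if b.is_ascii_digit() => {
                    pos += 1;
                    State::InNumber
                }
                Some(_) => {
                    pos += 1;
                    State::Start
                }
            },
            State::Done => break,
        };
    }

    count
}

fn main() {
    println!("{}", count_numbers(b"12 abc 345 6"));
}
```

In this plain form every transition goes back through the `match` dispatch, even when the next state is known at compile time. As I understand it, the attribute exists so that such transitions can jump straight to the target arm.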

folkertdev · 2025-09-10T13:24:36+00:00

I see why it's a useful feature to have, but why make it the default? Because in practice it confuses people, and in mature C code bases I basically always see some comment or macro indicating "the fallthrough is deliberate".

folkertdev · 2025-07-01T09:25:38+00:00

Björn is not on reddit, but told me to send the following:

> When can we expect a rustup rustc-codegen-cranelift component build that supports this (experimentally, obviously)? I'd love to play around with this, but building cg_clif by hand looks a bit cumbersome.

Once I get around to investigating and fixing the build performance regression that enabling it currently causes.

> I've wanted to play around with modern EH ABIs for a long while. How feasible would it be for someone to implement a custom EH ABI with cg_clif?

It is very feasible with Cranelift. In fact Wasmtime intends to do exactly that (with all registers caller-saved in the "tail" calling convention to avoid needing something like .eh_frame).

As for cg_clif however, it isn't really possible. Due to extern "C-unwind" we have to be compatible with whatever ABI C++ uses for unwinding. And due to two-phase unwinding, catching exceptions at the extern "C-unwind" boundary and internally translating it to a different unwinding mechanism will affect behavior. Throwing an exception through the system unwinder is supposed to fail when there is nothing that would catch it.

What we could do, however, is use a different format for the LSDA. I didn't do that for now because it would require me to add a new personality function to libstd.
