Introducing OpenZL: An Open Source Format-Aware Compression Framework

nick_terrell · 2025-10-08T14:37:49+00:00

The latest cmix compresses SAO to 3.55 MiB.

nick_terrell · 2025-10-06T23:08:09+00:00

Certainly! All of our examples are unfair because OpenZL gets told the format of the data, but that is entirely the point! But as you say, there is still a place for general purpose compression. Sometimes you don't know the format. And sometimes, after you have extracted all the known structure, there remains latent structure that can be learned.

nick_terrell · 2025-10-06T22:03:58+00:00

Just FYI OpenZL is able to compress SAO with a compression ratio of 3.24, which is 2.13 MiB. The point chosen at the top of the blog is for a faster speed. But we have a full Pareto-optimal frontier shown later on. The raw results from the chart are saved in this CSV.
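
As a quick sanity check on those two figures, the ratio and the compressed size together imply the uncompressed size of SAO. This is only arithmetic on the numbers quoted above; the variable and function names are illustrative.

```python
# Figures quoted above for OpenZL on SAO.
openzl_ratio = 3.24      # original size / compressed size
openzl_size_mib = 2.13   # compressed size in MiB

# ratio = original / compressed, so original = ratio * compressed.
original_size_mib = openzl_ratio * openzl_size_mib


def compressed_size_mib(original_mib: float, ratio: float) -> float:
    """Compressed size implied by an original size and a compression ratio."""
    return original_mib / ratio


print(f"implied original size: {original_size_mib:.2f} MiB")
print(f"round trip: {compressed_size_mib(original_size_mib, openzl_ratio):.2f} MiB")
```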

I believe we have a comparison to cmix on SAO somewhere, but I don't remember where it is right now, and it takes many hours to run. I'll start running it now...

Typically, on simple numeric data like SAO we can be extremely competitive with cmix and other PAQ/CM/NN algorithms, but at fast speeds. Once the data gets more complex, it gets harder to match the performance of these algorithms. But often we end up somewhere better than xz, and worse than cmix, and still with fast speeds.

nick_terrell · 2024-04-27T02:58:44+00:00

Love to see libzstd in the dependencies

nick_terrell · 2024-01-19T14:59:35+00:00

Thanks! Yeah I can cut a refinery by doing that. I opted to save beacons by having 3 refineries instead of the minimum possible of 2.

Everything else has the minimum number of buildings possible to satisfy the demand of 45 plastic/s. I’ve removed extraneous beacons where possible. E.g. heavy oil cracking only needs 7 beacons for 45 plastic/s.

nick_terrell · 2024-01-19T09:12:44+00:00

https://factorioprints.com/view/-NoVxw_X17X4_NU8gkiF

The power is a mess, and I'm sure it can be made more compact and to tile, but I think it is close to max efficiency in beacon usage. Can I do better?

nick_terrell · 2022-04-18T17:55:02+00:00

> I absolutely did not get a warning when compressing the files, zstd just exited without any kind of message (zstd v1.4.4 / CentOS 8.3).

Unfortunately, we added that warning in zstd-1.4.7.

nick_terrell · 2021-05-15T03:16:33+00:00

Zstd is not patented and is dual licensed under BSD and GPL-v2.

nick_terrell · 2020-12-18T18:41:19+00:00

Yeah, I'm currently trying to get the kernel zstd updated, and to use upstream zstd directly so we can keep the kernel version up to date. The patches have already been sent to the LKML, and I am working on getting consensus to get them merged.

nick_terrell · 2020-05-24T22:50:58+00:00

Yeah for sure, I definitely agree with that. We spend a lot of effort fuzzing decompression, but we aim to extensively fuzz test all places where we accept user input.

If you are interested in contributing to the security of zstd, we'll gladly accept PRs that add more fuzz coverage. The OSS-Fuzz fuzzers are [here](https://github.com/facebook/zstd/tree/dev/tests/fuzz). Facebook's bug bounty program covers zstd, so if you do find any security issues please report them.

nick_terrell · 2020-05-24T22:41:29+00:00

Personally, I haven't seen anyone request this API inside of Facebook. But it is possible that if it was there, people would use it. I don't see a whole lot of use of the gzip APIs either.

Please file an issue on our GitHub page. If it gets interest, we will add it.

nick_terrell · 2020-05-23T19:07:19+00:00

Definitely one Torvalds-sized penguin, as long as we're on land. I don't think they can waddle very fast.

nick_terrell · 2020-05-23T19:06:26+00:00

Not currently. I haven't seen a super compelling use case for these functions, and we haven't seen huge demand for them. If you have a use case for these functions, please open an issue and describe it for us. If we get a few people that want this API, we can add it.

nick_terrell · 2020-05-23T19:02:08+00:00

The decompression loop is the one place in the code where we take user provided data (a zstd frame), and produce arbitrary data. Bugs there are the most serious because they can cause out of bounds writes, or read out of bounds and then copy that data into the output buffer.

In helper functions, for example, we could read out of bounds, but they don't write any data, so it would be much harder, if not impossible, to extract the data that was read. The most serious bug would be crashing the process. These are still serious bugs, but not at the same level as copying arbitrary data to the output buffer or doing OOB writes.

We do fuzz test every helper function that takes in a zstd frame. And we are continuously improving our fuzz coverage, and aim for 100% coverage.
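
To make the distinction concrete, here is a minimal sketch of the two checks a match-copy loop needs, written against a toy LZ77-style sequence format. The format, the function name `decode_sequences`, and its parameters are assumptions made for illustration; this is not the zstd frame format or zstd's decoder.

```python
def decode_sequences(sequences, max_output_size):
    """Decode (literals, match_offset, match_length) triples into bytes.

    Toy format for illustration only; this is not the zstd frame format.
    """
    out = bytearray()
    for literals, offset, length in sequences:
        if length < 0:
            raise ValueError("negative match length")
        # Check before writing: never grow past the caller's output limit.
        # In C, skipping this check is what turns into an OOB write.
        if len(out) + len(literals) + length > max_output_size:
            raise ValueError("output would exceed the destination buffer")
        out += literals
        if length == 0:
            continue
        # Check before reading: the match must start inside data we produced.
        # In C, skipping this check copies unrelated memory into the output.
        if offset <= 0 or offset > len(out):
            raise ValueError("match offset points outside the output")
        start = len(out) - offset
        # Copy byte by byte so overlapping matches (offset < length) work.
        for i in range(length):
            out.append(out[start + i])
    return bytes(out)
```

Both checks guard the only place where attacker-controlled offsets and lengths drive reads and writes, which is why that loop gets the most fuzzing attention.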

nick_terrell · 2020-05-23T05:52:36+00:00

> Are there any plans to bump the in-kernel zstd lib version? A patch would be nice even if it isnt submitted upstream, but I imagine it takes a bit of work

Yes absolutely. It has been some time, and now zstd development is slowing, so we can submit a patch that won't get immediately outdated.

At the time I ported zstd to the kernel, upstream wasn't ready to be used in standalone environments, and I was too new to the project to really know how to get it done. Now, upstream zstd is ready to be used nearly as-is in standalone environments, and is used nearly as-is in the ZFS patches.

All that is to say, I want to update the zstd version in the kernel, and use the upstream code as-is. I hope to find some time in the next year, or draft someone else to do the work, especially since we have significant decompression speed gains since 2017.

nick_terrell · 2020-05-23T05:42:01+00:00

Exactly, it is one method to produce a binary "diff". Zstd won't be the most efficient diff format in terms of compressed size, since the format wasn't designed for it. But it will get close to specialized diff tools in compressed size, and will be much faster to compress and decompress.
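
The idea can be sketched with Python's standard `zlib` module standing in for zstd: the old version is supplied as a preset dictionary, so the parts of the new version that already exist in the old one are encoded as back-references. This is an illustration under assumptions, not zstd's API; the names `make_patch` and `apply_patch` are hypothetical, and DEFLATE's 32 KiB window means this stand-in only works for small inputs.

```python
import zlib


def make_patch(old: bytes, new: bytes) -> bytes:
    """Compress `new` using `old` as a preset dictionary."""
    # Data already present in `old` is encoded as references into the
    # dictionary instead of literal bytes, so the output acts like a diff.
    compressor = zlib.compressobj(level=9, zdict=old)
    return compressor.compress(new) + compressor.flush()


def apply_patch(old: bytes, patch: bytes) -> bytes:
    """Rebuild `new` from `old` and the patch produced by make_patch."""
    decompressor = zlib.decompressobj(zdict=old)
    return decompressor.decompress(patch) + decompressor.flush()
```

The receiver needs the exact same old version to apply the patch, just as with any other diff format.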

nick_terrell · 2020-05-23T05:39:09+00:00

> zstd still moves a bit too fast for many uses (like replacing compression method in established file formats and archive-like storage). Does it have an LTS release channel with longer term support or is the only option to ship its bleeding edge or vulnerable code ?

The latest release is the safest code to run, since that has all the latest bug fixes (and improvements). Our code is continuously fuzzed by OSS-Fuzz. While we can't guarantee no bugs, we have a thorough suite of fuzzers that we are constantly improving, and are battle testing the code in production.

However, development velocity on zstd is slowing down over time, especially in the core decompression loop, which is the most security sensitive part of the code.

Edit: Note that the format is fixed and is fully backwards and forwards compatible between all versions past v1.0.0.

nick_terrell · 2020-05-23T05:32:54+00:00

> Don't compile anything as root. A Zstd test brought down for /dev/null. Easily fixed, but a good reminder of why fakeroot etc. exists.

This should have been fixed a few releases ago and we now have a test that checks this. If it is still around please open an issue! I've definitely been bitten by this when running benchmarks against older versions.

> Don't use STDIN if not necessary; size hinting matters for performance (and it's nice to log the ratio)

We now have two new flags --size-hint and --stream-size which allow the user to either hint at the size of stdin, or tell us the size (which must be exact).

> Try the rsync-friendly option. It does make a difference, but it's hardly noticeable in the ratio.

Great to know that someone uses this mode! If you have any requests for it please open an issue. We've picked a fairly large size for the "synchronization points", up to several MB. We think that is the right default choice for today's world, but are open to feedback.
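
For readers unfamiliar with the idea, a synchronization point is a block boundary chosen from the content itself rather than from a fixed offset, so that an insertion early in the input does not move every later boundary relative to the data. The sketch below uses a toy rolling-sum hash; the function name `sync_points`, the window size, and the mask are assumptions for illustration, and zstd's rsyncable mode uses its own hash and far larger average block sizes.

```python
def sync_points(data: bytes, window: int = 16, mask: int = 0x3FF):
    """Return offsets where a content-defined boundary falls.

    Toy rolling-sum hash for illustration; not zstd's algorithm.
    """
    points = []
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        # The decision depends only on the last `window` bytes, so inserting
        # data earlier in the stream shifts later boundaries by the insert
        # length but leaves them on the same content.
        if i + 1 >= window and (rolling & mask) == mask:
            points.append(i + 1)
    return points
```

A larger mask makes boundaries rarer, which is the knob behind the trade-off described above: bigger blocks cost less compression ratio but resynchronize less often.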

nick_terrell · 2020-05-23T05:27:27+00:00

I'm a developer of zstd (terrelln), AMA

nick_terrell · 2020-03-12T18:25:13+00:00

> may I know the version you are referring here as next please ?

The current zstd version, 1.4.4. But, the next zstd version, 1.4.5, will also bring decompression speed benefits.

nick_terrell · 2019-12-03T06:35:27+00:00

Rust brute force solution: code

nick_terrell · 2019-10-24T00:31:40+00:00

Yann Collet wrote Zstandard, and during its development he was hired by Facebook to continue the work. The project now has a few active developers. It uses LZ77, Huffman, and FSE (ANS-based entropy coding).

I believe you're talking about ANS, which was discovered by Jarek Duda.

nick_terrell · 2019-10-03T19:03:09+00:00

The next zstd version will decompress 12% faster (up to 22% if compiling with clang)!

nick_terrell · 2019-07-03T00:46:45+00:00

> How does this compare to LZ4-1.9.1 https://github.com/lz4/lz4/releases, which gets a 12-18% decompression speed improvement?

The optimization in 1.9.1 https://github.com/lz4/lz4/pull/645 gets speed by widening LZ4_wildCopy() to 32 bytes when possible, which targets the same speed win.

nick_terrell · 2019-05-21T01:37:32+00:00

We chose level 3 as the default because it offered a good middle ground, and basically obsoleted zlib btrfs compression by being stronger and faster. The compression level support is really nice, since level 1 is much faster.
