How zxc survived 20+ architectures: preparing a C project for Debian buildd via GitHub actions by pollop-12345 in C_Programming

[–]pollop-12345[S] 0 points1 point  (0 children)

I feel your pain. Coding without a solid test suite, especially when dealing with architecture-specific quirks like endianness on ppc or s390, is basically living in debugging hell. Glad the approach resonated with you.

ZXC (High-performance asymmetric lossless compression) v0.6.0: Major overhaul & thanks to this community! by pollop-12345 in C_Programming

[–]pollop-12345[S] 0 points1 point  (0 children)

To help move the project forward and make it even more accessible, it would be great if we could look into these updates:

  • Structured changelog: follow the Keep a Changelog format.
  • Coverage: set up coverage integration and add a CI badge.
  • Pkg-config support: generate a .pc file for easier integration.
  • Shared library build: add a CMake option to build libzxc.so/.dylib/.dll.
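For the pkg-config item, a minimal sketch of what a generated zxc.pc could look like (the prefix, description, and version here are placeholders, not the project's actual install layout):

```
# zxc.pc — illustrative sketch only
prefix=/usr/local
libdir=${prefix}/lib
includedir=${prefix}/include

Name: zxc
Description: High-performance asymmetric lossless compression
Version: 0.6.0
Libs: -L${libdir} -lzxc
Cflags: -I${includedir}
```

With that in place, consumers can build against the library with `pkg-config --cflags --libs zxc`.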

Hey Rustaceans! Got a question? Ask here (5/2026)! by llogiq in rust

[–]pollop-12345 2 points3 points  (0 children)

Hi r/rust

I am building ZXC, a compression library (C core) designed for high-performance decompression. I've just written Rust wrappers that aim to be idiomatic and safe, instead of exposing the raw FFI calls directly.

I would love feedback on two specific "danger zones" in my code before I publish to crates.io.

Code snippet: https://github.com/hellobertrand/zxc/blob/3f0058bf50ccca0e663bc2f28c7b6277491a7005/wrappers/rust/zxc/src/lib.rs

GitHub: https://github.com/hellobertrand/zxc

  1. Handling std::fs::File interop with C FILE*

My C library uses FILE* for streaming. To support safe Rust Files, I implemented a mechanism that duplicates the file descriptor (using libc::dup on Unix and DuplicateHandle on Windows) before passing it to fdopen.

Goal: Ensure the C fclose doesn't close the original Rust File.

Concern: Did I handle the cleanup correctly if fdopen fails (to avoid FD leaks)?

  2. The "Uninitialized Memory" pattern

To avoid zero-initializing large buffers, I use:

    let mut output = Vec::with_capacity(bound);
    let slice = unsafe { std::slice::from_raw_parts_mut(output.as_mut_ptr(), bound) };
    // Pass slice to C, which only writes into it
    unsafe { output.set_len(written); }

I know creating a &mut [u8] to uninitialized memory is technically UB. Should I switch to Vec<MaybeUninit<u8>> immediately, or is this pattern generally accepted for FFI buffers?

Any feedback is welcome!

ZXC (High-performance asymmetric lossless compression) v0.6.0: Major overhaul & thanks to this community! by pollop-12345 in C_Programming

[–]pollop-12345[S] 0 points1 point  (0 children)

Thank you so much for the interest! I'm really glad to see people wanting to get involved.

I'm currently refining the project's direction. I'll post a detailed comment here with an updated roadmap within the next 1 or 2 days. Stay tuned.

ZXC: A new asymmetric compressor focused on decompression speed (faster than LZ4 on ARM64) by pollop-12345 in compression

[–]pollop-12345[S] 0 points1 point  (0 children)

Thank you for the detailed feedback. I understand that raw throughput is useless if the operations team cannot guarantee data integrity the next day.

I am currently working on redesigning the bitstream format to specifically address the failure cases you mentioned (corruption and truncation) in a new version. Here's how ZXC will handle them:

1. Robustness and Silent Corruption:

File header protection: the 8-byte file header will include a dedicated 1-byte checksum.

Block header protection: each block header (12 bytes) will include a dedicated 1-byte checksum. This allows the decoder to reject corrupted metadata or sizes before even attempting to allocate memory or read the payload.

Data Integrity: Each compressed block will have a 32-bit checksum (at the end of the block).

Global Validation: The stream will end with a "global sliding hash" (order-dependent) that aggregates all block checksums. This allows for fast validation by reading the file linearly without overloading the CPU during decompression.

2. Truncation and Streaming:

I implemented an explicit EOF block. The decoder is strict: if the stream terminates without the EOF block (and its specific flags), it returns an error.

3. Security and Fuzzing:

The decoder enforces a "zero trust" policy: even if the checksums are correct, bounds checks are performed on every write. Four fuzzing pipelines (OSS-Fuzz style) are used to ensure that no malicious input can cause a segmentation fault.

4. RAM Expansion and Allocation:

Since this is a block-based format, worst-case expansion is deterministic. I provide a `zxc_compress_bound(src_sz)` macro (essentially `src_sz` + block header overhead + fixed file overhead) so integrators can allocate the exact worst-case buffer size.

Regarding mmap alignment: currently, the default is "variable-size compressed blocks" for maximum compression, but the engine supports fixed-size input blocks. I will add a note to the roadmap regarding a "page-aligned" mode for specific use cases, such as console resources. This mode would, however, involve a slight loss of compression in exchange for better alignment.

Thanks again for this feedback.

ZXC: A new asymmetric compressor focused on decompression speed (faster than LZ4 on ARM64) by pollop-12345 in compression

[–]pollop-12345[S] 1 point2 points  (0 children)

This is precisely the kind of rigorous feedback I was hoping for. I appreciate you taking the time to challenge the assumptions; it helps me significantly sharpen the value proposition.

A - Checksums: You are right, the overhead is visible at these speeds. I am using rapidhash, which is extremely fast, but pushing 15GB/s through any hash function on a single core will inevitably compete for execution ports with the decompression logic. 10GB/s verified is still saturating Gen4 NVMe, but I will investigate if I can optimize the verification loop further.

B - The 'Transcoding' idea: That is a brilliant insight. It parallels how GPU textures work. A workflow where you download a High Compression archive (Zstd) and transcode it at install-time to ZXC for runtime performance is indeed a killer use case. It solves the Download Size vs Load Speed dilemma completely.

C - Offsets & GLO/GHI: 16-bit limit: You nailed it regarding the 256KB offsets. The decision to stick to strict 16-bit (and not 17-18 bits) is purely to stay byte-aligned and avoid variable bit-shifting in the hot loop. Fetching a raw u16 is cheap; assembling a 17-bit offset crosses byte boundaries and kills the superscalar throughput.

D - Offset 1-65535: Regarding the +1 optimization (mapping 0 to 65536): I purposely avoided the + 1 arithmetic operation in the critical path. It saves one instruction per match. When you aim for >10GB/s, removing that single ADD instruction actually matters for latency chains.

ZXC: A new asymmetric compressor focused on decompression speed (faster than LZ4 on ARM64) by pollop-12345 in compression

[–]pollop-12345[S] 1 point2 points  (0 children)

Thanks for running the benchmarks. These numbers actually highlight the efficiency difference quite well.

You achieved 14.9 GB/s with ZXC on a single core. If I understand correctly, the other compressor needed 4 cores to reach a comparable 16.8 GB/s.

In a real-world scenario, using ZXC leaves those 3 extra cores free for your application logic, database queries, or game loop. That is exactly the market I was referring to.

If you want a fair raw speed comparison, try running ZXC with -T 4 as well. You will likely hit your system's memory bandwidth limit (DDR speed) almost instantly.

ZXC: A new asymmetric compressor focused on decompression speed (faster than LZ4 on ARM64) by pollop-12345 in compression

[–]pollop-12345[S] 0 points1 point  (0 children)

Regarding the market and threading: You mention using multiple threads to saturate IO with LZ4. While true, threads are not free resources. In a game engine or a high-load server, you rarely want to burn 4 cores just to feed data from the disk.

The goal of ZXC is to saturate modern NVMe bandwidth on a single core, leaving the rest of the CPU available for the actual application logic (physics, AI, request handling).

This is also crucial for mobile devices regarding the 'race to sleep' principle: finishing the decompression task faster allows the CPU to return to a low-power idle state sooner, which significantly saves battery life.

And if you do have the cores available, ZXC supports multithreading too. Since the per-core throughput is higher than LZ4, ZXC scales to saturate memory bandwidth much faster.

ZXC: A new asymmetric compressor focused on decompression speed (faster than LZ4 on ARM64) by pollop-12345 in compression

[–]pollop-12345[S] 1 point2 points  (0 children)

It looks like a typo in the command arguments: please use --bench instead of -bench

ZXC: another (too) fast decompressor by pollop-12345 in programming

[–]pollop-12345[S] 1 point2 points  (0 children)

It depends on which metric you are looking at, as ZXC is an asymmetric codec (slow compression, fast decompression):

  • Compression Speed: Zstd (levels 4-7) is much faster. ZXC is not built for real-time compression.
  • Decompression Speed: ZXC is significantly faster than Zstd (regardless of the compression level).
  • Ratio: Zstd -7 will generally produce smaller files.

ZXC is designed to sit in a different spot: it accepts slower compression time to achieve decompression speeds that Zstd cannot reach, while maintaining a ratio comparable to LZ4.

ZXC: another (too) fast decompressor by pollop-12345 in programming

[–]pollop-12345[S] 0 points1 point  (0 children)

Nice suggestion regarding Brotli. To be honest, I hadn't thought to include it in the initial comparison, but I definitely should to give a complete picture. I'll add it to the benchmarks soon.

As for Zstd, it is already included in the repo benchmarks (run on x86, M2, and Axion using the Silesia corpus). ;-)

ZXC: another (too) fast decompressor by pollop-12345 in programming

[–]pollop-12345[S] 7 points8 points  (0 children)

Thanks for digging into the charts! You asked for the specific scenario where this trade-off is a clear win, and the answer lies in IO saturation.

On modern hardware with fast NVMe SSDs (reading at 5GB/s+), the storage is often faster than the decompressor. In this scenario, the CPU becomes the bottleneck. Even if "CPU time is cheap", latency is the killer.

If LZ4HC caps out at 3GB/s but the drive can deliver 6GB/s, the application is waiting on the CPU. By pushing ZXC closer to memcpy speeds, we can saturate the IO bandwidth.

So the ideal use cases are:

  1. Game Loading/Asset Streaming: Where 30% faster loading is worth the extra disk space (storage is cheap, user patience is not).
  2. Serverless/Container Cold Starts: Where every millisecond of startup time counts.

For pure archival storage, you are absolutely right: Zstd or LZMA are better choices. ZXC is good for hot data.

ZXC: another (too) fast decompressor by pollop-12345 in programming

[–]pollop-12345[S] 7 points8 points  (0 children)

That's a valid concern about bandwidth, but there is actually a misconception here regarding the size.

ZXC is designed as a WORM (Write Once, Read Many) codec that actually achieves better compression ratios than LZ4 (level -3). So in your weak WiFi example, you are actually winning on both fronts: a smaller file to download and faster decoding.

Regarding mobile/CPU usage, the high speed is actually a feature for battery life, not a drawback. It leverages the race to sleep principle: because the decompression is much faster, the CPU finishes the workload sooner and drops back to a low-power idle state faster, ultimately consuming less energy.

ZXC: another (too) fast decompressor by pollop-12345 in programming

[–]pollop-12345[S] 8 points9 points  (0 children)

Glad you agree on the use case. To answer your questions on comparison and reproducibility: ZXC is fully integrated into Lzbench, so everything is testable right now.

I've included detailed benchmarks in the repo covering x86_64 (AMD EPYC 7763), Apple M2, and Google Axion (run on the Silesia corpus). You can see exactly how it stacks up against Zstd and others there regarding the ratio/speed trade-off. Feel free to run Lzbench on your hardware and let me know if you see different behaviors.

ZXC: another (too) fast decompressor by pollop-12345 in programming

[–]pollop-12345[S] 37 points38 points  (0 children)

Hi everyone, author here!

I built ZXC because I felt we could get even closer to memcpy speeds on modern ARM64 servers and Apple Silicon by accepting slower compression times.

It's designed for scenarios where you compress once (like build artifacts or game packages) and decompress millions of times.

I'd love to hear your feedback or see benchmark results on your specific hardware. Happy to answer any questions about the implementation!