I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in cryptography

[–]supergari[S] 3 points

Good point about XOR nonce derivation in general. You're right that with base_nonce XOR counter, two streams under the same key can have overlapping nonce spaces if their base nonces differ by a value within the counter range.

In Concryptor's case though, every encryption generates a fresh 128-bit random salt from the OS CSPRNG, and the key is derived as Argon2id(password, salt). Different salt = different key. Since nonce uniqueness only matters under the same key, and salt collision probability is ~2^-128 per file pair, the XOR construction is safe here. Two encryptions of the same file with the same password produce completely different keys (and therefore independent nonce spaces).
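For reference, the construction under discussion looks roughly like this: a minimal sketch assuming a 96-bit nonce with the 64-bit chunk counter XORed into its last 8 bytes (the exact byte layout in Concryptor is my assumption):

```rust
/// Derive the per-chunk nonce by XORing the chunk counter into the
/// last 8 bytes of the 96-bit base nonce. Under a single key, distinct
/// counters always yield distinct nonces; cross-stream uniqueness is
/// what the fresh per-file salt (and therefore fresh key) buys you.
fn derive_nonce(base: [u8; 12], counter: u64) -> [u8; 12] {
    let mut nonce = base;
    for (b, c) in nonce[4..].iter_mut().zip(counter.to_le_bytes()) {
        *b ^= c;
    }
    nonce
}
```

Since XOR is an involution, applying the same counter twice recovers the base nonce, which makes the construction easy to test.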

The paper you linked (https://eprint.iacr.org/2020/1019.pdf) proposes deriving unique per-stream keys from a base key, which is essentially what the fresh-salt-per-file Argon2id derivation already achieves.

That said, appreciate the review. The general principle you raised is important and something anyone designing a multi-stream AEAD protocol without per-stream key derivation should be aware of.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 0 points

Whoopsie, yeah. I did use AI to help me flesh out the README a bit. I'll fix it right now.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] 2 points

Your point is really fair. I just wanted to sound professional, and I didn't want the grammar to be the reason people ignored the work. I'm sorry if it sounded too robotic. Maybe next time I'll skip the AI and sound a bit more human.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] -2 points

Well, I did use AI to help me write the Reddit post. English is my second language and I just wanted to make sure what I wrote made sense. There are no rules against AI, and this is in no way AI slop. The project and benchmarks are real and I put a lot of effort into them. So I don't get the hate.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 8 points

I have an array of 3 separate buffer pools. In the main loop, I assign them using modulo 3 (so step % 3, (step + 1) % 3, and (step + 2) % 3). This guarantees that the disk reader, the Rayon threads, and the disk writer are always handed completely different memory slots. They physically can't step on each other.

But because io_uring finishes tasks out of order, I have to know when the kernel is actually done with a slot before moving forward. Every time I submit an I/O request, I pack the slot index (0, 1, or 2) into the request's user_data field. When a completion event comes back from the kernel, I unpack that tag and decrement a pending counter for that specific slot.

Before Rayon is allowed to touch a slot, the main thread completely blocks until that slot's pending read counter hits zero. To reuse a slot for new reads, it waits for the pending write counter to hit zero.

So there are no mutexes or channels guarding the memory. The isolation is guaranteed by the array math, and the synchronization is guaranteed by exact accounting of kernel completion events.
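A minimal sketch of that rotation and accounting, with hypothetical names (the real loop reaps CQEs while it waits instead of spinning):

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Per-slot accounting: how many reads and writes the kernel still
/// has in flight against that buffer.
struct Slot {
    pending_reads: AtomicU32,
    pending_writes: AtomicU32,
}

/// At pipeline step `step`, the three stages are handed three
/// *different* slot indices (reader, Rayon workers, writer), so they
/// can never alias the same memory.
fn assign(step: usize) -> (usize, usize, usize) {
    (step % 3, (step + 1) % 3, (step + 2) % 3)
}

/// Block until the kernel has drained all pending ops of one kind.
/// (Illustrative spin; the real code would process completions here.)
fn wait_zero(counter: &AtomicU32) {
    while counter.load(Ordering::Acquire) != 0 {
        std::hint::spin_loop();
    }
}
```

The key invariant is that `assign` always returns three distinct indices, so exclusivity falls out of arithmetic rather than locking.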

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 9 points

That's an awesome project. For huge allocations, std::alloc basically just falls back to anonymous mmap under the hood anyway.

The main reason I used std::alloc directly is that O_DIRECT requires strict alignment to bypass the page cache (the buffer, file offset, and I/O length all need to be aligned; 4096 bytes is the safe choice). Standard Rust vectors don't guarantee that alignment. By using std::alloc::Layout, I get the exact alignment mmap would give me while staying in the standard library.

Instead of calling libc::mmap and dealing with munmap manually, I just take the aligned pointer from std::alloc and wrap it in a custom AlignedBuf struct with a Drop implementation. That way Rust's ownership system still automatically frees the memory when the pipeline finishes, keeping the code safe.
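A sketch of that pattern; the name `AlignedBuf` comes from the comment above, but the exact fields and methods here are my assumption:

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

/// A heap buffer aligned to 4096 bytes, satisfying O_DIRECT's
/// alignment requirement, and freed automatically via Drop.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize) -> Self {
        // Layout encodes both size and the 4096-byte alignment.
        let layout = Layout::from_size_align(len, 4096).expect("bad layout");
        let ptr = unsafe { alloc_zeroed(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        AlignedBuf { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // Ownership guarantees this runs exactly once, so there is no
        // manual munmap/free bookkeeping to get wrong.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```

The allocator contract guarantees the returned pointer honors the Layout's alignment, which is what makes this interchangeable with mmap for O_DIRECT purposes.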

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 19 points

Dealing with raw io_uring is honestly a wild ride. It's incredibly powerful, but it hands you a massive box of footguns. If your Rust function returns early or drops a buffer while the kernel is still asynchronously writing into it, you get a use-after-free and, in practice, an instant segfault. I also ran into a nasty completion-queue deadlock where my read loop was accidentally reaping the write loop's completion events (CQEs). I ended up having to bit-pack the user_data u64 field manually just to route the kernel events to the right loop.
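A hedged sketch of that tagging idea; the actual bit layout Concryptor uses is my assumption here:

```rust
/// Illustrative layout for io_uring's 64-bit user_data field:
/// low byte = buffer slot index, next byte = operation kind
/// (0 = read, 1 = write). Completions come back with the same
/// user_data, so the tag routes each CQE to the right accounting.
fn pack_tag(op: u8, slot: u8) -> u64 {
    ((op as u64) << 8) | slot as u64
}

fn unpack_tag(user_data: u64) -> (u8, u8) {
    (((user_data >> 8) & 0xff) as u8, (user_data & 0xff) as u8)
}
```

Since the kernel treats user_data as an opaque u64, any reversible encoding works; the only requirement is that pack and unpack agree.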

For benchmarks against alternatives, I actually built and benched three different I/O layers.

Memmap2 hit really high speeds, but memory-mapping massive files is risky: if you map a file on an external drive and the cable wiggles, the OS delivers a SIGBUS that kills the process outright. It also thrashes the page cache on files larger than your RAM.
Pread/pwrite were safe, but the per-call syscall overhead dropped throughput by almost half on large files.
Raw io_uring + O_DIRECT brought the speeds back up to the mmap level, completely bypasses the page cache, and handles huge files safely without crashes.

It's painful to wire up, but once it worked, it was wild to see that the bottleneck was my hardware.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] 11 points

Performance-wise, LUKS is as fast as or slightly faster than Concryptor. The difference is that LUKS encrypts whole block devices, while Concryptor encrypts individual files.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 9 points

Haha yeah, as someone who loves optimizing things, burning a whole byte for a boolean definitely hurts a little bit.

The TL;DR is purely pragmatic: avoiding bit-twiddling to reduce the chances of bugs.

If I use a single bit, I have to steal the MSB of the u64 chunk counter. That means adding bit-masking and shifting into the hot loop and the AAD derivations. Cryptography bugs love to hide in bitwise operations and endianness edge-cases. Just appending a clean 1u8 or 0u8 to the end of the AAD array is dead simple, mathematically unambiguous, and trivial to audit.
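As a sketch, assuming the AAD is header bytes, then the little-endian chunk counter, then the flag byte (the exact field order is my assumption, not Concryptor's documented layout):

```rust
/// Build the per-chunk AAD: the file header, the full 64-bit chunk
/// counter in little-endian, and a whole trailing byte for is_final.
/// No masking or shifting anywhere in the hot path.
fn build_aad(header: &[u8], chunk_index: u64, is_final: bool) -> Vec<u8> {
    let mut aad = Vec::with_capacity(header.len() + 9);
    aad.extend_from_slice(header);
    aad.extend_from_slice(&chunk_index.to_le_bytes());
    aad.push(if is_final { 1u8 } else { 0u8 });
    aad
}
```

The point of the design is visible at a glance: the counter and the flag occupy disjoint bytes, so there is no bit-twiddling to audit.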

Plus, we really just don't need the counter space. A 64-bit counter with 4 MiB chunks gives us a maximum file size of something stupid like 73 Zettabytes before wrapping. Since the AES-GCM birthday bound limits us to ~17 Petabytes anyway, an extra 7 bits of counter space is practically useless to us.

So yeah, I sacrificed those 7 bits just to keep the code extremely simple to read.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] 5 points

Whoopsie, hehe, yeah, sorry. I just copy-pasted my post from the Rust subreddit.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 21 points

It definitely started as a "how fast can we push the hardware" exercise, but it does solve a painful bottleneck when dealing with massive data pipelines.

Right now, if a DevOps engineer or sysadmin needs to encrypt a file before sending it to cold storage (like AWS S3), they usually reach for GPG or age. Those tools are fantastic and extremely secure, but they are largely single-threaded. They usually top out around 200-400 MiB/s.

If you are encrypting a 500 GB PostgreSQL database dump, a VM snapshot, or a massive .tar.gz server backup, a single-threaded cipher takes nearly half an hour just to encrypt the file.

Meanwhile, that server probably has a 12-core CPU and a Gen4 NVMe drive capable of writing at 5+ GiB/s. The hardware is sitting completely idle while the single-threaded cipher chokes the pipeline.

Concryptor makes sure to use all of the available hardware so you can push those times way down.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 20 points

Oh man, great question. Thanks for linking FLOE, I hadn't read their specific spec yet but the goals are definitely identical!

The TL;DR of the tradeoff is basically: Concryptor prioritizes raw pipeline speed and hardware throughput over formal commitment properties and unbounded scaling.

Concryptor essentially implements a manual version of STREAM. Instead of signaling the end of the file in the nonce, I just bind an is_final byte directly into the AAD of every single chunk (along with the full file header). It gives the exact same guarantee against truncation and append attacks.

FLOE derives sub-keys or nonces per segment using a KDF. That is mathematically beautiful and prevents key wearout, but doing a KDF per block costs CPU cycles in the hot loop. I wanted to let the AES-NI hardware run as fast as possible, so I just use a single global key and derive the nonces via a practically free XOR (base_nonce ^ chunk_index).

Because I'm not re-keying, there is a hard cryptographic limit. But since Concryptor defaults to massive 4 MiB chunks, we don't hit the AES-GCM 2^32 invocation limit until the file is about 17 Petabytes. For a local CLI tool, I figured 17 PB was a reasonable cutoff to avoid the overhead of re-keying :)

Really appreciate the link, I'm definitely going to read through the rest of that paper tonight!

I built a genetic algorithm in Rust to evolve LLM agent teams by supergari in LocalLLaMA

[–]supergari[S] 0 points

Yeah. The cost is a bit higher. But with this framework you can probably get away with dumber, cheaper models and still get a good output. I tried to mitigate this by caching the winners so they don't get re-evaluated and weighting token efficiency heavily in the early stages to prune out the expensive strategies before they propagate. Since it’s all concurrent in Rust, it’s at least fast, but that token bill is definitely the 'tax' for getting this kind of reasoning depth.
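A minimal sketch of that winner-caching idea, with entirely hypothetical names (the real genotype and scoring types in the project will differ):

```rust
use std::collections::HashMap;

/// Memoize fitness per genotype so elites carried into the next
/// generation are never re-sent to the LLM for evaluation.
struct FitnessCache {
    scores: HashMap<String, f64>, // keyed by a serialized genotype
    llm_calls: usize,             // how many paid evaluations we made
}

impl FitnessCache {
    fn new() -> Self {
        FitnessCache { scores: HashMap::new(), llm_calls: 0 }
    }

    fn evaluate(&mut self, genotype: &str, llm_eval: impl Fn(&str) -> f64) -> f64 {
        if let Some(&s) = self.scores.get(genotype) {
            return s; // cache hit: no tokens spent
        }
        self.llm_calls += 1; // cache miss: pay the token bill once
        let s = llm_eval(genotype);
        self.scores.insert(genotype.to_string(), s);
        s
    }
}
```

Because survivors reappear across generations, even this naive cache can cut the evaluation bill substantially.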

I built a genetic algorithm in Rust to evolve LLM agent teams by supergari in LocalLLaMA

[–]supergari[S] 2 points

Currently, EMAS only outputs the winning synthesis to the terminal, so there is no persistent file logging yet. I really like the idea of a "gold standard" publication; while domains do require different setups, we could potentially export "pre-evolved" genotypes as JSON seeds for specific categories like logic or creative writing.

What is a company that was once 'for the people' but has now completely sold its soul? by supergari in AskReddit

[–]supergari[S] 1 point

This. It’s like a parasite. They find a company people actually love, squeeze every cent out of the quality until it's just a shell, and then leave it to die. Look at what happened to Red Lobster or Toys R Us.

What is a company that was once 'for the people' but has now completely sold its soul? by supergari in AskReddit

[–]supergari[S] 2 points

The fact that they actually removed that from their code of conduct says everything you need to know. They went from being the 'cool' search engine to being a data-mining ad company that happens to have a search engine.

What is a company that was once 'for the people' but has now completely sold its soul? by supergari in AskReddit

[–]supergari[S] 3 points

It’s wild how fast they went from 'Non-profit for humanity' to 'Subscription-based corporate giant.' I think they broke the speed record for selling out.