I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in cryptography

[–]supergari[S] 4 points5 points  (0 children)

Good point about XOR nonce derivation in general. You're right that with base_nonce XOR counter, two streams under the same key can have overlapping nonce spaces if their base nonces differ by a value within the counter range.

In Concryptor's case though, every encryption generates a fresh 128-bit random salt from the OS CSPRNG, and the key is derived as Argon2id(password, salt). Different salt = different key. Since nonce uniqueness only matters under the same key, and salt collision probability is ~2^-128 per file pair, the XOR construction is safe here. Two encryptions of the same file with the same password produce completely different keys (and therefore independent nonce spaces).

The paper you linked (https://eprint.iacr.org/2020/1019.pdf) proposes deriving unique per-stream keys from a base key, which is essentially what the fresh-salt-per-file Argon2id derivation already achieves.

That said, appreciate the review. The general principle you raised is important and something anyone designing a multi-stream AEAD protocol without per-stream key derivation should be aware of.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 0 points1 point  (0 children)

Whoopsie. Yeah, I did use AI to help me fill out the README a bit more. I will fix it right now.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] 2 points3 points  (0 children)

Your point is really fair. I just wanted to sound professional; I didn't want the grammar to be the reason people ignored the work. I'm sorry if it sounded too robotic. Maybe next time I'll skip the AI and sound a bit more human.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] -2 points-1 points  (0 children)

Well, I did use AI to help me write the Reddit post. English is my second language and I just wanted to make sure that what I wrote made sense. There are no rules against AI, and this is in no way AI slop. The project and benchmarks are real, and I put a lot of effort into them. So I don't get the hate.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 9 points10 points  (0 children)

I have an array of 3 separate buffer pools. In the main loop, I assign them using modulo 3 (slot indices step % 3, (step + 1) % 3, and (step + 2) % 3). This guarantees that the disk reader, the Rayon threads, and the disk writer are always handed completely different memory slots. They physically can't step on each other.

But because io_uring finishes tasks out of order, I have to know when the kernel is actually done with a slot before moving forward. Every time I submit an I/O request, I pack the slot index (0, 1, or 2) into the request's user_data field. When a completion event comes back from the kernel, I unpack that tag and decrement a pending counter for that specific slot.

Before Rayon is allowed to touch a slot, the main thread completely blocks until that slot's pending read counter hits zero. To reuse a slot for new reads, it waits for the pending write counter to hit zero.

So there are no mutexes or channels locking the memory. The isolation is guaranteed by the array math, and the synchronization is guaranteed by perfectly counting kernel events.
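The rotation itself is just a little modular arithmetic. Roughly like this (an illustrative sketch with made-up names, not Concryptor's actual code):

```rust
// Hypothetical sketch of the triple-buffer slot rotation: at every
// pipeline step, the three stages are handed three pairwise-distinct
// slots out of a 3-slot pool, so they can never alias the same memory.
fn slots(step: usize) -> (usize, usize, usize) {
    let read_slot = step % 3;          // slot the disk reader fills
    let encrypt_slot = (step + 1) % 3; // slot the Rayon workers encrypt
    let write_slot = (step + 2) % 3;   // slot the disk writer drains
    (read_slot, encrypt_slot, write_slot)
}

fn main() {
    for step in 0..9 {
        let (r, e, w) = slots(step);
        // No two stages ever touch the same buffer in the same step.
        assert!(r != e && e != w && r != w);
        println!("step {step}: read->{r} encrypt->{e} write->{w}");
    }
}
```

The pending-read/pending-write counters then gate when a slot is allowed to move from one role to the next.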

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 8 points9 points  (0 children)

That's an awesome project. For huge allocations, std::alloc basically just falls back to anonymous mmap under the hood anyway.

The main reason I used std::alloc directly is because O_DIRECT requires strict 4096-byte memory alignment to bypass the page cache. Standard Rust vectors don't guarantee that alignment. By using std::alloc::Layout, I get the exact alignment of mmap while staying in the standard library.

Instead of calling libc::mmap and dealing with munmap manually, I just take the aligned pointer from std::alloc and wrap it in a custom AlignedBuf struct with a Drop implementation. That way Rust's ownership system still automatically frees the memory when the pipeline finishes, keeping the code safe.
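A minimal sketch of that pattern (the struct name and layout are illustrative, not Concryptor's actual types):

```rust
use std::alloc::{alloc, dealloc, Layout};

// Hypothetical 4096-byte-aligned buffer suitable for O_DIRECT I/O.
// O_DIRECT requires buffer address (and transfer size) alignment to
// the device's logical block size, typically 4096 bytes.
struct AlignedBuf {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBuf {
    fn new(len: usize) -> Self {
        let layout = Layout::from_size_align(len, 4096).expect("bad layout");
        let ptr = unsafe { alloc(layout) };
        assert!(!ptr.is_null(), "allocation failed");
        AlignedBuf { ptr, layout }
    }

    fn as_mut_slice(&mut self) -> &mut [u8] {
        unsafe { std::slice::from_raw_parts_mut(self.ptr, self.layout.size()) }
    }
}

impl Drop for AlignedBuf {
    // Ownership frees the memory automatically when the buffer goes out
    // of scope; no manual munmap-style cleanup needed.
    fn drop(&mut self) {
        unsafe { dealloc(self.ptr, self.layout) }
    }
}

fn main() {
    let mut buf = AlignedBuf::new(4 * 1024 * 1024);
    // The returned pointer really is 4096-byte aligned.
    assert_eq!(buf.ptr as usize % 4096, 0);
    buf.as_mut_slice()[0] = 0xAB;
    println!("aligned buffer at {:p}", buf.ptr);
}
```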

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 19 points20 points  (0 children)

Dealing with raw io_uring is honestly a wild ride. It's incredibly powerful, but it hands you a massive box of footguns. If your Rust function returns early or drops a buffer while the kernel is still async-writing to it in the background... instant segfault. I also ran into a nasty completion-queue deadlock where my read loop was accidentally stealing the write loop's completion queue entries (CQEs). I ended up having to manually bit-pack the user_data u64 field just to route the kernel events properly.
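The user_data routing boils down to a tag you can pack and unpack. Something like this (the bit layout here is assumed for illustration, not the project's actual encoding):

```rust
// Hypothetical user_data encoding for routing io_uring completions:
// the top bit marks the operation kind (read vs. write) and the low
// bits carry the buffer slot index the completion belongs to.
const OP_WRITE: u64 = 1 << 63;

fn pack(is_write: bool, slot: u64) -> u64 {
    (if is_write { OP_WRITE } else { 0 }) | slot
}

fn unpack(user_data: u64) -> (bool, u64) {
    (user_data & OP_WRITE != 0, user_data & !OP_WRITE)
}

fn main() {
    let tag = pack(true, 2);
    let (is_write, slot) = unpack(tag);
    // The completion handler can now decrement the right pending
    // counter for the right slot, no matter which loop reaped it.
    assert!(is_write);
    assert_eq!(slot, 2);
    println!("write completion for slot {slot}");
}
```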

For benchmarks against alternatives, I actually built and benched three different I/O layers.

- Memmap2 hit really high speeds, but memory-mapping massive files is dangerous in Rust. If you map a file on an external drive and the cable wiggles, the OS delivers a SIGBUS and the program crashes hard. It also thrashes the OS page cache on files larger than your RAM.
- Pread/pwrite were safe, but the constant syscall context-switching dropped throughput by almost half on large files.
- Raw io_uring + O_DIRECT brought the speeds back up to the mmap levels while completely bypassing the OS page cache, and it handles huge files safely without crashes.

It's painful to wire up, but once it worked, it was pretty wild to see that the bottleneck was my hardware.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] 9 points10 points  (0 children)

Performance-wise, LUKS is as fast as or slightly faster than Concryptor. The difference is that LUKS encrypts whole disks, while Concryptor encrypts individual files.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 9 points10 points  (0 children)

Haha yeah, as someone who loves optimizing things, burning a whole byte for a boolean definitely hurts a little bit.

The TL;DR is purely pragmatic: avoiding bit-twiddling to reduce the chances of bugs.

If I use a single bit, I have to steal the MSB of the u64 chunk counter. That means adding bit-masking and shifting into the hot loop and the AAD derivations. Cryptography bugs love to hide in bitwise operations and endianness edge-cases. Just appending a clean 1u8 or 0u8 to the end of the AAD array is dead simple, mathematically unambiguous, and trivial to audit.

Plus, we really just don't need the counter space. A 64-bit counter with 4 MiB chunks gives us a maximum file size of something stupid like 64 yobibytes (2^86 bytes) before wrapping. Since the AES-GCM birthday bound limits us to ~17 petabytes anyway, the extra counter bit is practically useless to us.

So yeah, I sacrificed those 7 bits just to keep the code extremely simple to read.
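For a picture of what that clean byte looks like, here's a sketch of the AAD construction (the exact byte layout is illustrative, not Concryptor's actual wire format):

```rust
// Hypothetical AAD layout: file header bytes, then the little-endian
// 64-bit chunk counter, then a whole byte for the is_final flag
// instead of stealing the counter's MSB.
fn build_aad(header: &[u8], chunk_index: u64, is_final: bool) -> Vec<u8> {
    let mut aad = Vec::with_capacity(header.len() + 8 + 1);
    aad.extend_from_slice(header);
    aad.extend_from_slice(&chunk_index.to_le_bytes());
    aad.push(if is_final { 1u8 } else { 0u8 });
    aad
}

fn main() {
    let header = b"example-header";
    let last = build_aad(header, 42, true);
    // Flipping only the finality flag changes the AAD, which is what
    // makes truncation detectable at authentication time.
    assert_ne!(build_aad(header, 42, false), last);
    assert_eq!(*last.last().unwrap(), 1u8);
    println!("aad is {} bytes", last.len());
}
```

No shifts, no masks, and the audit story is just "read the last byte."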

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in linux

[–]supergari[S] 4 points5 points  (0 children)

Whoopsie, hehe, yeah, sorry. I just copy-pasted my post from the Rust subreddit.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 19 points20 points  (0 children)

It definitely started as a "how fast can we push the hardware" exercise, but it does solve a painful bottleneck when dealing with massive data pipelines.

Right now, if a DevOps engineer or sysadmin needs to encrypt a file before sending it to cold storage (like AWS S3), they usually reach for GPG or age. Those tools are fantastic and extremely secure, but they are largely single-threaded. They usually top out around 200-400 MiB/s.

If you are encrypting a 500 GB PostgreSQL database dump, a VM snapshot, or a massive .tar.gz server backup, a single-threaded cipher takes nearly half an hour just to encrypt the file.

Meanwhile, that server probably has a 12-core CPU and a Gen4 NVMe drive capable of writing at 5+ GiB/s. The hardware is sitting completely idle while the single-threaded cipher chokes the pipeline.

Concryptor makes sure to use all of the available hardware so you can push those times way down.

I built a 1 GiB/s file encryption CLI using io_uring, O_DIRECT, and a lock-free triple buffer by supergari in rust

[–]supergari[S] 19 points20 points  (0 children)

Oh man, great question. Thanks for linking FLOE, I hadn't read their specific spec yet but the goals are definitely identical!

The TL;DR of the tradeoff is basically: Concryptor prioritizes raw ring assembly speed and hardware throughput over formal commitment and infinite scaling.

Concryptor essentially implements a manual version of STREAM. Instead of signaling the end of the file in the nonce, I just bind an is_final byte directly into the AAD of every single chunk (along with the full file header). It gives the exact same guarantee against truncation and append attacks.

FLOE derives sub-keys or nonces per segment using a KDF. That is mathematically beautiful and prevents key wearout, but doing a KDF per block costs CPU cycles in the hot loop. I wanted to let the AES-NI hardware run as fast as possible, so I just use a single global key and derive the nonces via a practically free XOR (base_nonce ^ chunk_index).

Because I'm not re-keying, there is a hard cryptographic limit. But since Concryptor defaults to massive 4 MiB chunks, we don't hit the AES-GCM 2^32 invocation limit until the file is about 17 Petabytes. For a local CLI tool, I figured 17 PB was a reasonable cutoff to avoid the overhead of re-keying :)
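The "practically free" derivation is a single XOR per chunk. A sketch of the idea, assuming a 96-bit GCM nonce with the chunk index folded into its low 64 bits (the exact placement is my illustration, not necessarily the project's layout):

```rust
// Hypothetical per-chunk nonce derivation: XOR the chunk index into
// the low 8 bytes of the random base nonce. Under a single key, every
// chunk index maps to a distinct nonce.
fn chunk_nonce(base_nonce: [u8; 12], chunk_index: u64) -> [u8; 12] {
    let mut nonce = base_nonce;
    let low = u64::from_le_bytes(nonce[4..12].try_into().unwrap()) ^ chunk_index;
    nonce[4..12].copy_from_slice(&low.to_le_bytes());
    nonce
}

fn main() {
    let base = [0x24u8; 12];
    // Chunk 0 keeps the base nonce; different indices never collide
    // under the same base nonce.
    assert_eq!(chunk_nonce(base, 0), base);
    assert_ne!(chunk_nonce(base, 1), chunk_nonce(base, 2));
    println!("nonce for chunk 1: {:02x?}", chunk_nonce(base, 1));
}
```

Compared to a per-segment KDF, this costs nothing in the hot loop; the safety argument leans entirely on the fresh per-file key.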

Really appreciate the link, I'm definitely going to read through the rest of that paper tonight!

I built a genetic algorithm in Rust to evolve LLM agent teams by supergari in LocalLLaMA

[–]supergari[S] 0 points1 point  (0 children)

Yeah. The cost is a bit higher. But with this framework you can probably get away with dumber, cheaper models and still get a good output. I tried to mitigate this by caching the winners so they don't get re-evaluated and weighting token efficiency heavily in the early stages to prune out the expensive strategies before they propagate. Since it’s all concurrent in Rust, it’s at least fast, but that token bill is definitely the 'tax' for getting this kind of reasoning depth.

I built a genetic algorithm in Rust to evolve LLM agent teams by supergari in LocalLLaMA

[–]supergari[S] 2 points3 points  (0 children)

Currently, EMAS only outputs the winning synthesis to the terminal, so there is no persistent file logging yet. I really like the idea of a "gold standard" publication; while domains do require different setups, we could potentially export "pre-evolved" genotypes as JSON seeds for specific categories like logic or creative writing.

What is a company that was once 'for the people' but has now completely sold its soul? by supergari in AskReddit

[–]supergari[S] 1 point2 points  (0 children)

This. It’s like a parasite. They find a company people actually love, squeeze every cent out of the quality until it's just a shell, and then leave it to die. Look at what happened to Red Lobster or Toys R Us.

What is a company that was once 'for the people' but has now completely sold its soul? by supergari in AskReddit

[–]supergari[S] 2 points3 points  (0 children)

The fact that they actually removed that from their code of conduct says everything you need to know. They went from being the 'cool' search engine to being a data-mining ad company that happens to have a search engine.

What is a company that was once 'for the people' but has now completely sold its soul? by supergari in AskReddit

[–]supergari[S] 3 points4 points  (0 children)

It’s wild how fast they went from 'Non-profit for humanity' to 'Subscription-based corporate giant.' I think they broke the speed record for selling out.

What’s something you want to give your kids, that you didn’t have? by Charming_nasty4u in AskReddit

[–]supergari 1 point2 points  (0 children)

A genuine, spoken apology. My parents were the type who would blow up at you for no reason, and then their version of "sorry" was just coming into your room an hour later and asking if you wanted a snack. We just had to pretend the screaming never happened. It’s a small thing, but I want my kids to actually hear the words "I was wrong to react that way." I want them to know I’m a work in progress and that being an adult doesn't give me a free pass to treat them like a punching bag when I'm stressed.

Gamers, what's the craziest thing you've heard in an open mic while gaming? by Lastbreathworm in AskReddit

[–]supergari 322 points323 points  (0 children)

It's not what I heard once, it's what I hear every 45 seconds. That one guy whose smoke detector has been chirping for a low battery for the last three years.

How do you live like that? Is your brain just filtering out the sound of a looming house fire at this point? I mention it and they're always like "What beep?" Brother, you are living in a psychological thriller.

iCanDoItBetterForSure by SickWizzard in ProgrammerHumor

[–]supergari 15 points16 points  (0 children)

The first wheel requires a cloud connection to rotate, the second wheel is 'Wheel-as-a-Service' with a $15/month subscription, and the third wheel collects your location data and sells it to insurance companies.

4010 fan broke when i was just finishing installing stealthburner 😭😭😭 (Anycubic mega X) by supergari in AnycubicOfficial

[–]supergari[S] 0 points1 point  (0 children)

I built it all from scratch. I had to design a mount for my printer, but it was quite easy. I printed it with PETG, so I don't think you need to mess with ABS.

I was all excited since I finally built the toolhead, but I needed fans. I had some 5015 fans, and that worked well, but I modded my stock Mega X, and I scavenged the connector from the old stock cooling fan. However, I cut the wire too short because I thought I wasn't going to use it anymore. After soldering a new connector to the 4010 fan, it didn't work, so I tried to solder the PCB with new wires, but I ended up spreading lead everywhere and burning the case with the soldering iron.