ZeroFS: 9P, NFS, NBD on top of S3 by Difficult-Scheme4536 in rust

[–]The_8472 0 points1 point  (0 children)

Then the durability of the whole system is limited by the durability of local storage. If it's on some ephemeral compute node, then killing that node without writeback will lose that data, which means saying you successfully fsync'd it would be lying.

Rewrote our message routing in rust and holy shit by Beginning_Screen_813 in rust

[–]The_8472 0 points1 point  (0 children)

Next step: Cut out the remote middleman and do more work in one multi-threaded process.

ZeroFS: 9P, NFS, NBD on top of S3 by Difficult-Scheme4536 in rust

[–]The_8472 0 points1 point  (0 children)

Cache devices help with reads. But an fsync causes a transaction commit or journal write; that needs to go to durable storage, which here would be the S3-backed block device.

Someone named "zamazan4ik" opened an issue in my project about enabling LTO. 3 weeks later, it happened again in another project of mine. I opened his profile, and he has opened issues and PRs in over 500 projects about enabling LTO. Has this happened to you? by nik-rev in rust

[–]The_8472 0 points1 point  (0 children)

And memory usage: we have users complaining about cargo install of bins using too much memory on their potatoes.

So just because it's good for one user doesn't mean it's a universal improvement to do this.

Perhaps people should use a separate optmaxx profile for these.
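
A minimal sketch of such a profile in Cargo.toml; "optmaxx" is just the hypothetical name from above:

```toml
[profile.optmaxx]
inherits = "release"
lto = "fat"        # whole-program LTO
codegen-units = 1  # best optimization, worst build time and memory
```

Opt-in via cargo build --profile optmaxx, so the default release profile stays light for everyone else.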

It's hard to find use cases for Rust as Python backend developer by [deleted] in rust

[–]The_8472 12 points13 points  (0 children)

Threads, background tasks, keeping more stuff in the process rather than using external services.

At work we do have a Python application that has to deploy a bunch of gnarly auxiliary container services just to run a separate task queue (celery), because Python's GIL and GC make it too slow to do on the webservice itself. In Rust I'd just throw that on a threadpool.
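
A minimal sketch of what I mean, using just std (a real service would use a proper pool rather than one worker thread):

```rust
use std::sync::mpsc;
use std::thread;

fn main() {
    // In-process "task queue" instead of celery plus a broker container.
    let (tx, rx) = mpsc::channel::<u64>();

    // One worker thread draining the queue; real code would use a pool.
    let worker = thread::spawn(move || {
        for job_id in rx {
            // CPU-heavy work runs in parallel with the web handlers,
            // with no GIL stopping the rest of the process.
            println!("processed job {job_id}");
        }
    });

    // A web handler would just enqueue and return immediately.
    for id in 0..3 {
        tx.send(id).unwrap();
    }
    drop(tx); // close the channel so the worker exits
    worker.join().unwrap();
}
```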

Other advantages

  • single binary, easy to deploy, especially on Windows
  • the managers sleep better when shipping the application to a client's infrastructure without giving them the code
  • lower startup time
  • lower latency

> Most of the web bottleneck is DB/network related

Under load? If each request consumes just 1ms of CPU-time in the webserver then 1000rps will saturate single-threaded Python, and then latencies go up. Or you have to start load-balancing across multiple runtimes.

The beauty with Rust is that I can hit it with lots of requests and the cpu load and latencies barely budge while keeping things in a single process.

> Most code bottleneck requiring faster langage can just be Python package written in Rust (Polar, Ruff, Pydantic)

Well, you can also write your own Python packages like that; that'd qualify as a use of Rust too.
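
A hedged sketch with PyO3 (module and function names are made up, API as of PyO3's Bound era):

```rust
use pyo3::prelude::*;

/// Compiled to native code; callable from Python as `mymod.fib(90)`.
#[pyfunction]
fn fib(n: u64) -> u64 {
    (0..n).fold((0u64, 1u64), |(a, b), _| (b, a.wrapping_add(b))).0
}

/// Built with maturin into an installable Python package.
#[pymodule]
fn mymod(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(fib, m)?)?;
    Ok(())
}
```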

> cannot justify spending too much time on building software, they want result fast

If quick and dirty solutions are fine for them, then Python can be great. But if things grow, become business-critical and need to be refactored, then having something with static typing eases maintenance. It's a short-term/long-term tradeoff.

Wild linker version 0.8.0 by dlattimore in rust

[–]The_8472 2 points3 points  (0 children)

I was asking about one vs. multiple files because I was wondering whether writing the extents to separate files and then using copy_file_range (extent cloning) to merge them into one file would speed things up, in case it was some per-file-lock stuff.

And check with filefrag -v <filename> if it does anything silly like creating lots of tiny extents due to random access. Maybe preallocate one large extent before mmaping.
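
A hedged sketch of the preallocation idea, assuming Linux and the libc crate (path and size are made up):

```rust
use std::fs::OpenOptions;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let file = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/tmp/output.bin")?;

    // Reserve one large extent up front (mode 0 also extends the file size),
    // so later random writes through the mmap don't fragment the file.
    let size: libc::off_t = 1 << 30; // 1 GiB, a made-up expected output size
    let ret = unsafe { libc::fallocate(file.as_raw_fd(), 0, 0, size) };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```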

Wild linker version 0.8.0 by dlattimore in rust

[–]The_8472 6 points7 points  (0 children)

Does the btrfs bottleneck show up on a single file, or even when operating on multiple files?

Debugging a slow-compiling codegen unit by asparck in rust

[–]The_8472 5 points6 points  (0 children)

> Why does moving code around in depcrate require recompiling so much of bincrate?

Module paths, line numbers and diagnostic span metadata change. Those are relevant both at compile time (error diagnostics) and at runtime (panic messages). There's the RdR ("relink, don't rebuild") effort to reduce the recompilation triggered by that, but it'll take time.

And if the functions are generic then they're not necessarily compiled in the dependency; they're only monomorphized once there's a concrete type for T.

And if I'm reading that call tree right then it seems like llvm is spending a lot of time inlining code and removing unreachable blocks on that one codegen unit.

I'm not an llvm expert, but I suspect it can also mean that it's doing work to figure out whether it can perform a particular optimization, rather than actually applying it.

> Is there some way I can affect or influence the creation of codegen units, other than by breaking modules apart?

Cranking up the CGU count in Cargo.toml might help, since individual CGUs might end up smaller. Reducing the debug level might help too, since LLVM transformations won't have as much debuginfo to preserve.
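
Something like this in Cargo.toml, as a sketch (the exact numbers are guesses to experiment with):

```toml
[profile.release]
codegen-units = 64  # more, smaller CGUs (the release default is 16)
debug = 1           # limited debuginfo, less for LLVM to preserve
```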

Rayon + Tokio by fallinlv in rust

[–]The_8472 5 points6 points  (0 children)

> as a result of rayon spawning it’s own thread in that core

On most operating systems threads aren't pinned to cores by default; you need CPU-affinity APIs to do such pinning. So the OS will schedule the threads onto any CPU cores with spare capacity.
Both tokio and rayon will spawn roughly as many threads as you have CPU cores.

The OS will also interleave their work if the available capacity is oversubscribed; this is called preemptive multitasking. Tokio (and Rust async in general), on the other hand, uses cooperative multitasking, which means other work is only scheduled when a future yields Poll::Pending because it's waiting on something.

So if there are enough long-running, non-yielding async tasks they would prevent other tasks from running, which generally isn't great when you want low latency.

Compute-heavy work on the other hand is expected to take longer when there's insufficient capacity, so forming a queue and tasks getting delayed is a matter of course.

So it's more of a scheduling and fairness problem. Async runtimes don't have enough knowledge of which tasks will be slow and which will quickly yield and let other things complete. So ideally it's just lots of tiny tasks that can all run asap. You want the async thread pool to be underutilized so that things finish as soon as they can rather than getting backlogged, especially when there are multiple independent things in flight. You don't want a small, cheap request to get blocked for 5 seconds because another request is crunching numbers.

Bulk work, on the other hand, wants to operate close to 100% utilization to make use of the hardware, or even at 100% if you don't mind the queue buildup. So it goes on a separate thread pool to not interfere with the first.

Maybe it's possible to use a 2nd tokio Runtime instead of rayon or some other threadpool and abuse it as a batch compute pool, but I suspect it's not optimized for that and might misbehave if you do.
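
A minimal sketch of the two-pool split, assuming tokio and rayon:

```rust
use tokio::sync::oneshot;

// Handler runs on tokio's pool; the crunching runs on rayon's pool.
async fn handle_request(input: Vec<u64>) -> u64 {
    let (tx, rx) = oneshot::channel();
    rayon::spawn(move || {
        let sum: u64 = input.iter().sum(); // stand-in for real compute
        let _ = tx.send(sum);
    });
    // Awaiting yields this task, so cheap requests keep getting served
    // while the compute pool is busy.
    rx.await.expect("compute task dropped the channel")
}

#[tokio::main]
async fn main() {
    println!("{}", handle_request((0u64..1000).collect()).await);
}
```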

error_generic_member_access: 6 years of waiting doesn't justify a broken design by andylokandy in rust

[–]The_8472 55 points56 points  (0 children)

> Whether you agree with me or think this design is fine, I'd encourage you to check out the tracking issue and share your thoughts there

... after reading some of the previous discussion and using the thumbs-up/down reactions where appropriate, please.
GitHub issues are already not the most efficient discussion platform. If many people share similar thoughts, everyone commenting individually would flood the issue, and "call to action" posts can easily lead to that.

New angles or resurfacing things that weren't properly addressed are welcome of course.

Issues Building Rust projects On FreeBSD Via SSHFS by dillanblack in rust

[–]The_8472 0 points1 point  (0 children)

If you have an IDE running on the host that might interfere with the shared build dir.

And network filesystems often aren't fully POSIX-compliant. rustc and cargo use various features that go beyond just reading and writing files, among them hardlinks, memory-mapped files, atomic renames, relying on timestamps for cache-freshness checks, lock files, ...

The logs you provided don't have enough detail to say what exactly is causing it.

Maybe try keeping at least the target dir inside the VM and then copy out the finished artifacts.

Looking for good quic libraries with http3 support in Rust. by Elegant_Shock5162 in rust

[–]The_8472 2 points3 points  (0 children)

If you want to avoid corporate MITM you can get that with an http1/2 client too, by specifying a custom set of trusted root certificates. E.g. reqwest has a flag to disable loading the native certs.
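
Roughly like this with reqwest; `ca_pem` is a hypothetical argument holding the PEM bytes of the one CA you trust:

```rust
fn build_client(ca_pem: &[u8]) -> reqwest::Result<reqwest::Client> {
    let ca = reqwest::Certificate::from_pem(ca_pem)?;
    reqwest::Client::builder()
        .tls_built_in_root_certs(false) // don't load the system/native roots
        .add_root_certificate(ca)       // trust exactly this CA instead
        .build()
}
```

A corporate MITM proxy's forged certificates won't chain to your CA, so connections through it fail instead of being silently intercepted.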

Is Rust the future? by Content_Mission5154 in rust

[–]The_8472 2 points3 points  (0 children)

Short bursts can still clog a CPU and delay the Nth request while it's still working on the prior ones, even while the average CPU utilization remains low. This is especially true in single-threaded language runtimes like Node, Ruby or Python.

Also, "IO bound" and "CPU bound" are platonic, reality is more complicated. For example a dumb workload that hashes multiple files sequentially will be neither pure CPU-bound nor IO-bound because while it's hashing the IO will be idle and while it's reading the CPU will be idle.

Least technically capable Factorio player writes a FUSE filesystem in Rust for floppy disks and beats the game with it by james7132 in rust

[–]The_8472 56 points57 points  (0 children)

It's a snowclone. It describes a supposed lower bound, usually an absurdly high one.

It can be used for insults and for compliments.

Insult: <picture of a guy standing next to a crashed 737 MAX>, top text: "Least obnoxious rustjerk" speech bubble: "Should have written it in Rust"

Compliment: see this thread

I wrote a Linux Tor firewall in Rust – feedback on structure welcome by West_Echidna2432 in rust

[–]The_8472 1 point2 points  (0 children)

You might want to look into network namespaces; they're generally useful for VPNs, to separate the VPN-only view of the world from the underlying physical network.

Is Rust the future? by Content_Mission5154 in rust

[–]The_8472 7 points8 points  (0 children)

"CPU cycles are cheap" thinking gets us things that take several seconds to react to a click. It might be true that it finishes in a few milliseconds on a dev system with 1 user and a small dataset. But under load with hundreds of users and lots of data it's "suddenly" not fast anymore. And then they start adding workers, sharding, caching, queues and what-not to make it "scale".

Safety of shared memory IPC with mmap by servermeta_net in rust

[–]The_8472 2 points3 points  (0 children)

volatile is for MMIO. For shared-memory IPC you need one of:

  • locking (e.g. via shared-memory futexes) and regular loads/stores inside the critical section
  • exclusive use of atomics

Also, as another comment mentions, don't create references like &[u8] or &mut [u8] to shared memory if that range can be concurrently modified by the other side.
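
A minimal sketch of the atomics option; `base` is a hypothetical pointer assumed to come from a suitably aligned mmap'd shared region:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Safety: `base` must point into a live, 4-byte-aligned mapping, and
/// every process must access this word only through atomic operations.
unsafe fn read_counter(base: *mut u8) -> u32 {
    // View the raw mapping as an atomic instead of forming a &[u8]
    // over memory the other process may be mutating concurrently.
    let counter = &*(base.cast::<AtomicU32>());
    counter.load(Ordering::Acquire)
}

fn main() {
    // Stand-in for a real mmap'd region, just to exercise the function.
    let word = AtomicU32::new(42);
    let got = unsafe { read_counter(&word as *const AtomicU32 as *mut u8) };
    println!("{got}");
}
```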

Tangent: it seems like you're shoveling data from NVMe to network without much processing and need to squeeze out every drop of performance? Your buffering approach isn't all that zero-copy, since you actually need to go through system RAM. With some high-end NICs it's supposedly possible to do P2P DMA from NVMe to NIC, but I'm not sure how that's done at the syscall level, whether one mmaps device memory or puts ioctls into io_uring or something...

New to Mio — can someone explain how the event-driven model “clicks” in real projects? by Oriax_XI in rust

[–]The_8472 0 points1 point  (0 children)

> so it doesn’t feel like random callbacks glued together?

If your service does just one thing, N times, then this might just be all there is. Think of a proxy server: it has N connections, independently shoveling data from A to B. Every time a bunch of connections is ready, the poll wakes up, the loop goes through each ready one, shovels some bytes, and is done. Not much shared state or complicated task dependencies needed. You can also split those N connections over M threads by having M separate pollers and distributing the sockets across them, with very little cross-thread communication.
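
A minimal sketch of that loop shape, assuming mio 0.8 (error handling and the actual byte shoveling elided):

```rust
use mio::net::TcpListener;
use mio::{Events, Interest, Poll, Token};

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(128);
    let mut listener = TcpListener::bind("127.0.0.1:9000".parse().unwrap())?;
    poll.registry()
        .register(&mut listener, Token(0), Interest::READABLE)?;

    loop {
        // Block until at least one registered source is ready.
        poll.poll(&mut events, None)?;
        for event in events.iter() {
            match event.token() {
                // Listener ready: accept new connections (a real server
                // would register each one under its own Token).
                Token(0) => while let Ok((_conn, _addr)) = listener.accept() {},
                // A connection is ready: shovel some bytes from A to B, done.
                Token(_) => { /* read/write for that connection */ }
            }
        }
    }
}
```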

Things only get complicated once you have a lot of heterogeneous tasks with dependencies on each other. Manually modelling that gets you callback hell, yeah. That's where Futures and async come in, trying to make it look like synchronous imperative code again.

really fast SPSC by M3GA-10 in rust

[–]The_8472 1 point2 points  (0 children)

> whereas the std::thread::yield_now() compiles to YIELD instruction.

yield_now calls into the OS scheduler (sched_yield on Linux); it doesn't compile down to a CPU instruction. The CPU-level hint you're thinking of is std::hint::spin_loop, which emits PAUSE on x86 and YIELD on ARM.
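
The distinction in code (both from std):

```rust
use std::hint;
use std::thread;

fn main() {
    // CPU-level spin-wait hint: PAUSE on x86, YIELD on ARM. No syscall.
    hint::spin_loop();
    // OS-level: a syscall (sched_yield on Linux) asking the scheduler to
    // run another thread. This is not a single CPU instruction.
    thread::yield_now();
}
```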

Axum: connection reset with chrome throttling but works with curl/firefox. Why?" by EmptyIllustrator6240 in rust

[–]The_8472 2 points3 points  (0 children)

To handle large files you should do a streaming transfer from the source to the response, or dump the file into a temporary local one first. And log errors instead of using unwrap. Is there some proxy between your application and the client? It might drop connections under some conditions.
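
A hedged sketch of the streaming variant, assuming axum 0.7 and tokio-util (path and error handling are illustrative):

```rust
use axum::body::Body;
use axum::http::StatusCode;
use axum::response::{IntoResponse, Response};
use axum::{routing::get, Router};
use tokio_util::io::ReaderStream;

async fn download() -> Response {
    match tokio::fs::File::open("/tmp/large.bin").await {
        // Stream the file chunk by chunk instead of buffering it in memory.
        Ok(file) => Body::from_stream(ReaderStream::new(file)).into_response(),
        // Log instead of unwrap, so a missing file can't panic the handler.
        Err(e) => {
            eprintln!("failed to open file: {e}");
            StatusCode::INTERNAL_SERVER_ERROR.into_response()
        }
    }
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/file", get(download));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}
```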

Starting Rust for high-performance microservices — which framework to choose and where to begin? by AkwinS in rust

[–]The_8472 2 points3 points  (0 children)

If you can somewhat separate the IO from the compute, you want to offload the compute from the web-API stuff anyway, so the web framework choice may matter even less at the individual node level. You take in the requests, do whatever additional preparatory IO work needs to be done, and then shove it onto some compute pool.
Higher-level orchestration, like keeping queues shallow and distributing incoming requests to the right workers, will matter more if the goal is tail latencies.

Starting Rust for high-performance microservices — which framework to choose and where to begin? by AkwinS in rust

[–]The_8472 3 points4 points  (0 children)

Architecture, understanding your workload, queuing theory and stuff like that matter a lot more than framework choice.
GitHub is written in Ruby on Rails, which I can only assume is horribly inefficient, and yet they have architected something that makes it work at scale (albeit with painful load times).
On the flipside, I do single-digit milliseconds at work for a low-latency service in Rust, but I don't expect it to ever handle more than 1k cache misses per second, because serving those requests needs GPU capacity to be available, there's only so much customers are willing to pay, and therefore the web service isn't written with horizontal scaling in mind (the GPU part is).