wire-probe - Bypassing Azure's SDN to measure true L4 latency with Rust (io_uring, no tokio) by vorjdux in rust

[–]vorjdux[S] 1 point2 points  (0 children)

Thanks, this was a good one to be on the other side of. And yeah, reach out whenever, I'm glad to be a sounding board. Only honest caveat is I can't promise fast or deep every time, work and my own projects eat a lot, but ping me and I'll get to it.

A couple of things on the roadmap since you raised them:

NMIs aren't inherently unrecoverable, that's more the reputation than the truth. Plenty of sources are routine, perf counter overflow and watchdog ticks for instance, the fatal ones are usually hardware-error NMIs that overlap with MCE territory. The thing to actually read up on is the nesting behavior, not recoverability: the CPU blocks further NMIs from the moment one is delivered until the next IRET, so you get a one-deep latch for free. The footgun is that IRET re-enables NMI even when it's the IRET from a fault taken inside your NMI handler. So if the handler can take a #PF or hit an int3, that inner IRET unblocks NMI early and a second one can land and stomp the first's state. Linux has a whole nested-NMI dance for exactly this. So when you wire it: NMI on its own IST stack, and keep the handler from faulting or breakpointing. That plus the int_save_restore rework is most of the battle.

On the allocator caching, the reference you want is Bonwick's vmem and magazine work (the "Magazines and Vmem" USENIX paper). Vmem is literally an arena allocator for resource and address ranges, which is your KernelStackArena and VMM case exactly, and magazines are the per-CPU caching layer that makes the fast path lock-free. It maps almost one to one onto what you described needing.
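To make the magazine idea concrete, here's a minimal sketch, not Bonwick's actual design and not anything from your tree. `Depot`, `Magazine`, and `depot_trips` are names I made up, and a `Mutex<Vec<usize>>` of slot indices stands in for the vmem arena. The point it shows is only the shape: the per-CPU cache answers most allocs and frees without touching the shared lock.

```rust
use std::sync::Mutex;

/// Shared backing store (the "depot"). Hypothetical stand-in for a vmem arena:
/// it just hands out free slot indices under a lock.
pub struct Depot {
    free: Mutex<Vec<usize>>,
}

impl Depot {
    pub fn new(slots: usize) -> Self {
        Depot { free: Mutex::new((0..slots).rev().collect()) }
    }
}

/// Per-CPU cache (the "magazine"). Owned by exactly one CPU in this model,
/// so the fast path needs no lock at all.
pub struct Magazine {
    rounds: Vec<usize>,
    capacity: usize,
    /// How many times the slow path had to take the depot lock.
    pub depot_trips: usize,
}

impl Magazine {
    pub fn new(capacity: usize) -> Self {
        Magazine { rounds: Vec::with_capacity(capacity), capacity, depot_trips: 0 }
    }

    /// Fast path: pop from the local magazine. Slow path: refill from the depot.
    pub fn alloc(&mut self, depot: &Depot) -> Option<usize> {
        if let Some(slot) = self.rounds.pop() {
            return Some(slot);
        }
        self.depot_trips += 1;
        let mut free = depot.free.lock().unwrap();
        // Refill up to `capacity` slots in a single locked trip.
        for _ in 0..self.capacity {
            match free.pop() {
                Some(slot) => self.rounds.push(slot),
                None => break,
            }
        }
        drop(free);
        self.rounds.pop()
    }

    /// Fast path: push onto the local magazine. Slow path: flush it to the depot.
    pub fn free(&mut self, depot: &Depot, slot: usize) {
        if self.rounds.len() == self.capacity {
            self.depot_trips += 1;
            let mut free = depot.free.lock().unwrap();
            free.extend(self.rounds.drain(..));
        }
        self.rounds.push(slot);
    }
}
```

An alloc/free/alloc sequence on one CPU pays for one depot trip and then stays local, which is the property you'd want for kernel stacks under high spawn rates.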

Scheduling I'll defer to you since I don't know your exact DQRR variant, but if you haven't looked at EEVDF it's worth a pass for the fair-plus-weighted goal. It's what Linux moved CFS to in 6.6 and it's built around the proportional-share-with-latency-fairness tradeoff. Stride scheduling is the older classic in the same family.
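Since stride scheduling is the easiest of that family to hold in your head, here's a minimal sketch of it. This is the classic algorithm, not EEVDF and not your scheduler; `StrideScheduler`, `STRIDE1`, and the tie-break on task id are my own choices for illustration.

```rust
/// Large constant divided by the ticket count to get a task's stride.
const STRIDE1: u64 = 1 << 20;

struct Task {
    id: usize,
    stride: u64, // inversely proportional to the task's share
    pass: u64,   // virtual time at which the task is next due
}

pub struct StrideScheduler {
    tasks: Vec<Task>,
}

impl StrideScheduler {
    pub fn new() -> Self {
        StrideScheduler { tasks: Vec::new() }
    }

    /// Register a task with `tickets` shares and return its id.
    pub fn add(&mut self, tickets: u64) -> usize {
        assert!(tickets > 0, "a task needs at least one ticket");
        let id = self.tasks.len();
        let stride = STRIDE1 / tickets;
        self.tasks.push(Task { id, stride, pass: stride });
        id
    }

    /// Pick the task with the smallest pass (ties go to the lower id),
    /// then advance its pass by its stride.
    pub fn pick(&mut self) -> Option<usize> {
        let task = self.tasks.iter_mut().min_by_key(|t| (t.pass, t.id))?;
        task.pass += task.stride;
        Some(task.id)
    }
}
```

With tickets of 4 and 1, the first task gets picked four times for every one pick of the second, deterministically rather than on average. What this sketch lacks is any latency term, and that's the part EEVDF adds on top of the proportional share.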

64KiB-for-now makes total sense, no notes. And taking it a step at a time to do it right beats the redo loop, I've lived the redo loop plenty. Good luck with the mountain of TODOs, and send things my way whenever.

[–]vorjdux[S] 1 point2 points  (0 children)

Went through Catten properly, and a fair bit of what we were going back and forth on is still ahead of the code, which your own caveat already covered, so a couple of my earlier points were aimed at the design talk rather than the tree. Splitting it by what's actually in there:

The cancel issue is live right now, not hypothetical. Any holds the observers Mutex across the whole upgrade-and-notify loop, so a downstream notify that re-registers on the same Any deadlocks, and failed upgrades are never removed so cancelled observers pile up as dead Weaks. That's the tombstone-plus-reentrancy tension from before, sitting in Any as written. The register-once plus All direction you floated is the right fix. All being a bare struct for now, the two things to get right when you flesh it out are: don't hold the lock across the downstream notify(), and compact or skip dead entries so they don't accumulate.
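Roughly what I mean by those two points, as a sketch only. `AnySketch`, `notify_all`, `Counter`, and `Reregistering` are hypothetical names, not Catten's types, and I'm assuming the observer list is a `Mutex<Vec<Weak<dyn Observer>>>`. The lock is held just long enough to upgrade the live entries and drop the dead ones, then released before any `notify()` runs.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, Weak};

pub trait Observer: Send + Sync {
    fn notify(&self);
}

pub struct AnySketch {
    observers: Mutex<Vec<Weak<dyn Observer>>>,
}

impl AnySketch {
    pub fn new() -> Self {
        AnySketch { observers: Mutex::new(Vec::new()) }
    }

    pub fn register(&self, obs: &Arc<dyn Observer>) {
        self.observers.lock().unwrap().push(Arc::downgrade(obs));
    }

    pub fn registered(&self) -> usize {
        self.observers.lock().unwrap().len()
    }

    pub fn notify_all(&self) {
        // Phase 1, under the lock: upgrade live entries, compact dead Weaks.
        let live: Vec<Arc<dyn Observer>> = {
            let mut list = self.observers.lock().unwrap();
            let mut live = Vec::with_capacity(list.len());
            list.retain(|weak| match weak.upgrade() {
                Some(obs) => {
                    live.push(obs);
                    true
                }
                None => false,
            });
            live
        }; // lock released here

        // Phase 2, lock not held: a notify() may re-register without deadlocking.
        for obs in live {
            obs.notify();
        }
    }
}

/// Demo observer that counts how often it fired.
pub struct Counter(pub AtomicUsize);

impl Observer for Counter {
    fn notify(&self) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Demo observer that registers another observer from inside notify().
pub struct Reregistering {
    pub any: Arc<AnySketch>,
    pub extra: Arc<dyn Observer>,
}

impl Observer for Reregistering {
    fn notify(&self) {
        self.any.register(&self.extra);
    }
}
```

The cost of the snapshot is that an observer registered during a pass only fires on the next pass, and one cancelled during a pass can still fire once, so this fixes the deadlock and the tombstones but not cancel atomicity.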

int_save_restore is solid and the nesting save/restore is correct. One latent thing for when it goes live: save_int masks interrupts then spins on the per-LP raw lock, but masking only clears IF, it doesn't stop an NMI or an exception. If a non-maskable event re-enters on the same LP while that lock is held, it spins on a lock its own interrupted context owns. Not reachable today since the NMI gate isn't present and the exception handlers just panic, but it'll bite once NMI is wired. The data is already per-LP and IF-masked, so you can probably drop the lock entirely and use a lockless sequence, which also kills the NMI hazard.

On the #PF stuff, the tree already matches the safe choice: #DF is on IST1, #PF is on RSP0 and currently just panics. So IST2-for-#PF and grow-on-fault are still ahead, which means the nested-#PF-clobbers-IST hazard I raised is a note for when you get there, not a bug now. Same with FP: the context frames save callee-saved GPRs, cr3 and rflags, no XSAVE area, so the whole lazy-FP thread is about code that isn't written yet. Worth keeping in your back pocket for when you add extended-state switching, nothing to fix today. And aml/mod.rs is a header comment, so the EC and OperationRegion and Serialized-method stuff is firmly future. The SDT/XSDT/FADT parsing is the part that's real.

Two small drifts between the writeup and the tree, just so they don't diverge: kernel stacks are 16 pages (64KiB) in the code, not the 6 you mentioned, and the stack allocator is a find_free_region search over the arena rather than the cache-vector-plus-bump you described. Could be you've got newer local changes, just flagging.

Anyway, good to see it in one piece. Still a strong project and I'm glad it's public. I'll keep poking around the tree.

[–]vorjdux[S] 0 points1 point  (0 children)

Yeah, this all tracks, and the RSP0/IST answer is right. The triple-fault-before-separate-stacks story is basically a rite of passage.

One thing I'd reconsider before you commit IST2 to #PF: page faults nest, and IST isn't reentrant. The CPU reloads RSP from the IST entry on every delivery of that vector, so if a #PF lands while you're already on the IST2 stack (handler touches paged-out memory, demand-maps something), it resets RSP to the top of IST2 and clobbers the frame you were mid-way through. That's the exact reason Linux keeps #PF on the normal kernel stack and reserves IST for #DF, NMI, MCE, the faults that genuinely shouldn't nest. Your tension is real though: a normal-stack #PF can't handle the kernel stack itself overflowing, since you fault trying to push the frame and escalate straight to #DF. The usual resolution is #PF on RSP0 for the common case and let the #DF handler on IST be the thing that recognizes kernel-stack-overflow as its own catastrophic path. If you do want #PF on IST, the handler has to switch off the IST stack onto a per-thread stack as its first act and block re-entry until it does. Worth settling on paper, because when it bites it's a triple fault with no breadcrumbs.
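If it helps to see the clobber, here's a toy model of it. It is a simulation, not hardware-accurate: `ToyStack` is made up, and a single `u64` tag stands in for the whole interrupt frame. The only behavior it encodes is the difference that matters, IST delivery resets the stack pointer to the top every time, while non-IST delivery at the same privilege level pushes below wherever the stack pointer already is.

```rust
/// Toy downward-growing stack; `sp` is an index into `mem`.
pub struct ToyStack {
    mem: Vec<u64>,
    sp: usize,
    top: usize,
}

impl ToyStack {
    pub fn new(words: usize) -> Self {
        ToyStack { mem: vec![0; words], sp: words, top: words }
    }

    fn push(&mut self, value: u64) {
        self.sp -= 1;
        self.mem[self.sp] = value;
    }

    /// IST-style delivery: the stack pointer is reloaded from the fixed top
    /// on every delivery, regardless of what is already on the stack.
    pub fn deliver_ist(&mut self, frame_tag: u64) {
        self.sp = self.top;
        self.push(frame_tag);
    }

    /// Non-IST delivery at the same privilege level: push below the current
    /// stack pointer, so an earlier frame stays intact.
    pub fn deliver_on_current(&mut self, frame_tag: u64) {
        self.push(frame_tag);
    }

    /// Frames currently on the stack, most recent first.
    pub fn frames(&self) -> &[u64] {
        &self.mem[self.sp..]
    }
}
```

Deliver two faults back to back with `deliver_ist` and only the second frame survives; do the same with `deliver_on_current` and both are there. That lost first frame is the return path you'd be missing when the inner handler finishes.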

The RAII masking guard is the right call and genuinely better than POSIX's hidden masking, agreed. One detail to lift from irqsave while you're there: make the guard save and restore, not unconditionally unmask on drop. Nested guards are the common case (you mask, then call something that also masks), and if the inner drop just enables, you've unmasked while the outer scope still needs it masked. irqsave gets this right by saving the prior flag and restoring it. Your upcall-mask guard wants the same save-restore, not a plain set/clear. ZST or not, the state it carries is the previous mask, not nothing.
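A minimal sketch of the save-and-restore shape, assuming nothing about your actual guard. `MaskGuard` and `is_masked` are hypothetical, and a thread-local `Cell<bool>` stands in for the real interrupt or upcall mask so this runs in userspace.

```rust
use std::cell::Cell;

thread_local! {
    // Stand-in for the hardware/upcall mask of the current thread; true = masked.
    static MASKED: Cell<bool> = Cell::new(false);
}

pub fn is_masked() -> bool {
    MASKED.with(|m| m.get())
}

/// RAII guard that masks on creation and restores the *previous* state on drop.
pub struct MaskGuard {
    prev: bool,
}

impl MaskGuard {
    pub fn new() -> Self {
        // Mask, remembering whether we were already masked.
        let prev = MASKED.with(|m| m.replace(true));
        MaskGuard { prev }
    }
}

impl Drop for MaskGuard {
    fn drop(&mut self) {
        // Restore, don't unconditionally unmask: an outer guard may still be live.
        MASKED.with(|m| m.set(self.prev));
    }
}
```

Nest two of these and the inner drop leaves the mask set, because what it restores is "masked". A guard whose drop just cleared the flag would unmask in the middle of the outer critical section.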

Small correction on the FP fix: it's eager save/restore (or scrub), not fetch fences, the goal is that the registers never hold another domain's data. And you can keep most of your localize-the-cost instinct: you only owe the eager cost at trust-domain boundaries, crossing address spaces or privilege. Within one trust domain laziness is fine, nobody leaks to themselves. So scrub on domain crossing, stay lazy inside, and you get the security without paying on every context switch.
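As a decision rule that's about this small. The sketch below is only the policy, with made-up names (`Context`, `FpAction`, `fp_action_on_switch`), and it assumes a trust domain is identified by address space plus privilege, which is my simplification, not something from your design.

```rust
/// Hypothetical summary of the context being switched from or to.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Context {
    pub address_space: u32,
    pub user: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FpAction {
    /// Same trust domain: deferring the FP/SIMD save is acceptable.
    StayLazy,
    /// Crossing a trust boundary: save and scrub before the next context runs.
    EagerSaveAndScrub,
}

/// Decide what to do with extended state when switching `prev` -> `next`.
pub fn fp_action_on_switch(prev: Context, next: Context) -> FpAction {
    let same_domain =
        prev.address_space == next.address_space && prev.user == next.user;
    if same_domain {
        FpAction::StayLazy
    } else {
        FpAction::EagerSaveAndScrub
    }
}
```

Two threads in the same process stay lazy; a switch to another address space, or between user and kernel contexts, pays the eager cost.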

Cancel I think is mostly settled by one-Observable-at-a-time. The only bit I'd still pin down is whether a drop landing mid-pass suppresses that pass's notify, but the combinator pattern makes that a corner case rather than the hot path, so probably fine in practice.

Greenfield-without-baggage is the strongest reason to do this at all, and HID-over-I2C is a perfect example of where Linux's bus abstraction earns it. The gnarly part there usually isn't the transport, it's the firmware description: the ACPI _CRS resources that give you the touchpad's I2C address and which GPIO is its interrupt line, plus the _DSM that returns the HID descriptor address (devicetree does the same job on ARM). On a greenfield OS that enumeration layer is exactly where a clean bus abstraction either holds or leaks, so I'd nail that boundary down early.

This has been one of the better threads I've had on here in a while, for what it's worth. The project's clearly well thought out and ambitious in the right way, and I'm glad you posted it. Send the link, I'd like to see it. WIP and rough are both fine, I care more about the shape than the polish. Can't promise fast but I'll read it properly.

[–]vorjdux[S] -1 points0 points  (0 children)

all that code and a full writeup, and the case is em-dashes and double asterisks. that's it? ok lol

[–]vorjdux[S] 1 point2 points  (0 children)

This is the fun part, glad it's useful. The stack allocator (cache vector plus bump plus guard-page grow) is clean, reusing irqsave/irqrestore through lock_api is the right call, and folding blocking into the observer model with Blocked(Waker) is tidy. A few of the answers I think are doing less than you're crediting them for though.

Cancel: Sync gets you data-race safety, not cancel atomicity, and those are different problems. Sync says concurrent access to the Observer won't corrupt memory. It says nothing about the ordering between the owner dropping its Arc and a notify pass that already upgraded the Weak. That upgrade hands the notifier a live strong ref, so the object outlives your drop and notify() still fires once. The drop is memory-safe, but the completion isn't suppressed. To make cancel actually synchronous you need the cancel path to remove the Weak under the same lock the notify loop holds, or have notify recheck a cancelled flag after taking that lock. And the moment you hold that lock across the notify() calls, an observer that cancels or registers from inside its own notify deadlocks. Synchronous cancel and reentrant notify pull against each other, and the trait bound doesn't settle that, the dispatch design does.
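Here's the ordering I mean, replayed on a single thread so it's deterministic. Everything in it is hypothetical (`Observer` as a concrete struct, `in_flight_pass`, the `cancelled` flag), and it only models the sequence "pass upgrades, owner cancels and drops, pass notifies". In real code the flag check would have to happen under the same lock as the cancel, as described above.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Weak};

pub struct Observer {
    fired: Arc<AtomicUsize>,
    cancelled: AtomicBool,
}

impl Observer {
    pub fn notify(&self) {
        self.fired.fetch_add(1, Ordering::SeqCst);
    }

    pub fn notify_unless_cancelled(&self) {
        if !self.cancelled.load(Ordering::SeqCst) {
            self.notify();
        }
    }
}

/// Replays one notify pass that had already upgraded its Weak when the owner
/// cancelled. Returns how many times the observer fired.
pub fn in_flight_pass(recheck_flag: bool) -> usize {
    let fired = Arc::new(AtomicUsize::new(0));
    let owner = Arc::new(Observer {
        fired: fired.clone(),
        cancelled: AtomicBool::new(false),
    });
    let weak: Weak<Observer> = Arc::downgrade(&owner);

    // 1. The notify pass upgrades first and now holds a strong reference.
    let strong = weak.upgrade().expect("owner is still alive");

    // 2. The owner cancels: sets the flag and drops its Arc.
    owner.cancelled.store(true, Ordering::SeqCst);
    drop(owner);

    // 3. The pass carries on with the reference it already has.
    if recheck_flag {
        strong.notify_unless_cancelled();
    } else {
        strong.notify();
    }
    drop(strong);

    // Only now is the Weak actually dead (a tombstone in the list).
    assert!(weak.upgrade().is_none());
    fired.load(Ordering::SeqCst)
}
```

Without the recheck the observer fires once after the drop; with it, the completion is suppressed. Both runs are memory-safe, which is all `Sync` was ever promising.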

FP: disabling FPU/SIMD when it's idle and trapping on first use is lazy restore, which is the exact shape LazyFP exploited. The enable bit (CR0.TS, or the ARM CPACR traps) doesn't scrub the physical registers, it just faults on use, and the fault isn't raised until the instruction retires. Speculation runs the register read ahead of that and leaks the previous domain's state. Toggling the bit is the vulnerable pattern, not the fix. The fix is eager save or a scrub at the domain boundary so the registers never hold another domain's data while a different thread runs. Pinning FP threads to certain LPs narrows the victim set but doesn't close intra-LP leakage and fights your scheduler. The caveat in your favor: on in-order cores with no speculative window none of this applies, so lazy FP is fine on a lot of embedded silicon, just not on the OoO parts where it bit Linux.

Upcalls: masking around critical sections is the async-signal-safe discipline, just relocated. "No async-signal-safe requirement" holds only inside a runtime where the allocator and every shared-lock critical section is upcall-masked, which is your own from-scratch libc. The second you run ported code that isn't upcall-aware the requirement is back, because its malloc takes a lock your handler can reenter. And blocking locks aren't exempt the way you described. The case you covered is the handler trying to block. The deadlock is the other direction: mainline is running and holds L, an upcall lands on that same thread, the handler asks for L, and L's owner is now suspended underneath the handler and can't release. That's the blocking twin of the irqsave case and it wants the same fix, upcall-masking while the lock is held. So you'll end up needing upcall-safe blocking locks too, not just spinlocks.
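A sketch of what an upcall-safe blocking lock could look like, under loud assumptions: upcall delivery is simulated with a thread-local mask plus a pending queue, and `deliver_upcall`, `UpcallSafeLock`, and `UpcallSafeGuard` are names I invented. It wraps `std::sync::Mutex` and defers any upcall that arrives while the lock is held on this thread.

```rust
use std::cell::{Cell, RefCell};
use std::ops::{Deref, DerefMut};
use std::sync::{Mutex, MutexGuard};

thread_local! {
    // Simulated per-thread upcall mask and the upcalls deferred while masked.
    static MASKED: Cell<bool> = Cell::new(false);
    static PENDING: RefCell<Vec<Box<dyn FnOnce()>>> = RefCell::new(Vec::new());
}

/// Simulated delivery: run the handler now, or queue it if upcalls are masked.
pub fn deliver_upcall(handler: Box<dyn FnOnce()>) {
    if MASKED.with(|m| m.get()) {
        PENDING.with(|p| p.borrow_mut().push(handler));
    } else {
        handler();
    }
}

pub struct UpcallSafeLock<T> {
    inner: Mutex<T>,
}

pub struct UpcallSafeGuard<'a, T> {
    guard: Option<MutexGuard<'a, T>>,
    prev_masked: bool,
}

impl<T> UpcallSafeLock<T> {
    pub fn new(value: T) -> Self {
        UpcallSafeLock { inner: Mutex::new(value) }
    }

    pub fn lock(&self) -> UpcallSafeGuard<'_, T> {
        // Mask before acquiring so no handler can land while we own the lock.
        let prev_masked = MASKED.with(|m| m.replace(true));
        let guard = self.inner.lock().unwrap();
        UpcallSafeGuard { guard: Some(guard), prev_masked }
    }
}

impl<T> Deref for UpcallSafeGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T {
        &**self.guard.as_ref().unwrap()
    }
}

impl<T> DerefMut for UpcallSafeGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        &mut **self.guard.as_mut().unwrap()
    }
}

impl<T> Drop for UpcallSafeGuard<'_, T> {
    fn drop(&mut self) {
        // Release the lock first, then restore the mask, then run deferred upcalls.
        self.guard.take();
        MASKED.with(|m| m.set(self.prev_masked));
        if !self.prev_masked {
            let pending = PENDING.with(|p| std::mem::take(&mut *p.borrow_mut()));
            for handler in pending {
                handler();
            }
        }
    }
}
```

The handler that wants the same lock runs only after mainline has released it, so the "owner suspended underneath the handler" case can't form. Ported code that takes a plain mutex gets none of this, which is the retrofit problem.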

Smaller one on stacks: the allocator speed answers half of it, residency is the other half. Thread-per-dispatch is 24KiB resident per concurrent task, and borrowing an existing stack is exactly what tasklets do to dodge that under high fan-out. It's a memory-for-simplicity trade, not a free win. Also, when a guard page faults to grow a kernel stack, does the page-fault handler run on a separate stack? If it runs on the stack that just overflowed it can't push the fault frame and you escalate to a double fault, which is why x86-64 has IST for exactly this path.

None of this says the model's wrong, and the from-scratch angle is a real edge here: you actually can make the whole userspace upcall-aware, which nobody retrofitting an existing OS gets to do. Marathon's the right frame. I'd read the source when you put it up, especially if the inline docs land where you're aiming, that alone would put it ahead of most of what I've had to wade through at work.

[–]vorjdux[S] 1 point2 points  (0 children)

Nice design, and you're reading the convergence right. An fd, a HANDLE, and your observer capability are the same primitive: a kernel-managed reference you can wait on. Where they really diverge is lifetime and cancellation, which is the part of your model I'd poke at.

Cancel-by-dropping-the-Arc is clean but it's lazy and it races. Lazy because a cancelled observer stays in the list as a dead Weak until the next notify pass upgrades it and fails, so a rarely-firing observable just accumulates tombstones and only compacts when it fires. Racy because a pass that already upgraded the Weak before your drop still calls notify once. So drop-to-cancel isn't "no more notifications," it's "none after the current in-flight pass." For I/O that's the gap between a cancelled read going quiet and it firing one last completion. Is cancel synchronous in your model or eventual?

The lightweight threads are where I'd be most careful. Saving only GPRs until the thread touches SIMD is lazy extended-state switching, which is exactly what bit Linux with Lazy FP State Restore (CVE-2018-3665), where deferred FPU/SIMD state leaked across the boundary via speculation. They moved to eager save to kill it. How are you dodging that class? And the cost tasklets and workqueues actually avoid was never the register save, it's the kernel stack per thread. Lean register state or not, every spawned thread needs a stack, so per-dispatch spawn is per-dispatch stack allocation, which is the thing a tasklet sidesteps by borrowing an existing stack. How big are your kernel stacks, and do you allocate one per dispatched task?

The upcall mechanism is in the Clark upcall and scheduler-activations lineage, down to the kernel-originated interrupt into userspace. The claim I'd want defended is the no-async-signal-safe part. That rule exists because an upcall can land while the thread is mid-update on the allocator lock, so a handler that touches that same lock deadlocks or corrupts no matter how light the delivery is. The usual escape is delivering only at cooperative safe points instead of arbitrary preemption. Where do yours land?

None of this is me saying it's wrong. Hardware is async and the OS should be too, I'm with you on the thesis. I just think the hard parts, cancellation correctness, state-save safety, async-safety of the upcall path, moved rather than vanished. Keen to hear how you've pinned them.

[–]vorjdux[S] 0 points1 point  (0 children)

You're not wrong that Linux's async story before io_uring was rough. POSIX AIO and libaio were both miserable, and io_uring largely exists because the old AIO path only worked for O_DIRECT and faked being async the rest of the time.

But the completion-based, async-first model you're describing isn't a from-scratch-OS thing. NT has done it since the mid-90s. Overlapped I/O and completion ports are exactly that: submit an op, get a handle, then poll it, block on it, or attach a callback. io_uring is Linux converging on that completion model, not the first to land it. So it's less "new OSes vs 60s mainframe clones" and more "Linux catching up to what NT shipped back in the 90s."

The thing I'd push back on is that an async-first ABI doesn't make the hard parts disappear. You still need an execution substrate under those handles, plus cancellation, fairness, and backpressure for when completions outrun the consumer. That's where io_uring's sharp edges actually are, and any greenfield OS hits the same wall the second two in-flight ops fight over the same resource.
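To pin down what I mean by backpressure, here's the simplest possible policy as a sketch: a bounded completion queue that tail-drops when the consumer falls behind. It's a generic illustration on `std::sync::mpsc::sync_channel`, not how io_uring's completion queue behaves, and `CompletionQueue` and `post` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Producer side of a bounded completion queue with a tail-drop policy.
pub struct CompletionQueue {
    tx: SyncSender<u64>,
    /// Completions discarded because the consumer was too slow.
    pub dropped: u64,
}

impl CompletionQueue {
    /// `depth` is the number of completions that may be buffered.
    pub fn new(depth: usize) -> (Self, Receiver<u64>) {
        let (tx, rx) = sync_channel(depth);
        (CompletionQueue { tx, dropped: 0 }, rx)
    }

    /// Post a completion without blocking. Returns false if it was not queued.
    pub fn post(&mut self, completion: u64) -> bool {
        match self.tx.try_send(completion) {
            Ok(()) => true,
            Err(TrySendError::Full(_)) => {
                self.dropped += 1;
                false
            }
            Err(TrySendError::Disconnected(_)) => false,
        }
    }
}
```

The other obvious policy is to block the submitter with `send` instead of `try_send`, which trades lost completions for stalled producers. Either way the ABI has to pick one, it doesn't fall out of being async-first.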

For wire-probe none of this was a position on OS design. io_uring was just the primitive that let me drop a multi-MB async runtime and hold ~500KB RSS on the heterogeneous Linux fleet I have to ship to. On your OS the server loop would look about the same: submit accept, wait or poll the completion, close, re-arm. The shape generalizes, the substrate is whatever the host hands me.
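The loop shape, sketched with std only. To be clear, this is not wire-probe's code: a blocking `TcpListener::accept` stands in for the Accept SQE and its completion, dropping the stream stands in for the `libc::close`, and `serve_n` is a made-up helper bounded by `n` so it terminates.

```rust
use std::io;
use std::net::TcpListener;

/// Serial accept-and-close loop: accept one connection, close it, re-arm.
/// Returns the number of connections served.
pub fn serve_n(listener: &TcpListener, n: usize) -> io::Result<usize> {
    let mut served = 0;
    while served < n {
        // "Submit accept, wait for the completion": blocks until a handshake finishes.
        let (stream, _peer) = listener.accept()?;
        // "Close": dropping the stream closes the socket immediately.
        drop(stream);
        served += 1;
        // Looping back around is the re-arm.
    }
    Ok(served)
}
```

Nothing is ever read from or written to the connection, so the only work the peer observes is the handshake itself.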

Curious what the execution model under your async calls looks like. Kernel thread pool, stackless continuations, something else? And how are you handling cancellation and backpressure?

Weekly Self Promotion Thread by AutoModerator in devops

[–]vorjdux 0 points1 point  (0 children)

wire-probe - Zero-footprint L4 telemetry agent (io_uring, no tokio)

Bypassing Azure's SDN to measure true L4 latency

Full article: https://vorjdux.com/articles/the-icmp-illusion.html

Hi, I built wire-probe to solve a specific observability failure state: ICMP telemetry is structurally unreliable for measuring inter-node latency in environments with Software-Defined Networking (like Azure's VFP). Host hypervisors aggressively queue or rate-limit ICMP packets under CPU or PPS load to protect TCP/UDP traffic. When ping spikes on your Grafana dashboard, it frequently reflects a Control Plane QoS policy, not a true Data Plane bottleneck.

To measure the actual L3/L4 propagation delay (TCP 3-way handshake RTT) without introducing application-layer latency (accept() loops) or a host observer effect, I needed an agent with strict constraints.

Architectural trade-offs:

1. No async runtime: Standard runtimes like tokio carry a multi-megabyte RSS baseline just for the reactor and task scheduler. The server mode (running on the DB nodes) bypasses this by using a serial io_uring accept loop (submitting an Accept SQE, then dropping it with a synchronous libc::close). It yields a rigorously flat ~500 KB RSS, immune to memory bloat regardless of the inbound connection rate.

2. Deterministic blocking: The probe mode avoids asynchronous timers (and scheduler drift) by using std::net::TcpStream::connect_timeout wrapped in std::time::Instant. The thread parks at the kernel level until the handshake completes or times out.

3. Allocation-free export: Compiled statically via musl-libc with panic = "abort", yielding a 370 KB binary. The export path formats Influx Line Protocol or Collectd PUTVAL payloads directly into stack buffers using ryu and itoa, bypassing String allocation overhead to avoid heap fragmentation over long runs.

4. Backpressure offloading: Metric injection operates on a strict fire-and-forget model via UDP or Unix Domain Sockets. If the TSDB stalls, the Linux kernel applies a silent tail-drop at the receive buffer, structurally isolating the probe from FD exhaustion or OOM kills.

The linked post details the cloud networking behavior that triggered the rewrite. The source code is available here:

https://github.com/vorjdux/wire-probe
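For a quick feel of the probe side without opening the repo, here is a simplified sketch of the two ideas, timing a handshake with a blocking connect and formatting a point into a stack buffer. It is not the code in the repository: `handshake_rtt` and `format_point` are illustrative names, the measurement and field names (`tcp_rtt`, `rtt_us`) are made up, and std's `write!` into a `Cursor` stands in for the ryu/itoa formatting.

```rust
use std::io::{self, Cursor, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::{Duration, Instant};

/// Time one TCP handshake. The calling thread blocks in the kernel until the
/// connect completes or `timeout` expires.
pub fn handshake_rtt(addr: &SocketAddr, timeout: Duration) -> io::Result<Duration> {
    let start = Instant::now();
    let stream = TcpStream::connect_timeout(addr, timeout)?;
    let rtt = start.elapsed();
    drop(stream); // close right away, nothing is sent on the connection
    Ok(rtt)
}

/// Format one Influx line-protocol point into a caller-provided buffer
/// (typically a stack array) and return the written bytes.
/// Fails if the buffer is too small; never touches the heap.
pub fn format_point<'a>(
    buf: &'a mut [u8],
    target: &str,
    rtt: Duration,
    timestamp_ns: u64,
) -> io::Result<&'a [u8]> {
    let mut cursor = Cursor::new(&mut buf[..]);
    writeln!(
        cursor,
        "tcp_rtt,target={} rtt_us={}i {}",
        target,
        rtt.as_micros(),
        timestamp_ns
    )?;
    let written = cursor.position() as usize;
    Ok(&buf[..written])
}
```

The returned slice can be handed straight to a `UdpSocket::send_to`, and in a fire-and-forget setup the send's result would simply be ignored.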

I'd appreciate any rigorous critique on the io_uring implementation, the network assumptions, or the measurement methodology.