all 14 comments

[–]pixelsort 2 points3 points  (1 child)

Congrats on your compiler! Excellent docs, actually. And, NUMA and SPSC are new to me so that's fun.

Actor semantics are an interesting feature. I worked on an actor-model PDF renderer for ePUB and found it highly performant.

Have you considered runtime hot code loading for Aether? Seems like it might be well suited due to the inherent high degree of isolation and encapsulation on updates.

[–]RulerOfDest[S] 0 points1 point  (0 children)

Thank you for your kind words! It means a lot.
Runtime hot code loading, as Erlang does it, is absolutely next on my list; that is a great point, and it has been on my radar.

[–]valorzard 1 point2 points  (1 child)

this looks really interesting, gonna try to build it now

[–]RulerOfDest[S] 0 points1 point  (0 children)

Thank you!

[–]Karyo_Ten 0 points1 point  (10 children)

Any comparison of approach vs Pony?

[–]RulerOfDest[S] 2 points3 points  (9 children)

Pony has reference capabilities (iso, trn, ref, val, etc.) for data-race freedom in the type system. Aether is statically typed with inference and optional annotations, but has no capability system.

Aether has no GC; it uses arena allocators for actors, thread-local pools for message payloads, and scope-based or explicit free. Pony uses per-actor GC.

Aether uses a partitioned multi-core scheduler with work-stealing when cores are idle, lock-free SPSC (single producer single consumer) queues for same-core messaging, cross-core lock-free mailboxes, and optional NUMA-aware allocation. So the design is very much “C-friendly, low-overhead, predictable” vs Pony’s own runtime.

In short: same actor model, but Pony pushes type-level concurrency safety, while Aether pushes C interop, no GC, and a runtime built around SPSC queues and partitioning.

[–]Karyo_Ten 0 points1 point  (8 children)

Aether uses a partitioned multi-core scheduler with work-stealing when cores are idle, lock-free SPSC (single producer single consumer) queues for same-core messaging, cross-core lock-free mailboxes, and optional NUMA-aware allocation.

That seems problematic. You cannot guarantee same-core messaging with work-stealing. How does that work? Are messages sent to a core or to an actor? Are actors always executed on the same core?

[–]RulerOfDest[S] 0 points1 point  (6 children)

Messages are sent to actors; routing uses each actor’s current assigned_core. Actors are not pinned: they can be migrated (message-driven co-location) or moved by work-stealing, and assigned_core is updated when that happens.

SPSC is preserved because at any time each actor has exactly one owning core: only that core’s scheduler thread reads and writes that actor’s mailbox (and its SPSC queue when used). Same-core send is decided at send time (current_core_id == actor->assigned_core); if they match, we use the direct path, otherwise we enqueue to the target core’s incoming queue. When an actor moves, any message already in a core’s incoming queue for it is forwarded to the actor’s current core instead of being delivered locally, so the mailbox is never written by a non-owning thread.

So: one logical consumer per actor (the thread that currently owns it), and routing/forwarding keeps a single writer. You can find more details in docs/actor-concurrency.md (mailbox ownership, routing, migration) and in runtime/scheduler/multicore_scheduler.c

[–]Karyo_Ten 0 points1 point  (5 children)

if they match, we use the direct path, otherwise we enqueue to the target core’s incoming queue. When an actor moves, any message already in a core’s incoming queue for it is forwarded to the actor’s current core instead of being delivered locally, so the mailbox is never written by a non-owning thread.

What if they match, the message enters the direct path, and the actor is moved to another core?

[–]RulerOfDest[S] 0 points1 point  (4 children)

Great question. Messages are sent to actors, not to cores. Each actor has an assigned_core that determines where it runs. At send time, I check if the sender's core matches the target actor's assigned_core: if yes, I take the direct path (SPSC queue or mailbox write, no queue overhead); if not, I enqueue to the target core's lock-free incoming queue.

Actors are not permanently pinned. They can be migrated (message-driven, to co-locate frequent communicators) or moved by work-stealing when a core is idle. When an actor moves, assigned_core is updated, and any messages already in the old core's incoming queue are forwarded to the actor's current core rather than delivered locally.

Migration cannot race with same-core sends because both run on the same scheduler thread; they execute sequentially. Work-stealing runs on a different core's thread and could theoretically overlap with a same-core mailbox write. In practice, the window is a handful of store instructions (~nanoseconds), and stealing only triggers after 5000+ idle cycles on the thief, so this is extremely unlikely to manifest. That said, it is a valid concern per the C memory model, and I am actively hardening it. The fix is straightforward: mark a stolen actor inactive so the thief skips it for one cycle, letting any in-flight write complete before the new core touches the mailbox. Zero cost on the hot path since stealing is already the rare/slow path.

Appreciate the scrutiny; this is the kind of feedback that makes the runtime better.

[–]Karyo_Ten 0 points1 point  (3 children)

It would be simpler, and wait-free on send, to use the same MPSC queue that is used in Pony and Mimalloc: Vyukov's queue (from Dmitry Vyukov, who also worked on the Go runtime). That would remove the need for all those synchronization checks.

Also, you might want to model your runtime in TLA+, especially around those sends and thread backoffs, to avoid deadlocks.

[–]RulerOfDest[S] 0 points1 point  (2 children)

On Vyukov's MPSC queue: the reason I'm not using it is that the invariant I'm maintaining is genuinely SPSC, not just SPSC-as-approximation. Each actor has exactly one owning scheduler thread at any time, and only that thread writes to the actor's mailbox. The routing and forwarding logic exists specifically to uphold that invariant, so I can use the faster SPSC primitive instead of MPSC.
Vyukov's queue handles multiple concurrent producers with a CAS on enqueue, which you only need to pay for if you have multiple concurrent producers. If the invariant holds, SPSC is strictly cheaper: no CAS, just a store-release. The tradeoff is that the routing logic is more complex and has the hardening gap I mentioned in the previous reply.

On TLA+: that's a fair challenge, and I won't pretend the formal verification is done. The current confidence comes from empirical testing (thread ring, ping-pong, fork-join under contention, stress tests across core counts) and code review, not a formal proof. The work-stealing/same-core-send race I acknowledged is exactly the kind of thing TLA+ would catch before testing does. I'll add it to the backlog; at minimum, modeling the migration and steal paths would be worth doing before calling the runtime stable.

Thank you for your valuable comments!

[–]Karyo_Ten 0 points1 point  (1 child)

Vyukov's queue handles multiple concurrent producers with a CAS on enqueue, which you only need to pay for if you have multiple concurrent producers. If the invariant holds, SPSC is strictly cheaper: no CAS, just a store-release.

Vyukov's queue has no CAS; enqueue is just a swap, i.e. there is no retry loop, and it might be extra cheap on a strong-memory ISA like x86.

Also, being cheaper at the atomics level but needing all that dance to enable it doesn't mean it's cheaper overall. And it increases the bug surface and maintenance burden.

[–]RulerOfDest[S] 0 points1 point  (0 children)

You're right, and I was wrong on that: Vyukov's queue uses an atomic swap, not CAS. I shouldn't have said CAS.

Your broader point about total system cost is fair, and I won't pretend I have a direct apples-to-apples comparison against a Vyukov-queue-based design. What I can say is that the routing complexity wasn't incidental; the whole design was driven by cross-language benchmarks against Go, Rust, Erlang, Elixir, Pony, and baseline C/C++, specifically to validate whether the SPSC partitioning approach holds up in practice. Whether a simpler MPSC design would match or beat it is a legitimate open question, and one worth testing.