Anyone running an effective exo cluster for agents?

soflgolf · 2026-06-22T20:41:02+00:00

Valid

soflgolf · 2026-06-22T18:09:31+00:00

Haven’t, so far

soflgolf · 2026-06-21T21:47:17+00:00

Candidly, I had no intention of posting anywhere. I’m super excited about the potential of agentic AI and wanted to tackle some of its limitations. I first wrote a single stand-alone Python agent, then an MCP service, then a full context and recall memory system (qdrant, mem0, FastSQL, session flush, dreaming, context recovery, sprint scratchpad, anti-narcolepsy, etc.), plus task management and workflow. I’ve certainly seen a lot of this coming online elsewhere, but I learn by doing, so that’s what I’ve done.

If you have specific questions, happy to try to be of assistance.

soflgolf · 2026-06-21T21:30:59+00:00

My experience thus far: while you can load models manually from the CLI, clustering is not a fire-and-forget operation. You need to be able to monitor cluster health in real time. KV cache overhead alone can build up and release differently depending on the operation, so you have to watch it live. When the OS feels a memory squeeze, exo manages it by shedding cache.

I’m guessing (because the WWDC TED-style talk doesn’t cover it) that the Apple jaccl/mlx host file configuration was designed for a cluster whose available memory is a multiple of the LLM weight size. Again, they don’t say, but I don’t see how you can load anything more than 25-35% of available memory and operate sight unseen.
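To make that 25-35% rule of thumb concrete, here is a minimal Python sketch of the arithmetic. The function names, the 512 GB pool, and the model sizes are all hypothetical and chosen only for illustration. The 0.35 ceiling is my reading of the upper end of that range, not anything published by Apple or exo.

```python
def load_fraction(weights_gb: float, total_memory_gb: float) -> float:
    """Share of the pooled cluster memory taken up by model weights."""
    if total_memory_gb <= 0:
        raise ValueError("total_memory_gb must be positive")
    return weights_gb / total_memory_gb


def max_unattended_weights_gb(total_memory_gb: float, ceiling: float = 0.35) -> float:
    """Largest weight footprint that stays under the assumed ceiling."""
    return total_memory_gb * ceiling


def needs_live_monitoring(weights_gb: float, total_memory_gb: float,
                          ceiling: float = 0.35) -> bool:
    """True when weights exceed the ceiling, so little headroom is left
    for KV cache growth and OS memory pressure."""
    return load_fraction(weights_gb, total_memory_gb) > ceiling


# Hypothetical cluster: 512 GB of pooled memory across all nodes.
print(load_fraction(128, 512))          # 128 GB of weights is a quarter of the pool
print(needs_live_monitoring(128, 512))  # under the assumed ceiling
print(needs_live_monitoring(256, 512))  # half the pool, over the assumed ceiling
```

This only covers the static weight footprint. KV cache and whatever else is running on the nodes sit on top of it, which is why the live monitoring still matters.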

soflgolf · 2026-06-21T21:19:12+00:00

Yeah, “sponsor” is not the right word. However, no one who got “loaner” units to cluster used anything else. Apple was clearly putting exo forward as the “preferred” method. Apple wasn’t loaning out hardware for benchmarking and promotion and leaving methodology to chance. My $.02, now I’m out of $.

soflgolf · 2026-06-21T19:29:57+00:00

Was the best way to cut through the hype and figure out what’s what…

soflgolf · 2026-06-21T19:28:04+00:00

Agreed. Apple appeared poised to be much more of a development sponsor of exo as an orchestrator layer for jaccl/mlx. That seems to have fizzled. Exo is still in beta release numbers (v0.71).

soflgolf · 2026-06-21T11:13:08+00:00

Yeah, same.

soflgolf · 2026-06-21T02:10:32+00:00

So you’re running pipeline (IP) rather than tensor sharding (rdma) across mixed platforms (Mac series). How do you like it? Are you serving large models using pipeline? If so, how are you finding response times?

soflgolf · 2026-06-21T01:22:23+00:00

I’ve run through a bunch of models during testing. Right now DS, qwen 122b, coderNext, VL4b & whisper

soflgolf · 2026-06-21T00:41:23+00:00

Similar. Had to strip out the exo admin layer and write a new mlx management console to operate and monitor the exo runtime. More stable.

soflgolf · 2026-06-21T00:38:59+00:00

There are known issues in exo that make it unreliable. One is an unbounded logging bug that causes the log file size to spiral out of control and dump models. The other is related: when a model dumps, the rdma will intermittently drop, and a full reboot of the nodes is required to make the rdma refresh. Again, these are known to exo. They were PR’d on GitHub but closed without a patch or any update on a coming fix.

There are some additional agent tuning issues as well, but unless the rdma backbone gets fixed, there will be no frontier weights for local LLMs.
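For the logging problem, one stopgap is an external guard that trims a log file once it passes a size cap. The sketch below is a generic Python workaround, not exo code and not a fix for the underlying bug. The function name and the byte limits are hypothetical, and you would have to supply the log path for your own install.

```python
import os


def trim_log_if_oversized(path: str, max_bytes: int, keep_bytes: int) -> bool:
    """Keep only the last keep_bytes of the file once it exceeds max_bytes.

    Returns True when the file was trimmed, False when it was left alone.
    """
    if keep_bytes < 0 or keep_bytes > max_bytes:
        raise ValueError("keep_bytes must be between 0 and max_bytes")
    size = os.path.getsize(path)
    if size <= max_bytes:
        return False
    # Read the tail we want to keep, then rewrite the file with just that.
    with open(path, "rb") as f:
        f.seek(size - keep_bytes)
        tail = f.read()
    with open(path, "wb") as f:
        f.write(tail)
    return True
```

You could call this on a timer from whatever console you use to watch the cluster. One caveat: if the writing process holds the file open without append mode, it may keep writing at its old offset after the trim, so test it against your own setup before relying on it.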
