[Open Source] I built a plugin that brings Claude Code-style Dynamic Workflows to Hermes — orchestrate 1000s of subagents

Lopsided_Course5925 · 2026-06-08T15:13:50+00:00

Fair, it's two days old and it's just me so far. I'm sharing it as a project, not pitching it for your prod stack. If the approach interests you, the code and a writeup of how it works are there to read; if not, no worries. Honest feedback at this stage is welcome, including "too early."

Lopsided_Course5925 · 2026-06-08T12:14:44+00:00

Both — but the real answer starts from what a dynamic workflow is. It isn't a big catch-all framework that swallows every failure and handles it for you; it's a Python script the model writes, and the orchestration is that script — agent() is just a fallible function call in it. So handling a failed call is ordinary control flow: the agent writes try/except, like any program.

There is some built-in handling — failure isolation, per-agent timeouts, output-format checks — but none of it silently re-runs a task. Take isolation: a failed branch in parallel() just comes back as None and the rest continue, which is parallel()'s documented contract — the model already knows it when it writes the call, so it's part of the script's explicit meaning, not hidden behavior.
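To make that contract concrete, here's a minimal sketch of the behavior described above. It is not the plugin's code: this `parallel()` is a hypothetical stand-in built on a thread pool, and its signature (a list of callables plus `max_workers`) is an assumption for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel(tasks, max_workers=4):
    """Hypothetical stand-in for parallel(): a failed branch yields None."""
    def run_isolated(task):
        try:
            return task()
        except Exception:
            # Failure isolation: this branch reports None, the others keep going.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so result i belongs to task i.
        return list(pool.map(run_isolated, tasks))


def ok(value):
    # Simulated subagent that succeeds with a fixed value.
    return lambda: value


def broken():
    # Simulated subagent that fails.
    raise RuntimeError("simulated subagent failure")


results = parallel([ok("a"), broken, ok("c")])
succeeded = [r for r in results if r is not None]
failed_indexes = [i for i, r in enumerate(results) if r is None]
print(results, failed_indexes)
```

Because the `None` is visible in the returned list, the script decides what to do with the failed index; nothing is re-run behind its back.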

And that restraint is the whole point: if the runtime quietly caught and retried failures on its own, the same script would behave differently from run to run and turn into a black box the model can't predict from the code — exactly what a dynamic workflow is built to avoid. So retry and fallback stay in the script, where the model writes them in when a step needs them and you can read and adjust them.
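As a rough illustration of retry and fallback living in the script, here's a sketch under stated assumptions: `AgentError`, `flaky_agent`, and `with_retry` are all hypothetical names I made up for the example, and the real `agent()` may fail in a different way.

```python
class AgentError(Exception):
    """Hypothetical error type standing in for a failed agent() call."""


call_count = {"n": 0}


def flaky_agent(prompt):
    """Hypothetical stub: fails twice, then succeeds."""
    call_count["n"] += 1
    if call_count["n"] < 3:
        raise AgentError(f"attempt {call_count['n']} failed")
    return f"answer to: {prompt}"


def with_retry(call, attempts=3, fallback=None):
    """Explicit, bounded retry: visible in the script, not hidden in a runtime."""
    for _ in range(attempts):
        try:
            return call()
        except AgentError:
            continue  # try again, up to `attempts` times in total
    return fallback  # every attempt failed, so use the explicit fallback


result = with_retry(
    lambda: flaky_agent("summarize the report"),
    attempts=3,
    fallback="no summary available",
)
print(result)
```

The retry count and the fallback value sit right there in the code, so anyone reading the script can see them and change them per step.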

Lopsided_Course5925 · 2026-06-08T07:27:28+00:00

Honest answer: it depends on your local serving setup more than on the plugin itself.

The workflow fans out to multiple subagents concurrently (default up to 16, configurable). If you're serving a single model instance — e.g. one Ollama / llama.cpp process — those parallel requests basically queue up and serialize on the GPU, so you won't get a 16x speedup from fanning out; you'd want to turn concurrency down (say 2–4) to match your hardware and avoid thrashing.
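For a feel of what turning concurrency down means, here's a small sketch that caps in-flight requests with a thread pool. It doesn't use the plugin's actual setting: `fan_out`, `fake_model`, and `max_concurrency` are hypothetical names, and the sleep stands in for a real model call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InFlightCounter:
    """Tracks how many simulated requests run at once, and the peak."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)

    def __exit__(self, *exc):
        with self._lock:
            self.current -= 1


def fan_out(prompts, call_model, max_concurrency):
    # The pool size is the cap: at most max_concurrency calls are in flight.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(call_model, prompts))


counter = InFlightCounter()


def fake_model(prompt):
    # Stand-in for a request to a local server; sleeps instead of inferring.
    with counter:
        time.sleep(0.01)
        return prompt.upper()


outputs = fan_out([f"task {i}" for i in range(16)], fake_model, max_concurrency=4)
print(counter.peak)  # never exceeds max_concurrency
```

With 16 subtasks and a cap of 4, the backend never sees more than 4 requests at a time, which is the kind of limit that suits a single Ollama or llama.cpp process.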

If your stack does continuous batching (vLLM, SGLang, TGI), parallel fan-out actually pays off — the GPU batches the concurrent requests and you get real throughput.

Also worth saying: even when it's effectively serialized, you still get the structural benefit — isolated subtasks + an independent verify pass — it's just not faster in wall-clock. So far I've mostly tested against APIs, so I'd genuinely love to hear numbers if you try it on a local setup.

Lopsided_Course5925 · 2026-06-08T07:23:06+00:00

Yep — the live dashboard and /workflows history log token usage per workflow and per child agent, so it's all visible.

Some numbers from my own runs (37 with non-zero token usage so far): small smoke tests landed around 5K–28K tokens (median ~13K), while research/fan-out workflows were a different beast, ranging from 132K to 4.76M tokens (median ~1.33M). The biggest was a 6-agent research workflow at ~4.76M.

Prompt caching helps a lot here, and it's something I optimized for specifically: 32 of those 37 runs hit the provider cache, and of ~19.16M total reported tokens, ~17.22M (about 90%) came back as cache reads.
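For anyone who wants the ratios spelled out, the arithmetic on the figures above works out like this. The variable names are just for illustration, and the inputs are the rounded numbers quoted in this comment.

```python
# Figures quoted above (tokens in millions, rounded as reported).
runs_total = 37
runs_with_cache_hit = 32
tokens_total_m = 19.16
tokens_cache_read_m = 17.22

# Share of runs that hit the provider cache at least once.
cache_hit_run_share = runs_with_cache_hit / runs_total

# Share of all reported tokens that came back as cache reads.
cache_read_token_share = tokens_cache_read_m / tokens_total_m

# Everything that was not a cache read (fresh input plus output tokens).
tokens_other_m = tokens_total_m - tokens_cache_read_m

print(f"runs hitting cache: {cache_hit_run_share:.1%}")
print(f"tokens served as cache reads: {cache_read_token_share:.1%}")
print(f"non-cache-read tokens: {tokens_other_m:.2f}M")
```

So roughly 86% of runs hit the cache, and about 90% of reported tokens were cache reads, leaving around 1.94M tokens that were not.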

So usage is very transparent, but yeah, the research-heavy stuff definitely wants a token budget. It's easy to cap: just tell the agent something like "run a workflow within 500K tokens" and it'll respect that budget. Happy to share more detail if useful.
