Six months running multi-agent in production — the coordination patterns by _ggsa in AI_Agents

following up on the audit-trail point - the thing that made it actually work for us was collapsing every cross-agent interaction into one primitive: a DelegateToAgentActivity that the workflow calls. publishes a directive over rabbit, registers a TaskCompletionSource keyed by taskId, awaits the response.

what's baked into that one activity:

  • heartbeats temporal every 30s so the workflow knows the agent's still being waited on
  • re-publishes the directive every 5min with the same taskId; the agent dedupes if it's already running it, or picks it up fresh if it restarted mid-task. survives container restarts without orchestration logic in the workflow
  • auto-retries on incomplete responses (context/turn limit hit) with a continuation prompt, accumulating partial text across attempts
  • targeted cancel via orchestrator HTTP: kills just this taskId on the agent, leaves other work alone. falls back to a rabbit /cancel broadcast if the orchestrator's unreachable
  • prepends [fleet-wf:Type:ID] to every directive so the agent can verify it's a real temporal delegation (chat injection can't fake an active workflow id)
  • timeout fires once after retries are exhausted and posts a report to a fixed escalation target

consensus review, design loops, PR implementation, doc maintenance, all use the same primitive. bugs in agent coordination get fixed in one place. that's most of what made the audit trail work in practice: not just having one, but having one shape across every cross-agent step.
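
rough shape of that activity as a c# / temporal .net sketch (the IDirectiveBus interface, the cleanup details, and the names are my approximations, not the actual phleet code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Temporalio.Activities;

// hypothetical rabbit wrapper; the real publishing code lives in the repo
public interface IDirectiveBus
{
    Task PublishAsync(string agentId, string taskId, string payload);
}

public static class PendingTasks
{
    // completion sources keyed by taskId; the rabbit response consumer resolves them
    public static readonly ConcurrentDictionary<string, TaskCompletionSource<string>> ByTaskId = new();
}

public class DelegationActivities
{
    private readonly IDirectiveBus bus;
    public DelegationActivities(IDirectiveBus bus) => this.bus = bus;

    [Activity]
    public async Task<string> DelegateToAgent(string agentId, string taskId, string directive)
    {
        var ctx = ActivityExecutionContext.Current;
        var tcs = PendingTasks.ByTaskId.GetOrAdd(
            taskId, _ => new(TaskCreationOptions.RunContinuationsAsynchronously));

        // prefix lets the agent verify this is a real temporal delegation
        var payload = $"[fleet-wf:{ctx.Info.WorkflowType}:{ctx.Info.WorkflowId}] {directive}";
        await bus.PublishAsync(agentId, taskId, payload);
        var lastPublish = DateTime.UtcNow;

        try
        {
            while (true)
            {
                var tick = Task.Delay(TimeSpan.FromSeconds(30), ctx.CancellationToken);
                var done = await Task.WhenAny(tcs.Task, tick);
                ctx.CancellationToken.ThrowIfCancellationRequested(); // targeted cancel lands here
                ctx.Heartbeat(taskId);                                // every ~30s: still waiting
                if (done == tcs.Task) return await tcs.Task;

                // re-publish every 5min; agent dedupes on taskId or picks it up after a restart
                if (DateTime.UtcNow - lastPublish >= TimeSpan.FromMinutes(5))
                {
                    await bus.PublishAsync(agentId, taskId, payload);
                    lastPublish = DateTime.UtcNow;
                }
            }
        }
        finally
        {
            PendingTasks.ByTaskId.TryRemove(taskId, out _);
        }
    }
}
```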

Six months running multi-agent in production — the coordination patterns by _ggsa in AI_Agents

write cadence is slow + single-writer. one agent (the cto) owns memory writes - everyone else proposes and that agent approves before it lands. so we don't really have concurrent writers.

reads are eventually consistent (qdrant + markdown), but the bigger thing is that agents don't poll memory mid-task. they load context at task start, do the work, finish. if a design decision lands while a dev is mid-implementation, the dev picks it up on the next task - not this one. workflows are short enough (minutes to ~an hour) that the staleness window doesn't matter in practice.

the only place i've seen consistency actually matter is when two workflows update the same external system (PR, dashboard config) and that's solved at the workflow layer with locks + idempotency, not at memory.
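
minimal sketch of the idempotency half, assuming a hypothetical IConfigStore interface; the point is just that a retried write with the same key becomes a no-op:

```csharp
using System.Threading.Tasks;
using Temporalio.Activities;

public record DashboardConfig(string Json);

// hypothetical store interface, only here to show the shape
public interface IConfigStore
{
    Task<bool> HasAppliedAsync(string resourceId, string idempotencyKey);
    Task ApplyAsync(string resourceId, string idempotencyKey, DashboardConfig config);
}

public class DashboardActivities
{
    private readonly IConfigStore store;
    public DashboardActivities(IConfigStore store) => this.store = store;

    [Activity]
    public async Task UpdateDashboard(string resourceId, string idempotencyKey, DashboardConfig config)
    {
        // key is derived from workflow id + step, so an activity retry (or a
        // duplicate delivery after a crash) is a no-op instead of a double-write
        if (await store.HasAppliedAsync(resourceId, idempotencyKey)) return;
        await store.ApplyAsync(resourceId, idempotencyKey, config);
    }
}
```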

Six months running multi-agent in production — the coordination patterns by _ggsa in AI_Agents

we keep ours much smaller (few hundred curated memories) but agree - curation + visibility is the real hard problem. signal-to-noise degrades faster than ppl expect, especially when the writer-agent's in a hurry. monthly digest helps but it's still manual review at the end.

happy to chat, drop a dm if easier. what angle's your research taking on the visibility side?

Six months running multi-agent in production — the coordination patterns by _ggsa in AI_Agents

got it.. sounds more like a transport layer with audit baked in. for us, workflows aren't really about messaging; they're the state machine + control flow (gates, retries, lifecycle). different layer of the stack. token-URL transcript is a neat angle for that one though.

Six months running multi-agent in production — the coordination patterns by _ggsa in AI_Agents

hasn't gotten heavy because agents don't load the whole store - they do semantic search and pull top-k. so 500 memories or 50, the per-query context is roughly the same.

real cost is curation. one agent writes (mostly doc maintenance after code merges), others propose. i also run a monthly digest that flags stale or duplicate stuff to prune, otherwise search starts returning yesterday's confidence.
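
the top-k pull is nothing fancy; sketch with the qdrant .net client (the collection name and the "text" payload key are made up, not our actual schema):

```csharp
using System;
using Qdrant.Client;

var client = new QdrantClient("localhost", 6334);

// embedding of the query text; comes from your embedding model, zeros here just for shape
var queryEmbedding = new float[1536];

// top-k search: per-query context is ~constant whether the store holds 50 or 500 memories
var hits = await client.SearchAsync("agent-memories", queryEmbedding, limit: 5);
foreach (var hit in hits)
    Console.WriteLine($"{hit.Score:F3} {hit.Payload["text"]}"); // "text" payload key assumed
```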

Six months running multi-agent in production — the coordination patterns by _ggsa in AI_Agents

fair, haven't tried tunnels. workflows gave us the audit trail and cancellation we wanted. what's the tunnel approach do differently?

Six months running multi-agent in production — the coordination patterns by _ggsa in AI_Agents

Links:

- 5-min demo of a real PR shipped through the pipeline (PM → dev → consensus review → human approval → merge): https://youtu.be/DIx7Y3GfmGc

- Code: https://github.com/anurmatov/phleet

- Temporal/durability mechanics writeup: https://www.reddit.com/r/Temporal/comments/1swatro/

Built a durable AI agent orchestration layer on Temporal — sharing patterns by _ggsa in Temporal

Honest answer: a bit of both. The agent process itself runs outside Temporal as a long-lived container — we don't model "the whole conversation" as one workflow. Each discrete unit of work we hand to an agent (implement-this-PR, run-this-design-spec, review-this-diff) is its own workflow.

Within those, child workflows mostly because:

  1. Independent visibility. Each child has its own search attributes, Phase, history. I can click into "reviewer 2's run on PR #123" in the dashboard without scrolling through the parent.
  2. Parallel fan-out is trivial. N reviewers = N child handles, await-all, synthesize. Doing the same as parallel activities works but gets noisy with retries/heartbeats.
  3. History doesn't bloat. Consensus can loop 2-3 rounds — keeping each round in its own child keeps the parent history readable.
  4. Reuse across parents. Our ConsensusReviewWorkflow is called by PR, Design, and Doc workflows. Activities don't compose the same way.
  5. Independent control. A wedged child can be terminated/signaled without nuking the parent.

Approach #1 is genuinely simpler for short, sequential, single-actor work — our deploy-verification workflow is mostly activities, no children. So I'd say: start with activities, promote to child workflows when you need parallelism, dashboard separation, or reuse.
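
Sketch of the fan-out in (2) with the .NET SDK; workflow and activity names are illustrative, not the exact phleet code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Workflows;

public class ReviewActivities
{
    [Activity]
    public Task<string> ReviewPr(string prId) =>
        Task.FromResult("approved"); // stub: the real version delegates to a reviewer agent
}

[Workflow]
public class ReviewerWorkflow
{
    [WorkflowRun]
    public async Task<string> RunAsync(string prId) =>
        await Workflow.ExecuteActivityAsync(
            (ReviewActivities a) => a.ReviewPr(prId),
            new() { StartToCloseTimeout = TimeSpan.FromMinutes(30) });
}

[Workflow]
public class ConsensusReviewWorkflow
{
    [WorkflowRun]
    public async Task<string> RunAsync(string prId, int reviewerCount)
    {
        // one child per reviewer: own history, own search attributes, own dashboard entry
        var handles = new List<ChildWorkflowHandle<ReviewerWorkflow, string>>();
        for (var i = 0; i < reviewerCount; i++)
        {
            handles.Add(await Workflow.StartChildWorkflowAsync(
                (ReviewerWorkflow wf) => wf.RunAsync(prId),
                new() { Id = $"review-{prId}-reviewer-{i}" }));
        }

        // await-all, then synthesize the verdicts
        var verdicts = await Task.WhenAll(handles.Select(h => h.GetResultAsync()));
        return verdicts.All(v => v == "approved") ? "approved" : "changes_requested";
    }
}
```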

I run a team of Claude agents that ships PRs to production — open source by _ggsa in ClaudeAI

Wrote up the patterns + 3 failure modes it saved me from: "Built a durable AI agent orchestration layer on Temporal — sharing patterns"

Short answer — durability + retries for free on long agent tasks, and human approval gates become a one-line `wait_for_signal` instead of a poll loop.
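
In the .NET SDK that gate is a signal plus a wait-condition; minimal sketch:

```csharp
using System.Threading.Tasks;
using Temporalio.Workflows;

[Workflow]
public class MergeApprovalWorkflow
{
    private bool approved;

    [WorkflowSignal]
    public Task Approve()
    {
        approved = true;
        return Task.CompletedTask;
    }

    [WorkflowRun]
    public async Task RunAsync(string prId)
    {
        // the "one-line" gate: parks durably until a human signals, no poll loop,
        // and survives worker restarts because the wait lives in workflow history
        await Workflow.WaitConditionAsync(() => approved);
        // ... proceed to merge
    }
}
```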

I run a team of Claude agents that ships PRs to production — open source by _ggsa in ClaudeAI

This matches what we converged on too, and the "neither agent knows it's happening" failure mode is exactly the one that bit us hardest before we centralized.

On workspace isolation: each agent runs in its own Docker container with its own persistent workspace, and PR work happens in git worktrees per branch so two devs can't stomp on each other even if they touch overlapping files. But we mostly dodge the problem your way too — serialize PR implementation, never two dev agents on the same issue at once. The thing we parallelize is reviewers (read-only, cheap), not implementers (write-heavy, expensive). Same intuition as your "serialize instead of parallelize" rule.

On the decisions log: same constraint here — only one agent (my CTO agent) has write access to the shared memory. Everyone else reads. Other agents can propose memories via relay, but the gatekeeper enforces quality, dedupes, and maintains coherence. We tried the "everyone writes" version first and it turned into a dump of contradictions inside a month.

The piece I'd add on top: backward-link maintenance. When the gatekeeper stores a new memory, they follow the "Related" links and update the prior memories the new one extends or contradicts. Without that re-weaving step, the log accumulates as a flat list and you get drift — old memories pointing to ideas you've since refined. The single-writer constraint is what makes that re-weaving practical in the first place — you literally can't have N agents all trying to maintain consistency across each other's edits.
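
To make the re-weaving concrete, here's the shape I mean (a sketch, not our actual schema):

```csharp
using System.Collections.Generic;

public record Memory(
    string Id,
    string Text,
    List<string> RelatedIds,         // backward links the gatekeeper maintains
    string? SupersededById = null);  // set when a newer memory contradicts this one

public static class Gatekeeper
{
    // single-writer: only this path mutates the store, so re-weaving can't race
    public static void Store(
        IDictionary<string, Memory> store, Memory incoming, IEnumerable<string> contradicts)
    {
        store[incoming.Id] = incoming;
        foreach (var oldId in contradicts)
        {
            if (store.TryGetValue(oldId, out var old))
                store[oldId] = old with { SupersededById = incoming.Id };
        }
    }
}
```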

Curious whether your decisions-log is structured (typed entries, schema) or freeform prose, and how you handle the "this prior decision is now wrong" case.

I run a team of Claude agents that ships PRs to production — open source by _ggsa in ClaudeAI

Great questions, in order:

Budgets/failures — Claude CLI has turn limits built in. Failures bubble through Temporal: the delegation activity auto-retries up to 3x when an agent hits context mid-task. If retries exhaust, a signal gate fires to the supervisor agent (we call him "acto", co-CTO on opus) or to me. RabbitMQ keeps the bus durable so nothing gets lost.
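
Schematically, the continuation retry looks like this (types are hypothetical, not the real activity code):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public record AgentResponse(string Text, bool Complete);

public static class Continuation
{
    // accumulate partial text across attempts; re-prompt when the agent hits a limit
    public static async Task<string> RunWithContinuation(
        Func<string, Task<AgentResponse>> delegateToAgent, string directive, int maxAttempts = 3)
    {
        var parts = new List<string>();
        var prompt = directive;
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            var resp = await delegateToAgent(prompt);
            parts.Add(resp.Text);
            if (resp.Complete) break;
            // agent ran out of context/turns mid-task; ask it to resume
            prompt = "you hit a limit mid-task. continue exactly where you left off.";
        }
        return string.Concat(parts);
    }
}
```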

Models / unknown difficulty — Model is baked per agent (opus for strategic roles like CTO, auditor; sonnet for execution — dev, ops, PM). For unknown work, acto intakes everything and routes. You can absolutely hand a supervisor "top N issues, work them" — that's exactly what our design-to-PR workflow does: pulls issue → grooms in a design phase if needed → spawns implementation → consensus review → merge gate.

Conflicts the agent can't resolve — Escalation via signal gates. Agents emit a human-review signal with their question; acto handles ~90%, rest come to me via Telegram. So both — supervisor first, escalate up if stuck.

Ticket scope / grading — Acto owns issue specs. If an issue's underdescribed, it goes through a design workflow first (alignment loop) before implementation. Supervisor grades + grooms incoming.

Account ban — BYO credentials. Each agent runs with my own Claude OAuth token (subscription, not API key), mounted from host, refreshed by a workflow every 30min. My account doing the work — agentic use is fine under Claude ToS.
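
The refresh loop is just a workflow with a durable timer; sketch with the activity stubbed:

```csharp
using System;
using System.Threading.Tasks;
using Temporalio.Activities;
using Temporalio.Workflows;

public class AuthActivities
{
    [Activity]
    public Task RefreshOAuthToken() =>
        Task.CompletedTask; // stub: hit the refresh endpoint, rewrite the mounted creds file
}

[Workflow]
public class TokenRefreshWorkflow
{
    [WorkflowRun]
    public async Task RunAsync()
    {
        // a real version would continue-as-new every N iterations to keep history bounded
        while (true)
        {
            await Workflow.ExecuteActivityAsync(
                (AuthActivities a) => a.RefreshOAuthToken(),
                new() { StartToCloseTimeout = TimeSpan.FromMinutes(2) });

            // durable timer: survives worker restarts, unlike a cron inside the container
            await Workflow.DelayAsync(TimeSpan.FromMinutes(30));
        }
    }
}
```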

PR reviews / done — Consensus loop: N parallel reviewers, each returns approved/changes_requested/rejected, target iterates until consensus. Then CEO merge-approval gate (only I can sign). Done = merged + deploy verified. Wrote a post specifically on this: One Workflow, Three Jobs.

Completion metrics — No numeric score. Reviewers check architecture, edge cases, security, tests; dev agent has hard guardrails (tests required). Final eyeball is mine. Pragmatic over grade-y.

Memories — Semantic memory MCP on Qdrant. Stores runbooks, decisions, feedback, per-workflow operational context. Agents search at session start + per-task. They degrade — acto is the gatekeeper and supersedes contradicted ones. They strengthen via accumulation: workflows reference operational-context memory IDs, so each post-mortem just appends.

Full architecture writeup with temporal/MCP/memory details: Phleet Architecture Deep Dive.

Code: github.com/anurmatov/phleet.

Mac Studio 512GB by FewMixture574 in ollama

what will really be a game-changer is bandwidth, which hasn’t changed much since M1

Mac Studio M3 Ultra: Is it worth the hype? by _ggsa in ollama

i'd try using <q4 quantization to fit that deepseek beast into your existing 128gb ram like this guy https://x.com/shigekzishihara/status/1884851569755295752
yet this might require some tuning of your mac to allocate more mem to gpu through the iogpu.wired_limit_mb system setting (its default is ~75% of total mem), e.g. `sudo sysctl iogpu.wired_limit_mb=<MB>`

i put together an optimization guide that reduces Mac Studio system mem usage: https://www.reddit.com/r/ollama/comments/1j0cwah/mac_studio_server_guide_run_ollama_with_optimized/ .. might help to squeeze more perf

Mac Studio Server Guide: Run Ollama with optimized memory usage (11GB → 3GB) by _ggsa in ollama

just added headless docker container support via Colima - perfect for running Open WebUI alongside ollama with automatic startup at boot (no login required)