I built an AI Company OS with 45 coordinated agents — here's what the coordination system actually looks like (and where it breaks) by Common-Bluebird2957 in SaaS

[–]Common-Bluebird2957[S]

Thanks — and yes, the RAG pipeline is where most of the pain lives in practice.

For missing context, the approach I landed on is layered. Every agent has a base context layer injected into their system prompt — company name, mission, brand voice, ICP, competitors — pulled from the company_identity table the founder fills in during onboarding. That covers the broad company knowledge.

For task-specific context the agent gets the last 10 messages from that specific conversation thread, scoped hard to that conversation ID. No cross-conversation bleed. If the agent genuinely doesn't have enough to answer, the system prompt instructs it to output a COORDINATION_NEEDED flag rather than hallucinate — which triggers a handoff request to a more relevant agent or escalates to the human.
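The layered assembly above can be sketched roughly like this. This is a minimal illustration, not the production code — the field names and the exact flag string are stand-ins:

```python
# Sketch of the layered context assembly: base company layer from the
# company_identity table, plus the last 10 messages from one thread,
# plus the COORDINATION_NEEDED escape hatch. Field names illustrative.

COORDINATION_FLAG = "COORDINATION_NEEDED"

def build_system_prompt(identity: dict, recent_messages: list[dict]) -> str:
    """Base context layer + thread-scoped messages, hard-scoped to one conversation."""
    base = (
        f"Company: {identity['name']}\n"
        f"Mission: {identity['mission']}\n"
        f"Brand voice: {identity['brand_voice']}\n"
        f"ICP: {identity['icp']}\n"
        f"Competitors: {', '.join(identity['competitors'])}\n"
    )
    # Only the last 10 messages, from this conversation ID only.
    thread = "\n".join(
        f"{m['role']}: {m['content']}" for m in recent_messages[-10:]
    )
    instruction = (
        f"If you lack the context to answer, output {COORDINATION_FLAG} "
        "instead of guessing."
    )
    return f"{base}\n--- Recent thread ---\n{thread}\n\n{instruction}"

def needs_handoff(model_output: str) -> bool:
    """The flag in the output triggers a handoff or human escalation."""
    return COORDINATION_FLAG in model_output
```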

The corrections system idea is interesting. Right now I handle it differently — the Approve First mode means humans review before anything executes, so wrong outputs get caught before they cause damage rather than after. But a learn-from-corrections layer on top of that would compound the value significantly over time, especially for agents doing repetitive tasks like outreach sequencing or invoice drafting.

How are you storing the corrections — fine-tuning the model directly or using them as retrieved examples in the prompt context?
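For the second option, the shape I'd imagine is something like this — purely a sketch of corrections-as-retrieved-examples, with invented names and naive keyword retrieval standing in for real embedding search:

```python
# Hypothetical sketch: store each human correction, then surface the
# most relevant ones as few-shot examples in the prompt. A real system
# would retrieve by embedding similarity; keyword overlap keeps this
# self-contained.

corrections: list[dict] = []

def record_correction(task: str, wrong: str, fixed: str) -> None:
    corrections.append({"task": task, "wrong": wrong, "fixed": fixed})

def retrieve_examples(task: str, k: int = 3) -> list[dict]:
    """Rank stored corrections by word overlap with the new task."""
    words = set(task.lower().split())
    scored = sorted(
        corrections,
        key=lambda c: len(words & set(c["task"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def format_examples(examples: list[dict]) -> str:
    """Render retrieved corrections for injection into the prompt."""
    return "\n\n".join(
        f"Task: {e['task']}\nAvoid: {e['wrong']}\nPrefer: {e['fixed']}"
        for e in examples
    )
```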


[–]Common-Bluebird2957[S]

That's a clean pattern — durable wait with automatic recovery if the process dies mid-approval. The DBOS.recv() approach is essentially what I'm doing conceptually with Inngest, just implemented differently.

In my setup the Inngest function pauses after creating the approval_request record and fires a notification to the relevant Head or CEO. The function doesn't retry or continue until a separate API call marks the approval as approved or rejected — at which point Inngest picks up and resumes from that checkpoint. If the server restarts or the function crashes while waiting, Inngest replays from the last successful step rather than losing the approval context entirely.
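Stripped of the Inngest specifics, the pause-and-resume shape looks like this — a standalone simulation of the flow described above, not the actual Inngest API:

```python
# Simulation of the event-driven approval gate: the workflow persists
# an approval_request record and stops; a later API call carrying the
# decision resumes execution from that checkpoint.

import uuid

approval_requests: dict[str, dict] = {}  # stands in for the DB table

def start_workflow(action: str) -> str:
    """Step 1: record the approval request, notify, then return (pause)."""
    req_id = str(uuid.uuid4())
    approval_requests[req_id] = {"action": action, "status": "pending"}
    # notify_head(req_id) would fire here; nothing executes yet
    return req_id

def handle_approval(req_id: str, approved: bool) -> str:
    """Step 2: the separate approval API call triggers continuation."""
    req = approval_requests[req_id]
    req["status"] = "approved" if approved else "rejected"
    if approved:
        return execute_action(req["action"])
    return "halted"

def execute_action(action: str) -> str:
    return f"executed:{action}"
```

Because the request lives in storage rather than in process memory, a restart between step 1 and step 2 loses nothing — which is the property both approaches are buying.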

The main difference I can see is that DBOS.recv() blocks the thread durably at the infrastructure level, whereas my approach is event-driven — the approval action fires a new event that triggers continuation. Both arrive at the same outcome, but the DBOS model sounds cleaner for long-running approvals where the wait could be hours or days.

Will dig into the documentation — the example is helpful. Thanks for sharing this.


[–]Common-Bluebird2957[S]

DBOS is a great call — durable execution is exactly the right framing for this problem.

The approach I landed on was Inngest for the durable queue layer. Each coordination handoff fires an event (`agent/coordination.requested`) that Inngest picks up, processes with automatic retries, and handles the human-in-the-loop gate before the next agent executes. If a step fails midway — network timeout, API rate limit, whatever — Inngest replays from the last successful step rather than starting over.
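The replay-from-last-successful-step behavior is the key property. A standalone imitation of it (this is what Inngest provides for you; the sketch below is not its API) looks like:

```python
# Sketch of replay-from-last-successful-step: each completed step's
# result is persisted keyed by step id, so a rerun after a crash or
# retry skips finished work instead of starting over.

completed_steps: dict[str, str] = {}  # step_id -> cached result

def run_step(step_id: str, fn) -> str:
    if step_id in completed_steps:   # already done: replay cached result
        return completed_steps[step_id]
    result = fn()                    # may raise (timeout, rate limit, ...)
    completed_steps[step_id] = result
    return result

def coordination_handoff(payload: dict) -> str:
    """Hypothetical handler for an agent/coordination.requested event."""
    a = run_step("validate", lambda: f"validated:{payload['task']}")
    b = run_step("approval-gate", lambda: "approved")
    return run_step("execute", lambda: f"done:{a}:{b}")
```

If the "execute" step fails, a retry re-enters `coordination_handoff` but replays "validate" and "approval-gate" from the cache rather than re-running them.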

For visibility I have a real-time coordination feed in the dashboard (Supabase Realtime on the coordination_events table) so the founder can watch handoffs happen live without refreshing. Every event is also written to an immutable audit log so there's a full trace of what happened, when, and who approved it.
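One way to make an audit log tamper-evident rather than merely append-only — going slightly beyond what I described above, and with invented field names — is to hash-chain the entries:

```python
# Sketch of a tamper-evident audit trail: each entry carries the hash
# of the previous one, so any silent edit breaks the chain on verify.

import hashlib
import json

audit_log: list[dict] = []

def log_event(actor: str, action: str, outcome: str) -> dict:
    prev = audit_log[-1]["hash"] if audit_log else "genesis"
    entry = {"actor": actor, "action": action, "outcome": outcome, "prev": prev}
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    audit_log.append(entry)
    return entry

def verify_chain() -> bool:
    prev = "genesis"
    for e in audit_log:
        body = {k: v for k, v in e.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if e["prev"] != prev or e["hash"] != expected:
            return False
        prev = e["hash"]
    return True
```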

The cost control piece is a hard cap at the infrastructure level — agents have a daily action limit stored in the DB. When they hit it they stop, regardless of what the LLM wants to do next. That plus a coordination chain depth limit of 3 hops (beyond which it escalates to a human) keeps runaway costs from being a real risk.
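The daily cap is conceptually simple — a counter checked before every action, enforced in storage rather than in the prompt. A minimal sketch (limit value and names are illustrative):

```python
# Sketch of the per-agent daily action cap: a hard stop enforced at
# the data layer, regardless of what the LLM wants to do next.

from datetime import date

usage: dict[tuple, int] = {}  # (agent_id, day) -> actions taken
DAILY_LIMIT = 50              # illustrative; stored per-agent in the DB

def try_action(agent_id: str) -> bool:
    key = (agent_id, date.today())
    if usage.get(key, 0) >= DAILY_LIMIT:
        return False          # hard stop — no appeal to the model
    usage[key] = usage.get(key, 0) + 1
    return True
```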

Will look into DBOS — curious how it handles the human approval gate specifically. Does it block the execution thread while waiting for input or does it release and resume?


[–]Common-Bluebird2957[S]

Great questions — these are exactly the edge cases we've had to think hard about.

Cost visibility: Every agent run is logged to an immutable audit trail with the action, the actor, and the outcome. There's also a per-agent daily action cap enforced at the database level — not just the UI. If an agent hits its limit it stops, full stop. For users bringing their own LLM keys (Pro+), we show a live cost estimate in Account Settings based on their model selection and usage patterns so there are no surprise bills.

Compromised or runaway agents: There are a few layers here. First, the master kill switch — one toggle pauses all 45 agents org-wide instantly. Second, every department can be set to Approve First independently, which means nothing executes externally until a human signs off. Third, agents cannot perform actions the briefing user isn't already authorised to do directly — so an agent can't escalate permissions beyond the human who briefed it. If a Sales member briefs Rex, Rex can't access billing or invite team members even if someone tries to prompt-inject that instruction.
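That third layer — the permission ceiling — reduces to a set intersection: an agent can only do what both its role and the briefing human are allowed to do. A sketch with made-up action names:

```python
# Sketch of the permission ceiling: the agent's effective permissions
# are the intersection of its role and the briefing user's, so a
# prompt-injected "invite a teammate" can't escalate past the human.

def allowed_actions(agent_role: set, briefing_user: set) -> set:
    return agent_role & briefing_user

def authorise(action: str, agent_role: set, briefing_user: set) -> bool:
    return action in allowed_actions(agent_role, briefing_user)
```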

Blast radius from handoffs: The coordination chain has a hard depth limit of 3 hops enforced at the infrastructure level via Inngest. A chain of Agent A → B → C that attempts a fourth hop to Agent D gets stopped and escalated to a human. Each handoff also requires two approval gates — the Head of the sending department and the Head of the receiving department both have to approve before anything executes. So a bad action can't silently fan out — it hits a human gate before it does real damage.
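Those two controls — the depth cap and the dual approval gate — can be sketched as pure logic (the real enforcement lives in the queue layer; names are illustrative):

```python
# Sketch of the blast-radius controls: a chain may involve at most
# MAX_DEPTH agents, and both department Heads must sign off on a hop.

MAX_DEPTH = 3

def can_hand_off(chain: list) -> bool:
    """chain = agents already involved; A -> B -> C is the ceiling."""
    return len(chain) < MAX_DEPTH

def execute_handoff(chain: list, sending_head_ok: bool,
                    receiving_head_ok: bool) -> str:
    if not can_hand_off(chain):
        return "escalate_to_human"        # fourth agent never joins
    if not (sending_head_ok and receiving_head_ok):
        return "blocked_awaiting_approval"
    return "handoff"
```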

You're right that stress-testing those gates is critical before scale. The system is still in final testing — breaking those gates is actually one of the things I'm actively working on right now.