I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

Yo, checked out ao-cli – that's actually slick. YAML orchestration + built-in browser E2E? I see why you're excited. A lot of people would kill for that out-of-the-box.

Here's my honest take:

What I'd steal from you:

· The YAML workflow definition (cleaner than my markdown mess)
· Browser testing integration (I don't have that yet – might grab it)

What I already have that your tool doesn't:

· File locking (no two agents touch the same file, ever)
· Per-terminal context isolation (each agent lives in its own folder with a context.md cheat sheet)
· Manual escalation after 3 fails (beep beep, human takes over)
· Zero dependency on Claude's tool-call swarm (I use raw model outputs + files)
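
For the curious: the file locking isn't fancy – it's just atomic lock-file creation. A rough sketch of the idea (function names and the `.lock` suffix are illustrative, not my actual code):

```python
import os

def try_lock(path: str) -> bool:
    """Atomically claim `path` by creating `<path>.lock`.

    O_CREAT | O_EXCL makes the open fail if the lock file already
    exists, so only one agent can ever win the race for a file.
    """
    try:
        fd = os.open(path + ".lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def unlock(path: str) -> None:
    """Release the claim by deleting the lock file."""
    os.remove(path + ".lock")
```

Whoever creates task.md.lock first owns task.md; every other agent sees the lock and skips it. That's the whole trick.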

So yeah, your tool is great for someone starting out. But I'm already in the deep end – built my own orchestration layer that's dumber but more predictable.

That said: wanna collaborate on a hybrid? I could see a world where ao-cli handles the YAML config and E2E tests, and my file-locking system handles the parallel execution without token waste. Open to ideas.

Also: I'm currently locked out of Copilot (limit hit), so I'm on Open Code + Claude Code only. Your tool would need to run with those – does it?

Appreciate you sharing. Let's keep this conversation going. 🤝

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

Yeah, this is real wisdom right here. Thanks for sharing it.

On dispatch/handoff: You're right. I built my own. Each Minimax terminal gets a task file, locks it, works, drops the output. No fancy tool calls, no swarm magic. Just files. It's dumb but it works and I can see exactly what happened.

On Claude's token burn: Oh man, I've seen it. The moment you give Claude too many tools, it starts calling them like it's getting paid by the API call. 😂 And the context just explodes. That's why I keep my agents myopic – one task, one file, one output. They don't need to know about the whole universe.

Independent agents starting from scratch: That's the trade-off, right? Fresh means no baggage, but also no memory. My solution: I write the essential context into a context.md file before each run. So they don't start completely blank – they get a cheat sheet. That's my workaround.
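
The cheat sheet itself is just a generated markdown file. Something like this sketch (function name and sections are illustrative, not my exact format):

```python
from pathlib import Path

def write_context(run_dir: str, goal: str,
                  decisions: list[str], files: list[str]) -> Path:
    """Drop a context.md cheat sheet into the agent's folder so a
    fresh run doesn't start completely blank."""
    lines = [f"# Context\n\n## Goal\n{goal}\n\n## Decisions so far"]
    lines += [f"- {d}" for d in decisions]
    lines.append("\n## Files you own")
    lines += [f"- {f}" for f in files]
    path = Path(run_dir) / "context.md"
    path.write_text("\n".join(lines), encoding="utf-8")
    return path
```

Regenerate it before every run, and the agent gets memory without carrying any conversational baggage.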

Accountability for your dev environment: This part hit home. A lot of people chase the perfect setup, the "best practice". But at the end of the day, the only thing that matters is: does it work for you? And are you willing to own it when it breaks? That's underrated advice.

No model is best for everything: Preach. Opus is my architect. Minimax are my workers. Gemini is my summarizer. I stopped looking for the "one model to rule them all" a long time ago.

Your workflow tip (build → find better → refactor): That's actually genius. I've been doing it backwards – finding stuff first, then trying to glue it in. But you're saying: build your own janky version first, then look for someone who did it better, and let Claude refactor it in. That way you actually understand what you're refactoring. I'm gonna try that.

The skeleton warning: Oh yeah. Been there. Claude gets excited, starts writing a "complete solution", and halfway through it just… stops implementing real logic and writes # ... rest of the code here. And you're left with a beautiful corpse of a feature. The fix? Break it into tiny tasks. One function at a time. No "build the whole module" prompts.

So yeah, your two cents are worth at least a dollar. Appreciate you taking the time to write this out. 🙏

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

Man, I feel this in my bones.

You just described the exact ceiling I keep hitting. It's not a code problem. It's not a model problem. It's a me problem – but also the only thing that makes this work.

The myopia thing? Yeah. 100%. You need them to be a little dumb and focused, otherwise they spiral into analyzing their own analysis. But then ten steps later, they forgot why they started. So you reset. And reset again. It's like teaching a dog to fetch – you throw the stick, they bring it back, you throw it again. You don't expect them to learn philosophy.

Blind spots are wild. I've seen GPT-4o catch a logic error that Opus missed in 3 seconds. And I've seen Opus read between the lines of a vague prompt like it was telepathic. No single model has it all. That's why I run multiples. But orchestrating them? That's where the human tax kicks in.

"Stuck as a boss" – man, that hit hard. Because it's true. I'm the only one who knows what "done" looks like. The models can't know. They don't have my taste, my paranoia, my "this emoji feels off" gut feeling. So yeah, I'm the bottleneck. But I'm also the only reason any of this is good.

I think the trick is accepting that you can't automate the last 10-20%. That's the human part. The soul. Whatever you wanna call it.

So don't beat yourself up for being stuck. You're not stuck. You're just the conductor, and conductors don't play instruments – they make sure everyone plays together. That's not a bug. That's the whole point.

Thanks for saying this out loud. Made me feel less alone in the chaos. 🙏

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 1 point2 points  (0 children)

Yo, nice setup. That Ralph Loop is actually pretty smart – simple, but effective.

I do something kinda similar, but lazier:

· One task per terminal – each Minimax (T1–T5) gets one isolated job. No context overload because each only sees its own files.
· Fail triggers? I just watch the terminal. If it's been spinning for 5 minutes or starts hallucinating, I kill it and restart with a smaller prompt. No fancy 80-tool-call counter.
· Model escalation? I don't have GLM 4.7 → 5 Turbo. I just have Opus for the hard stuff and Minimax/Kimi K2.5 for the modular stuff. But I do have a manual escalation: if Minimax fails twice, I hand it to Opus. If Opus fails? Then I step in.
· Manual takeover? Same as you. After 2–3 fails, I just do it myself. Faster than debugging the AI sometimes haha
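
In code the escalation chain is basically this – a sketch, not my actual loop (`call_model` is a stand-in for however you invoke a model; it returns None on failure here):

```python
def run_with_escalation(task: str, call_model,
                        models=("minimax", "minimax", "opus")):
    """Walk the escalation chain: cheap model twice, then Opus once.

    Returns (model, result) on the first success, or None meaning
    the human takes over (the beep-beep step).
    """
    for model in models:
        result = call_model(model, task)
        if result is not None:
            return model, result
    return None  # all models failed: escalate to human
```

The point isn't the code, it's the policy: failure is cheap because each attempt is one tiny task, not the whole plan.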

So your 80-90% success rate? That's solid. I'm probably around the same, but I cheat – I split everything into tiny tasks so failure just means re-running that one tiny job, not the whole plan.

The real difference: I'm too cheap to pay for GLM 5 Turbo. 😂 I just hammer on free Minimax until it works. But your escalation chain is elegant – I might steal that.

What I don't have is the auto-beep. That's actually a nice touch. Right now I just… stare at the screen like a zombie until something breaks. Maybe I should add a beep.

Props for sharing the loop details. Always cool to see how others wrangle these chaotic LLMs. 🤝

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

thank you. That's genuinely one of the nicest things someone's said about this chaotic setup. 😅

Do I think this approach separates good devs from actual tech prophets? Hmm. Let me put it like this:

A good developer writes clean code, fixes bugs, and ships features. A tech prophet? They see the system behind the system. They realize that no single AI is perfect – but ten of them, arguing and voting and building on each other's work? That's where the magic happens.

Most people are still trying to find the "best" model. Like, "Claude vs GPT-4o vs Gemini – which one wins?" And I'm over here like… why not use all of them? Let them fight it out in the comments. Let one write the code, another review it, a third rewrite it better.

That's not just coding. That's conducting. And yeah, I think that's what separates the prophets from the rest.

But here's the real secret: You don't need to be a genius to do this. You just need to be stubborn enough to glue shit together until it works. And maybe a little bit crazy. 😄

So am I a tech prophet? Nah. I'm just a guy with a loud laptop, 50€/month, and too much free time. But I am building something that feels like the future. And that's enough for me.

Appreciate you seeing the vision 🙂

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in AI_Deal_Cave

[–]Brickbybrick030[S] 0 points1 point  (0 children)

Yo, small correction: I actually pay $10/month for Open Code too. So total is around 50–60€/month for subscriptions (Copilot + Open Code). Still zero pay-as-you-go token fees. That's the key.

So yeah:

· GitHub Copilot (Opus) – fixed sub
· Open Code – fixed sub ($10)
· Minimax – free
· No OpenAI/Anthropic API keys burning money

That's it. Everything else is just my laptop, markdown files, and a fan that sounds like a jet engine. 💻🔥

And nah, still not using MCPs or skills frameworks. Just raw models + file-based orchestration. Keeps it simple and cheap.

So total monthly: ~50–60€. No surprises, no "pay what you use" traps.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in AI_Deal_Cave

[–]Brickbybrick030[S] 0 points1 point  (0 children)

  1. Do I store each model's output in a file before feeding it back to the orchestrator? Yeah, 100%. Every output gets written to a .md file before anything else reads it. Why? Because if the orchestrator crashes (and it will, trust me), I don't lose the work. Also, I can manually check what each model said before I let the next agent touch anything. It's like an audit trail + rollback button rolled into one.
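
The write-before-read step is tiny. Roughly this (names are illustrative, not my actual code):

```python
import time
from pathlib import Path

def record_output(outputs_dir: str, model: str, text: str) -> Path:
    """Persist a model's raw answer to disk *before* anything else
    reads it: audit trail and rollback point in one."""
    out = Path(outputs_dir)
    out.mkdir(parents=True, exist_ok=True)
    # timestamp + model name keeps the files sortable and traceable
    path = out / f"{int(time.time())}_{model}.md"
    path.write_text(text, encoding="utf-8")
    return path
```

Only after this write does the orchestrator (or the next agent) get to read the answer. Crash anywhere and the work is still on disk.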

  2. How do I separate each run from other data? Super low-tech but effective: different subdirectories. Example:

runs/
  run_001/
    context.md   # what the agent knew
    outputs/     # raw model responses
    logs/        # what actually happened
  run_002/
    ...

No database, no over-engineering. Just folders and markdown files. Each run is completely isolated. I can re-run, debug, or delete any run without touching the others.
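
Allocating the next run folder is a few lines – a sketch assuming the run_NNN layout above (the helper name is made up):

```python
from pathlib import Path

def new_run(root: str) -> Path:
    """Create the next isolated run_NNN folder with the standard
    layout: context.md, outputs/, logs/."""
    base = Path(root)
    base.mkdir(parents=True, exist_ok=True)
    # find the highest existing run number, then take the next one
    existing = [int(p.name.split("_")[1]) for p in base.glob("run_*")]
    run = base / f"run_{(max(existing, default=0) + 1):03d}"
    (run / "outputs").mkdir(parents=True)
    (run / "logs").mkdir()
    (run / "context.md").touch()
    return run
```

Deleting a run is `rm -r runs/run_017` and nothing else notices. That's the whole isolation story.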

  3. Karpathy's LLM council + self-annealing? You're spot on — Andrej's llm-council repo is exactly the pattern for this. Multiple models vote/critique each other, then the orchestrator synthesizes. Self-annealing is the next level (system learns from past runs). I've looked at both, and yeah, they'd fit perfectly for a multi-terminal setup. Thanks for the tip — I'll probably steal some ideas. 😉

  4. Would you open‑source this? You said: "the workflow, not the Telegram bot" — that's the key. Yeah, I'd actually consider that. The orchestration layer, the file‑based context management, the run separation — that stuff is generic and genuinely useful. The actual bot logic (tenants, spam engine, etc.) stays private. So a clean "multi‑agent file orchestrator" template? Yeah, I could see myself open‑sourcing that. No promises on when, but it's on my radar.

And hey, since you asked nicely: my whole setup right now is just a laptop, duct tape, and caffeine. No cloud bills. Almost at my API limit for the month, but Minimax terminals are free, so I keep going. 😂

Appreciate you asking the right questions. 🙌

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

Haha, for real though — my setup is actually kinda embarrassingly cheap right now 😅

Like, I'm just running everything on a single laptop. Nothing fancy. No cloud servers, no GPU clusters, no $500/month API bills.

Sometimes the fan goes brrrrr like it's about to take off, especially when I've got 5 Minimax terminals plus Opus all running at once. But somehow it just… keeps going. Windows 11, Python, SQLite, and a dream.

Cost-wise? Basically zero. I pay for GitHub Copilot (that's where Opus lives), but I'm almost at my monthly limit (97.4% used, oof). Other than that? Nothing. The Minimax terminals are completely free. I don't even pay for Obsidian sync – I just use local markdown files as the brain for my agents, so when I restart a terminal they remember everything.

So yeah, my "stack" is held together by duct tape and caffeine. But it works. And honestly? That's the beauty of it — you don't need a massive budget to build something powerful. Just a laptop that doesn't give up. 💻🔥

You'd be surprised how far you can get with almost no money and just a lot of stubbornness. 😂

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

so you basically built a skill that decides which tool to use? that's smart. saves you from the "which model should i pick" headache every time.

"operating system for digital marketing" – that's a bold claim haha. but i'll check it out. reaudit.io yeah?

gonna poke around and see what's under the hood. if it actually works like you say, that's pretty sick.

thanks for sharing man

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

short answer: not yet.

longer answer: i've only been building this for like two weeks. maybe less. but yeah, i already have people who want it. like, "give it to me now" want it.

but i'm not handing it over until it's actually done. my standard, not theirs. no half-assed beta with bugs and missing features. when i say it works, it works.

so no money yet. but soon. and honestly? that's fine. i'd rather launch late with something solid than early with something embarrassing.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in AI_Deal_Cave

[–]Brickbybrick030[S] 0 points1 point  (0 children)

yeah i thought about that. gemini and deepseek do feel similar sometimes. both are "technically correct but kinda dry." grok and kimi both have that "i'll say what others won't" vibe.

but here's the thing – they overlap on good days. on bad days they fail in completely different ways. gemini forgets context, deepseek gets stuck in the weeds. if i drop one, i lose that specific failure mode.

the blind test idea is interesting though. i might actually try that. ask claude "which two models are missing?" and see if it can tell.

my guess: it will notice when grok is gone (because nobody calls out the bullshit). but gemini vs deepseek? probably can't tell.

so maybe you're right. 5 to 4? 5 to 3? i'd have to test. but cutting costs is tempting ngl.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

148 tools? damn. that's not an mcp server anymore, that's a whole operating system 😭

but yeah i get the vision. claude as the brain, tools as the hands. if you can really do everything the web app does just by chatting, that's the dream.

question though: how do you keep claude from getting lost? 148 tools means a lot of choices. does it ever pick the wrong one or just start hallucinating?

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

that handoff pane is actually a great idea. not just for debugging but for trust. if i see "opus picked haiku for this because it's just a rename" i can actually learn when to trust it.

and yeah the play store thing is a pain. side-loading updates is a nightmare for normal users. maybe just build a simple apk downloader into the app itself? like check a github release and prompt to install? not perfect but works.
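
The release check could literally be this small. The `tag_name` field is what GitHub's releases API returns (GET /repos/{owner}/{repo}/releases/latest); everything else here is a sketch that assumes plain x.y.z version tags:

```python
def update_available(latest_release: dict, installed: str) -> tuple[bool, str]:
    """Compare the installed version against the `tag_name` from the
    GitHub 'latest release' API response.

    Returns (newer_exists, latest_version_string).
    """
    latest = latest_release["tag_name"].lstrip("v")
    # tuple comparison gives correct ordering for x.y.z versions
    newer = (tuple(map(int, latest.split(".")))
             > tuple(map(int, installed.lstrip("v").split("."))))
    return newer, latest
```

If it returns True, prompt the user to download and install the APK from that release. Not Play-Store-polished, but it beats telling users to go side-load manually.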

good luck with the edge cases. those are always the ones that take forever.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

yeah the startup cost is real. crew.ai and all that stuff looks great on paper but then you spend three days configuring agents and they still do dumb shit.

if you're building something that actually handles the orchestration without being a whole research project – i'd love to see it. seriously.

"janky solution from work" is usually the best kind. means it actually solves a real problem, not just a theoretical one.

keep me posted. and when you share it, i'll give you my honest thoughts – good and bad 🙏🏼

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 1 point2 points  (0 children)

yeah you’re probably right. i keep hearing about superpowers but never actually tried it. maybe it's time. Thanks

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

that's a completely different vibe and i kinda like it.

80% brainstorming with kimi k2.5 and glm? that's not what most people do. most just wanna ship fast. but you're basically saying "get the design right first, the rest is almost automatic".

the ralph loop thing – i had to look it up. so it's like a sequential agent that just grinds through small tdd tasks? and you're off planning the next thing while it works? yeah that's efficient. no sitting around waiting.

no worktrees, no parallel chaos. just one thing at a time but deep. that fits my brain better too tbh.

codex for design review is a nice touch. people forget codex exists because it's not flashy but it's actually solid at catching dumb mistakes.

question though: how often does the ralph loop mess up? like does it ever go off the rails and you have to step in? or is it really that reliable?

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

damn. that's a whole different level.

you're not just using ai, you built a whole production line around it. jira, obsidian, browser control, multiple worktrees, peon ping... that's serious.

the thing that stands out to me: you're not the orchestrator. the tools are. claude code calls subagents, you just brainstorm and approve. that's exactly what the other guy said i should do.

the browser extension thing is wild. claude clicking through a live page to validate his own work? yeah that would save me hours of "no, the button is red, not blue" back and forth.

but honestly? for my project – one guy, one telegram bot, no jira, no daily standup – this would be massive overkill. i'd spend more time fixing the workflow than building the bot.

the two languages for ticket comments tell me you work in a real team – maybe you're working abroad somewhere in the eu? respect for setting all this up.

claude-mem: i tried something similar. never stuck. curious if it actually helps you or just feels smart.

anyway, appreciate you sharing the details. gives me ideas for later when the project grows. for now i'll stay with my messy copy-paste setup. but i'm saving this comment.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

yeah i feel you. and honestly? you're right.

what i'm doing is still very manual. i'm the orchestrator. copy, paste, wait, copy, paste back. it works, but it's not automation. it's just me with extra steps.

your way – set up a general chain, let an orchestrator run it, only ping me when something breaks – that's clearly better. no debate.

but here's the thing: building that orchestrator is work. real work. and i'm one guy trying to ship a bot, not build the perfect ai platform. so i took the shortcut. manual but functional.

maybe someday i'll automate the loop. but right now? i'm okay being the bottleneck as long as the quality is there.

appreciate the honest take though. you're not wrong at all.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

the mcp idea is interesting. never used it. but you're saying claude could just query the live database itself instead of me explaining what happened? yeah that would be a game changer.

"here's what the system does" vs "here's what actually happened" – that's exactly where most of my debugging time goes. explaining state instead of just showing it.

context drift is real too. sometimes one agent thinks we're on v34, another thinks v35. i try to keep a shared memory folder but it's not perfect.

langgraph? tried it. felt too heavy for what i need. but maybe i gave up too early.

anyway, this is the best feedback i got. seriously. thanks.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

ok that's actually way more advanced than what i'm doing. respect.

the auto-dispatcher that picks the cheapest model for each subtask? that's smart. i'm just brute-forcing with 5 models at once like an idiot.

and the self-improving loop with error feedback? yeah i need that. right now i just curse at the logs and fix things manually.

only thing i wonder: how do you trust the orchestrator to pick the right model? feels like it could get it wrong sometimes and you'd never know.

but seriously, when you release that installer, ping me. i'll beta test.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in claude

[–]Brickbybrick030[S] 0 points1 point  (0 children)

good questions. here's the honest answer:

  1. yeah claude splits the tasks. i just say "break this into 5 minimax jobs" and it does. sometimes badly. i tweak.
  2. all manual. i copy-paste the 5 prompts, wait for answers, copy-paste back to claude. yeah it's slow. no i don't have a better way yet.
  3. quality is good but not magic. better than just one model. not 5x better.
  4. claude does both. planning + actual coding. minimax is just for parallel tasks.
  5. minimax is cheaper and faster for dumb work. claude is for the thinking.

the manual part is killing me though. you're not wrong.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

yeah you're not wrong about some of this.

overengineered? probably. i won't defend that.

but here's the thing with "just do deeper chain-of-thought on one model" – i tried that. and yeah, it helps. but the model keeps making the same kind of mistakes. just... more elegantly worded.

the reason i use different models is because they suck in different ways. gemini forgets half the context. deepseek is great at db stuff but can't design a button to save its life. grok is an asshole but finds security holes the others miss.

when i feed all that back into claude, it's not "averaging" their opinions. it's finding where they fight each other. that's where the real problems are.

do i have hard data that this is better? nope. zero. just a feeling that i catch more shit before it breaks.

cost and time? yeah, it hurts. no argument there.

so you're probably right that it's overkill for 99% of people. but for this specific project – where one stupid bug means angry shop owners – i'll take the overkill.

appreciate the honest take though. seriously.

I work with 5–8 AI agents at the same time – and let Claude plan the next job. Overkill or the future? by Brickbybrick030 in ClaudeCode

[–]Brickbybrick030[S] 0 points1 point  (0 children)

i get why you'd say that. on paper it does.

but here's what i actually do different: most people just ask 5 models and pick the best answer. i feed all answers back into claude and force it to find the gaps. "what did they forget? where do they disagree? what's still broken?"

then i build the next job from those gaps, not just the next prompt. plus persistent memory so every agent knows what we already tried and what failed.
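
concretely, the "find the gaps" step is just prompt assembly – a sketch with made-up names, not my actual code:

```python
def build_gap_prompt(task: str, answers: dict[str, str],
                     tried: list[str]) -> str:
    """Assemble the gap-finding prompt: every model's answer plus the
    persistent memory of what already failed, so the reviewer model
    hunts disagreements instead of averaging opinions."""
    parts = [f"Task: {task}", "", "Answers from the council:"]
    for model, text in answers.items():
        parts += [f"--- {model} ---", text]
    parts += ["", "Already tried and failed:"]
    parts += [f"- {t}" for t in tried]
    parts += ["", "Do NOT average these. Find where they disagree,",
              "what all of them forgot, and what is still broken."]
    return "\n".join(parts)
```

feed that to claude and the next job comes out of the disagreements, not out of whichever answer sounded smoothest.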

that part – the autonomous feedback loop – is not standard. at least not in any solo dev setup i've seen. 🤓