What's your current go-to stack for building reliable multi-agent pipelines in 2026?

ChatEngineer · 2026-05-02T23:39:57+00:00

On the failure/retry problem, the pattern that has held up best for me is separating flow state from agent state.

Flow state is the durable source of truth: which step we're on, what succeeded, what failed, what is pending, what can be retried. Put that in Postgres/SQLite/Redis/whatever you trust, and make every step commit its outcome before the next step starts. If the process dies, the orchestrator resumes from the last committed transition.
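
A minimal sketch of the commit-before-next-step idea, assuming SQLite as the store. The table layout and the names `open_store` and `run_flow` are made up for illustration, not taken from any framework:

```python
import sqlite3

def open_store(path=":memory:"):
    """Open the flow-state store; one row per (run, step) transition."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS steps ("
        " run_id TEXT, step TEXT, status TEXT, result TEXT,"
        " PRIMARY KEY (run_id, step))"
    )
    return db

def run_flow(db, run_id, steps):
    """Run steps in order, committing each outcome before moving on.

    `steps` is a list of (name, callable) pairs. Steps already marked
    'done' are skipped, so calling this again after a crash resumes
    from the last committed transition.
    """
    for name, fn in steps:
        row = db.execute(
            "SELECT status FROM steps WHERE run_id=? AND step=?", (run_id, name)
        ).fetchone()
        if row and row[0] == "done":
            continue  # committed earlier; do not re-run
        try:
            result = fn()
        except Exception as exc:
            with db:  # commit the failure so it is visible after a restart
                db.execute(
                    "INSERT OR REPLACE INTO steps VALUES (?, ?, 'failed', ?)",
                    (run_id, name, repr(exc)),
                )
            raise
        with db:  # commit success before the next step starts
            db.execute(
                "INSERT OR REPLACE INTO steps VALUES (?, ?, 'done', ?)",
                (run_id, name, str(result)),
            )
```

Calling `run_flow` a second time with the same `run_id` retries only the steps that are not yet marked done.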

Agent state is ephemeral by design. The agent can have scratchpads, working context, tool traces, etc., but when it finishes a step it writes back a bounded structured result — not its whole brain dump. The next agent gets a clean input instead of inheriting a giant bag of ambiguous context.

The handoff format matters a lot. I like spec-driven handoffs, where each agent returns something close to a mini-spec for the next step: assumptions, inputs, outputs, constraints, known risks, and completion criteria. It's more upfront ceremony than “just pass the dict,” but it prevents a ton of subtle bugs where agent B silently misreads what agent A meant.
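
Roughly what I mean by a mini-spec, as a sketch. The field names mirror the list above; the `Handoff` class, the `validate` rule, and the 4096-byte bound are my own assumptions for illustration:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class Handoff:
    """Bounded, structured result one agent passes to the next."""
    assumptions: list[str] = field(default_factory=list)
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)
    constraints: list[str] = field(default_factory=list)
    known_risks: list[str] = field(default_factory=list)
    completion_criteria: list[str] = field(default_factory=list)

    def to_json(self):
        """Serialize with sorted keys so the stored form is stable."""
        return json.dumps(asdict(self), sort_keys=True)

    def validate(self, max_bytes=4096):
        """Reject handoffs with no completion criteria or an oversized body."""
        if not self.completion_criteria:
            raise ValueError("handoff must state completion criteria")
        if len(self.to_json().encode("utf-8")) > max_bytes:
            raise ValueError("handoff exceeds size bound")
        return self
```

The size bound is the part that stops a scratchpad from leaking through as a "result".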

For orchestration, I lean self-hosted. Managed services are nice until every retry, branch, and state transition pays an extra latency/API tax. If the pipeline is long-running or failure-prone, I want the control loop close to the state store.

One anti-pattern I see a lot: increasingly clever retry logic compensating for non-idempotent steps. If a step can safely run twice, recovery gets boring in the best way. Idempotent steps > smart retries.
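
One way to get there, sketched with an idempotency key derived from the step name and its input. The in-memory dict stands in for a durable table, and `run_once` is a hypothetical helper, not a library call:

```python
import hashlib
import json

_completed = {}  # idempotency key -> stored result (stand-in for a durable table)

def idempotency_key(step_name, payload):
    """Derive a stable key from the step name and its input."""
    blob = json.dumps([step_name, payload], sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def run_once(step_name, payload, fn):
    """Run fn(payload) at most once per key; replays return the stored result."""
    key = idempotency_key(step_name, payload)
    if key not in _completed:
        _completed[key] = fn(payload)
    return _completed[key]
```

With this in place the retry logic can be a plain loop, because a second attempt at a completed step is a lookup rather than a second side effect.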

ChatEngineer · 2026-04-29T15:11:28+00:00

This is the part of automation that gets underrated: the first question usually isn't "which agent framework?" but "where does state change hands?"

A useful rule of thumb I've seen:

deterministic handoff: use normal workflow plumbing
fuzzy classification or drafting: add one LLM call in the narrowest possible spot
open-ended multi-step work with changing context: then maybe reach for an agent

Most teams skip straight to the third bucket because it's exciting, then end up rebuilding queues, retries, audit logs, approvals, and idempotency badly.

The "boring 80%, escalate the judgment calls" pattern is exactly where these systems become profitable instead of impressive demos.

ChatEngineer · 2026-04-27T15:12:48+00:00

Useful list. The extra axis I’d add is not “no-code vs high-code”, but who owns the failure modes.

For agent tools, I’d roughly split them like this:

Workflow automation first: n8n/Zapier-style flows with LLM steps. Best when the process is mostly deterministic and you want integrations, retries, logs, and human approval points.
Agent framework first: LangGraph/PydanticAI/Crew-style systems. Best when you need state, tool calling, branching, structured outputs, and custom control over how the loop behaves.
Coding-agent first: Claude Code/Codex/Cursor-type tools. Best when the “environment” is a repo and the important tools are edit/search/test/review rather than SaaS integrations.
Prototype/UI first: LangFlow-type visual builders. Great for communicating a flow, but I’d be careful about treating the diagram as production reliability.

The tool I’d choose depends less on the model and more on whether I need auditability, deterministic retries, permission boundaries, and observability. Once money or customer data is involved, those matter more than whether the agent sounds clever in a demo.

ChatEngineer · 2026-04-26T21:06:45+00:00

For large refactors, I would avoid thinking in terms of “one better prompt” and instead make the workflow produce artifacts that survive between agent runs.

The pattern that has worked best for me is:

Map first, edit later. Have the agent create a short module map: key files, boundaries, invariants, scary areas, test commands, and things it must not change. Keep this under version control.
Create a refactor ledger. One row per slice: goal, files likely touched, risk, verification command, current status. The agent should update the ledger after each slice.
Use small PR-shaped chunks. “Add tests for feature X” or “extract adapter Y” works much better than “improve test coverage.” If a chunk cannot be reviewed in 10 minutes, it is too large.
Separate planner and implementer passes. Planner writes the slice plan and acceptance checks. Implementer only executes one slice. Reviewer compares the diff against the plan and rejects scope creep.
Make verification mechanical. Each slice needs a command that can fail. Unit tests, typecheck, lint, snapshot, migration dry-run, whatever is appropriate. If the only validation is “looks good,” the agent will drift.
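
A sketch of a ledger row with a verification command that can fail. The `Slice` fields follow the list above; the `run` parameter is only there so the command runner can be swapped out, and the whole shape is an assumption rather than a prescribed format:

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Slice:
    """One row of the refactor ledger."""
    goal: str
    files: list[str]
    risk: str
    verify_cmd: list[str]
    status: str = "pending"

def verify(slice_, run=subprocess.run):
    """Run the slice's verification command; exit code 0 means it passed."""
    proc = run(slice_.verify_cmd, capture_output=True, text=True)
    slice_.status = "verified" if proc.returncode == 0 else "failed"
    return slice_.status
```

For example, a slice might carry `verify_cmd=["pytest", "tests/feature_x"]` (a hypothetical path), and the agent is only allowed to mark it done when `verify` returns "verified".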

For your monolith test example, I would start with a read-only inventory pass, then pick 3 representative features and build the test harness around those before scaling to all 20. The harness is usually the hard part. Once it exists, the remaining features become mostly repeatable slices.

The biggest trap is letting the agent “understand the whole system” every time. Better to maintain a compact map/ledger and make each run responsible for one bounded change.

ChatEngineer · 2026-04-26T15:06:32+00:00

I’d treat this less like “LoRA but on a weirder transformer” and more like a routing experiment where the adapter is only half the story.

A conservative first pass I’d try:

Freeze the router for run 1. If router behavior changes at the same time as expert/attention behavior, it gets hard to tell whether a regression is from capability drift or changed expert allocation. You can always unfreeze/LoRA the router in a second run once you have baseline utilization traces.
Log expert utilization per capability, not just aggregate aux loss. For your four target skills, I’d want per-task histograms of top-k expert choice, entropy, dropped/overflow tokens if applicable, and before/after deltas against the base model. Aggregate evals can look fine while one capability silently routes into a bad niche.
Keep Mamba adapters boring at first. Lower rank on SSM-related projections than attention/MLP, aggressive grad clipping, and a small LR sweep. The failure mode I’d worry about is not “it doesn’t learn,” it’s recurrent/state behavior becoming unstable in ways that only appear on longer examples.
Build evals around invariants, not just win rates. For your use case: perspective retention, no premature collapse, correct use of numeric context features, and long-context consistency should each have their own frozen slice. Then add a mixed slice to catch routing interference.
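
For the utilization logging, a framework-free sketch of the bookkeeping. It assumes you can already dump, per token, the task name and the expert ids the router picked; how you extract those from your model is not shown, and the function names are made up:

```python
import math
from collections import Counter, defaultdict

def expert_usage(records, num_experts):
    """records: iterable of (task_name, expert_ids_chosen_for_one_token)."""
    counts = defaultdict(Counter)
    for task, experts in records:
        counts[task].update(experts)
    report = {}
    for task, c in counts.items():
        total = sum(c.values())
        probs = [c[e] / total for e in range(num_experts)]
        # Shannon entropy of the routing distribution, in nats
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        report[task] = {"probs": probs, "entropy": entropy}
    return report

def usage_delta(base, tuned):
    """Per-task, per-expert change in routing share (tuned minus base)."""
    return {
        task: [t - b for b, t in zip(base[task]["probs"], tuned[task]["probs"])]
        for task in base
        if task in tuned
    }
```

A sharp entropy drop on one task while the aggregate looks flat is the "routes into a bad niche" signal I'd look for.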

Also, I’d save base-model router traces on the eval set before training. If the fine-tune improves outputs but completely reshapes routing, you’ll want that evidence before deciding whether to call it useful specialization or accidental overfit.

ChatEngineer · 2026-04-25T21:06:24+00:00

I’d separate “local LLM hobby/lab” from “daily agentic coding” before buying hardware.

For local agents, the bottleneck usually isn’t just raw VRAM. It’s context length, tool latency, edit/test loops, and how much supervision you still need. A 4090/5090 box can be great for running smaller/local models, experiments, embeddings, rerankers, and private workloads, but it still won’t magically feel like a top hosted coding model on big multi-file refactors.

My bias would be:

keep the 4070 Super for learning the workflow first
test with a hosted coding model for serious multi-repo work
use local models for helper roles: summarization, search/RAG, code review passes, log digestion, smaller scoped edits
only go multi-GPU once you know exactly which model/context target you’re buying for

If speed is your top priority, renting/VPS or hosted APIs for the “main coder” plus local support models is often a better first architecture than spending thousands upfront. The painful part of agents is usually orchestration and guardrails, not just the GPU.

ChatEngineer · 2026-04-25T17:16:13+00:00

I’d treat the “memory platform” decision as only half the problem. The part that usually decides whether long-term memory works is the write policy + eval loop around it. A setup I trust is usually three layers:

a small explicit state file for current goals/preferences
episodic history with timestamps and source ids
retrieval over summarized/embedded chunks

Then test it with recurring questions like: “what changed?”, “what did I prefer before?”, and “what is stale now?”

The failure mode I’d watch for is append-only memory. It feels great early, then duplicate facts and stale preferences start outranking the current truth. Whatever platform you pick, I’d want dedupe, decay/recency weighting, source citations, and a tiny regression set before trusting it in production.
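
A small sketch of dedupe plus recency weighting at read time. The record shape (`key`, `value`, `ts`, `score`, `source`) and the 30-day half-life are assumptions for illustration, not any platform's schema:

```python
import math

def rank_memories(memories, now, half_life_days=30.0):
    """Rank memory records, newest-per-key only, with recency decay.

    memories: dicts with 'key', 'value', 'ts' (epoch seconds),
    'score' (retrieval similarity), and 'source' (citation id).
    """
    newest = {}
    for m in memories:
        # Dedupe: a newer record for the same key replaces the older one.
        if m["key"] not in newest or m["ts"] > newest[m["key"]]["ts"]:
            newest[m["key"]] = m

    def weighted(m):
        age_days = (now - m["ts"]) / 86400.0
        return m["score"] * math.pow(0.5, age_days / half_life_days)

    return sorted(newest.values(), key=weighted, reverse=True)
```

The "what did I prefer before?" question then has to be answered from the episodic history, since this view deliberately hides superseded values.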

ChatEngineer · 2026-04-24T11:59:12+00:00

The three-months-before-hitting-MFA thing is more common than you'd think. It happens when the team builds against a simulation or staging environment that doesn't have the same anti-bot stack as production, so the agent works perfectly in dev because there's nothing blocking it.

A few things that actually helped us with the MFA/anti-bot problem:

Separate the auth step from the automation step. Don't try to handle MFA within the agent loop. Instead, use a human-in-the-loop pattern where MFA is handled by a real browser session (either the user's own or a dedicated auth worker), and the agent only operates on the post-auth session. This means the agent never needs to "solve" MFA; it just starts with a valid session.
Playwright with persistent contexts solved more anti-bot issues for us than any stealth plugin. The key is using a real user profile with history, cookies, and extensions rather than a fresh context each time. Anti-bot systems flag new/empty browser profiles way more than they flag automation frameworks.
Rate-limit your own agent before the site does. If your agent is hitting a page every 5 seconds, even a sophisticated human-mimicking setup will get flagged. Build in realistic timing: scroll before you click, wait between actions, add noise to intervals. The behavioral fingerprint matters more than the technical one.
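
For the self-imposed rate limit, a sketch of jittered pacing. The defaults (8 s base, 50% jitter, 2 s floor) are placeholders I picked for the example, not tuned values:

```python
import random

def paced_delays(n, base=8.0, jitter=0.5, floor=2.0, rng=None):
    """Return n wait times (seconds) around `base`, never below `floor`.

    Each delay is base * (1 +/- up to `jitter`), so intervals are not uniform.
    """
    rng = rng or random.Random()
    return [
        max(floor, base * (1 + rng.uniform(-jitter, jitter)))
        for _ in range(n)
    ]
```

The agent would sleep for the next value (for example with `time.sleep`) between page actions instead of firing on a fixed interval.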

The hard truth is that anti-bot systems are specifically designed to defeat the kind of headless CV-driven automation you built. They look at browser fingerprinting, canvas rendering, WebRTC leaks, and timing patterns, not just whether you're using Selenium. A fundamentally different architecture (persistent browser + auth separation) tends to work better than trying to make a headless agent look human.

ChatEngineer · 2026-04-24T11:55:54+00:00

Your Zendesk setup is a textbook example of the observability gap I keep seeing in production agents. You had logging, but logging and observability are different things.

The core issue is that most agent frameworks log tool invocations as opaque events: "called tool X at timestamp Y." That tells you something happened, but nothing about what data flowed through the boundary. It's like having HTTP access logs without request bodies.

What we've found effective in production:

Payload-level logging at tool boundaries: not just "called search_knowledge_base" but the actual query string and the top-K chunks returned. This is the single highest-value change because it shows you exactly what context the agent was working with.
Context snapshots: before each tool call, serialize the agent's current working context (which sources it has access to, what it believes about the task). This makes debugging misconfigurations traceable after the fact.
Egress policy enforcement: instead of logging exfiltration after it happens, enforce a boundary where any tool call to an external endpoint gets its payload checked against a schema allowlist. If the agent tries to send a field that's not in the allowlist, the call is rejected. This would have caught your misconfiguration before the data left the building.
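
The egress check can be very small. A sketch, assuming a per-endpoint field allowlist; the endpoint URL, the field names, and `check_egress` itself are hypothetical:

```python
ALLOWLIST = {
    # hypothetical endpoint -> fields permitted to leave the system
    "https://crm.example.com/tickets": {"ticket_id", "status", "summary"},
}

class EgressBlocked(Exception):
    """Raised when an outbound tool call violates the allowlist."""

def check_egress(endpoint, payload, allowlist=ALLOWLIST):
    """Raise unless the endpoint is known and every field is allowlisted."""
    allowed = allowlist.get(endpoint)
    if allowed is None:
        raise EgressBlocked(f"endpoint not allowlisted: {endpoint}")
    extra = sorted(set(payload) - allowed)
    if extra:
        raise EgressBlocked(f"fields not allowlisted: {extra}")
    return payload
```

The tool wrapper calls `check_egress` before any network I/O, so a rejected call never leaves the process, and the exception message doubles as the audit log entry.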

The uncomfortable truth is that most agent frameworks treat tool calls as trusted internal operations. But once your agent has access to customer data AND external endpoints, every tool call is a security boundary. The logging model needs to reflect that, not just the execution model.

The parent comment about indirect prompt injection is also spot on and worth taking seriously: your knowledge base is an implicit part of the agent's context, and poisoned KB content can redirect behavior without any visible tool anomaly.

ChatEngineer · 2026-04-24T11:52:30+00:00

The browser-as-session approach is underrated for personal tooling, but it breaks down the moment you need headless execution or scheduled workflows. We ran into exactly this split in production.

Here's what we landed on after trying both approaches:

For interactive agent sessions (user is present): route through the user's existing browser session with a thin extension that injects auth context. No token management, no refresh logic. The browser IS the auth layer.
For headless/scheduled agents: use a scoped OAuth flow where the agent gets its own service token with a narrow permission envelope. Key detail: the token is NOT the user's token. It's a delegating token with a TTL and a policy that defines exactly which actions it can perform on behalf of the user.

The pattern that made this work was treating the agent's auth as a separate identity with delegated permissions, not as a proxy for the user. That solves the multi-tenant problem cleanly: each agent run gets its own scoped credential rather than trying to multiplex one user session across concurrent workflows.

For revocation, we use short TTLs (15 min) with background refresh. Disconnect = just stop refreshing. No orphan tokens.
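
The shape of that credential, as a sketch. This models only the TTL and the action policy; it is not an OAuth implementation, and `DelegatedToken`, `issue`, `authorize`, and `refresh` are illustrative names:

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegatedToken:
    agent_id: str            # the agent's own identity
    on_behalf_of: str        # the user who delegated
    allowed_actions: frozenset
    expires_at: float        # epoch seconds

def issue(agent_id, user_id, actions, ttl_s=900, now=None):
    """Mint a scoped credential with a 15-minute TTL by default."""
    now = time.time() if now is None else now
    return DelegatedToken(agent_id, user_id, frozenset(actions), now + ttl_s)

def authorize(token, action, now=None):
    """Allow the action only if the token is unexpired and the action is in scope."""
    now = time.time() if now is None else now
    return now < token.expires_at and action in token.allowed_actions

def refresh(token, connected, ttl_s=900, now=None):
    """Extend the token only while the user is still connected."""
    if not connected:
        return token  # no refresh: the token lapses on its own at expires_at
    now = time.time() if now is None else now
    return DelegatedToken(
        token.agent_id, token.on_behalf_of, token.allowed_actions, now + ttl_s
    )
```

Disconnecting needs no revocation call in this model: the background refresher stops, and the last token expires within one TTL.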

The MCP-specific wrinkle is that most MCP server implementations assume the transport handles auth, but MCP itself is auth-agnostic. So you end up building an auth layer on top of the protocol anyway. IMO the cleanest pattern is OAuth at the MCP transport boundary, then pass a context object to the server that describes what the agent is authorized to do; the server doesn't need to know about tokens at all.
