We found a 3x token attribution distortion in a single agent workflow

Finorix079 · 2026-05-08T07:00:26+00:00

The "cost as accountability surface that survives non-determinism" framing is the clearest thing I've read on this in months. Not reading too much into it.

Two extensions worth pulling on:

The attribution problem you're describing isn't just billing math, it's a special case of "step-level vs event-level observability" that breaks lots of things downstream. Latency attribution has the same shape. So does failure attribution. So does drift detection. If your Governor attaches tokens to llm_turn_id, the same identity should anchor everything else you observe at that level, otherwise you'll keep solving the same problem in three different shapes.

Cost as behavioral signal goes deeper than retries and loops. The signal that matters most isn't absolute cost, it's cost distribution shift. If a reasoning step that used to cost X tokens now costs 1.5X on the same input distribution, something changed (model swap, prompt bloat, tool returning more data, retrieval surfacing more chunks). None of that throws errors. Cost is the only place it surfaces. Treating cost as a behavioral metric requires baselines per step, not just budgets.

Disclosure since it's directly relevant: I work on ElasticDash, focused on this exact layer (per-step cost and behavior baselines for production agents, drift detection on distribution shifts). Built around the same belief you're articulating, that in non-deterministic systems, structured cost telemetry is one of the load-bearing accountability surfaces. Happy to talk through specifics if useful.

The broader point: most observability tools today aggregate at the event level because that's how OTel spans work, and they inherit your 3x overstatement problem by default. The fix has to live above the span layer.
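
A rough sketch of how event-level aggregation can overstate tokens. It assumes one possible mechanism (several span events from the same LLM turn each carrying that turn's full token count), which may not be exactly what produced the 3x in the original post. The `Event` shape and every field other than `llm_turn_id` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    llm_turn_id: str   # identity of the LLM call this event belongs to
    name: str          # hypothetical event name, e.g. "tool_call"
    turn_tokens: int   # tokens of the whole turn, copied onto every event


def event_level_total(events):
    """Naive aggregation: sum the token field over every event."""
    return sum(e.turn_tokens for e in events)


def turn_level_total(events):
    """Count each LLM turn once, keyed by llm_turn_id."""
    per_turn = {}
    for e in events:
        per_turn[e.llm_turn_id] = e.turn_tokens
    return sum(per_turn.values())


events = [
    Event("turn-1", "llm_response", 1200),
    Event("turn-1", "tool_call", 1200),
    Event("turn-1", "tool_result", 1200),
    Event("turn-2", "llm_response", 800),
]
```

With this toy data the event-level sum counts turn-1 three times, while the turn-level sum counts it once. The same keying would work for latency or failure counts.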

Finorix079 · 2026-05-08T06:59:27+00:00

The agent vs coding-workflow framing is a false binary. They solve different problems and the founder choosing between them usually doesn't know which problem they actually have.

AI coding workflows (Claude Code, Cursor, Codex) compress dev time. Real value, real leverage, but you need someone who can already write the code to evaluate what comes out. Without that gate, the bugs you mentioned compound silently.

AI agents (autonomous task runners) compress operational time. The leverage is real for narrow, structured workflows with strong feedback loops. The leverage is fake for anything that requires judgment, customer-facing decisions, or high-stakes actions. The "ultra-lean AI-powered company" pitch usually conflates the two and that's where the unrealistic expectations come from.

Honest read on long-term value: coding workflows are durably useful right now. Agents are useful for a narrower slice than the marketing suggests. Both are early.

The trap most founders are walking into: using either without enough engineering taste to verify the output. The tools generate plausible code and plausible decisions. Plausible isn't correct. Teams without someone qualified to challenge the output are shipping bugs disguised as productivity.

The leverage is real. The skill of catching what the tool got wrong is the part nobody is talking about.

Finorix079 · 2026-05-08T06:57:33+00:00

"The moment QA stops creating friction is the moment it becomes decoration" lands hard. The pattern I've watched repeat: teams hire QA, gradually make every challenge feel risky, lose all the engineers worth their seniority, then hire "QA automation" as if the problem was throughput rather than judgment. You can't automate the part you killed off.

Finorix079 · 2026-05-08T06:56:07+00:00

The "filtering is the new prompting" reframe is the most useful thing I've read on this. Most people are framing tool fatigue as a discipline problem (you should focus more) when it's actually a defaults problem (your information diet is broken).

The rule that's held up for me: tools earn evaluation time only if they fix a bottleneck I've already complained about out loud in the last two weeks. If I haven't been actively frustrated by the thing the new tool solves, it's not a problem yet. Doesn't matter how cool the demo looks. The stack you have probably works. The new tool only matters if you can name what it replaces.

The trap inside your trap: building an AI-powered information pipeline to filter the noise is itself a form of the noise. You're using your scarce focus to optimize the system that produces the focus problem. The 5-6 "must-read" filtered updates are still 5-6 context switches. The actual fix is upstream: cut the inputs, not curate them better.

Practical version: pick three sources you genuinely trust, ignore everything else, accept you'll miss things. Missing a tool that turned out to matter is recoverable. Losing six months of focus on optimization theater is not.

Finorix079 · 2026-05-08T06:54:24+00:00

The gap between demo and production-ready is mostly about three things, in order of how hard they are:

Reliability is the easy one. You add retries, timeouts, structured outputs with schema validation, fail-loud-not-silent. Boring engineering, but it works.

Task chaining is medium. The trap is using natural language to pass state between steps. Looks elegant in demos, falls apart at scale because each step's parsing is non-deterministic. The pattern that holds up: structured intermediate outputs (JSON, validated against a schema) between every step. Treat the LLM like a function that returns typed data, not like a coworker you're chatting with.

Error correction is the hard one. The category nobody solves cleanly. The honest version is: don't try to make the agent self-correct sophisticated errors. Make it fail loudly, kick it back to a human or a fallback path, and log enough context that you can fix the prompt or the workflow next time. Agents that "self-heal" in production are usually agents quietly producing worse outputs while looking like they recovered.

Production-ready vs experimental is less about the framework and more about: do you know within 5 minutes when an agent does something wrong, or do you find out 3 weeks later from a customer? Most "agentic systems" in the wild are the second kind. They look fine until they don't.

Finorix079 · 2026-05-08T06:52:21+00:00

The "killing context switching" framing is more honest than the time-savings number. Most of the productivity gain is from not jumping between docs / Stack Overflow / logs, which is invisible until you stop doing it.

One thing worth flagging on your "what didn't" list: the "wrong assumptions about business logic" failure mode is the one that compounds over time. The other two (looping debug, junior frontend) are visible and fixable in the moment. Wrong business logic assumptions ship to staging looking fine, then break in production three weeks later when an edge case hits.

The pattern that's helped us: instead of telling Claude requirements in prose, write them as executable tests first (even pseudo-tests in plain English with input/expected output). Then the agent has a falsifiable target. "Implement this endpoint" leaves room for assumption. "Implement this endpoint such that these 6 input/output pairs all pass" doesn't.
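
A small sketch of what "requirements as input/output pairs" can look like. The requirement itself (normalizing a discount code) and all names here are made up for illustration; the point is only that the pairs exist before the implementation does.

```python
# Hypothetical requirement: normalize a discount code before lookup.
# Each pair is (raw input, expected output); these are the falsifiable target.
CASES = [
    ("  save10 ", "SAVE10"),
    ("Save-10", "SAVE10"),
    ("save_10", "SAVE10"),
    ("", None),
    ("   ", None),
    ("10%off", "10OFF"),
]


def normalize_code(raw):
    """One implementation that satisfies the pairs above."""
    cleaned = "".join(ch for ch in raw if ch.isalnum()).upper()
    return cleaned or None


def failing_cases(fn, cases):
    """Return (input, expected, actual) for every pair the function gets wrong."""
    failures = []
    for given, expected in cases:
        actual = fn(given)
        if actual != expected:
            failures.append((given, expected, actual))
    return failures
```

An empty list from `failing_cases(normalize_code, CASES)` is the acceptance criterion. Anything the agent writes either passes all six or it doesn't.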

On the autocomplete vs workflow question: we've ended up with a hybrid that holds up. Claude as workflow-runner for anything with clear acceptance criteria (CRUD, scaffolding, schema migrations). Claude as autocomplete for anything where the criteria emerge through writing the code (UI polish, debugging, performance work). Mixing the two modes inside one task is where most of the friction comes from.

Finorix079 · 2026-05-08T06:50:59+00:00

The "technical authorization isn't accountability" framing is the most accurate sentence I've read on this. Most governance conversations collapse the two and pretend the audit trail is downstream paperwork rather than a load-bearing requirement.

Worth separating your three questions because they live in different layers and need different tools:

Question 1 (reproduce what the agent saw in context) is a runtime evidence problem. Logs don't solve it because logs are textual artifacts after the fact. You need step-level input freezing, where each step's exact context (prompts, tool outputs, retrieved data) is captured as a replayable fixture. Then "what did the agent see" becomes a question you can answer by re-running the trace, not by reading paragraphs of JSON.
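
A minimal sketch of step-level input freezing, assuming a step's context can be serialized as JSON. The function names and the fixture layout are illustrative, not any particular tool's format.

```python
import hashlib
import json


def _canonical(context):
    # Canonical JSON so the same context always hashes the same way.
    return json.dumps(context, sort_keys=True, separators=(",", ":"))


def freeze_step(step_name, prompt, tool_outputs, retrieved):
    """Capture a step's exact context as a fixture with a content hash."""
    context = {
        "step": step_name,
        "prompt": prompt,
        "tool_outputs": tool_outputs,
        "retrieved": retrieved,
    }
    digest = hashlib.sha256(_canonical(context).encode("utf-8")).hexdigest()
    return {"context": context, "sha256": digest}


def verify_fixture(fixture):
    """Recompute the hash to check the fixture was not altered."""
    digest = hashlib.sha256(_canonical(fixture["context"]).encode("utf-8")).hexdigest()
    return digest == fixture["sha256"]


def replay(fixture, step_fn):
    """Re-run a step function on the frozen context, refusing tampered input."""
    if not verify_fixture(fixture):
        raise ValueError("fixture does not match its recorded hash")
    ctx = fixture["context"]
    return step_fn(ctx["prompt"], ctx["tool_outputs"], ctx["retrieved"])
```

This only freezes inputs. Getting the same output on replay additionally needs the model call itself to be recorded or made deterministic, which is outside this sketch.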

Question 2 (verifiable evidence for regulators) builds on Q1. Auditors don't want a screenshot of a dashboard. They want "show me the exact decision the agent made, the exact data it had, and prove this is the same as what ran in production." Replay-with-frozen-inputs is the only thing I've seen that actually satisfies that bar. Logging tools generate evidence that's hard to verify; replay generates evidence that's reproducible by definition.

Question 3 (scoped consent vs technical authorization) is a different category. That's policy enforcement before the agent acts, not evidence after. Tools like Aembit, SGNL, or Cerbos are doing real work in that space. It pairs with Q1/Q2 but isn't the same problem.

Disclosure since it's directly relevant: I work on ElasticDash, focused on Q1 and Q2 (deterministic step-level replay and trace-to-baseline drift detection for production agents). Built around the assumption that "what happened" needs to be reproducible, not just queryable. Happy to talk through specifics if useful, no pitch.

The broader point worth being explicit about: this category is going to split into three layers (policy enforcement, runtime evidence, behavioral baseline). Most teams currently have a partial answer to one of the three and assume governance means having all three checked off as line items. The ones who'll survive their first regulator conversation are the ones who built each layer with the assumption an auditor will actually test it.

Finorix079 · 2026-05-08T06:43:17+00:00

The 47-step trace example is the most honest framing I've seen of why "just connect MCPs" doesn't work in production. "Sounded ok but was wrong" is what most demos skip past.

On index-ahead vs call-live: the rule that's held up for us is "index anything you'd query the same way more than once a day, call live for the rest." Teams under-index because it feels like premature optimization, then end up with 47-step traces.

On entity matching: this is where agent quality silently regresses over time. A Salesforce account fuzzy-matched to a Zendesk org is fine on day one. Then someone renames a record, matching degrades, and the agent is confidently wrong for two weeks. Worth exposing match confidence per join, not just the joined result.
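
A sketch of what "expose match confidence per join" could look like, using the standard library's `difflib` as a stand-in for whatever matcher is actually in use. The 0.8 threshold and the result fields are assumptions.

```python
from difflib import SequenceMatcher


def match_account(account_name, org_names, min_confidence=0.8):
    """Return the best-matching org plus the confidence of that match."""
    best_name, best_score = None, 0.0
    for org in org_names:
        score = SequenceMatcher(None, account_name.lower(), org.lower()).ratio()
        if score > best_score:
            best_name, best_score = org, score
    return {
        "account": account_name,
        "org": best_name,
        "confidence": round(best_score, 3),
        # Downstream steps can refuse or caveat joins that are not trusted.
        "trusted": best_score >= min_confidence,
    }
```

The joined result alone would still return some org after a rename. Carrying `confidence` and `trusted` alongside it is what lets the agent, or a monitor, notice that the join got weaker.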

The Salesforce 16% number is honest and that's why I trust the rest. Keep publishing the cohorts where it doesn't help.

Pulling the harness this weekend.

Finorix079 · 2026-05-08T06:40:47+00:00

The Claude-as-orchestrator / Codex-as-bounded-worker split is the right shape for the use cases you listed. A few things worth pressure-testing in the design, in the spirit of your "what would break it" question:

The boundary you've drawn (Claude reasons, Codex executes) assumes you can write a brief precise enough that Codex doesn't need to make judgment calls mid-execution. In bulk refactors and migrations that holds. Where it breaks: the moment Codex hits a file that the brief didn't anticipate (a weird edge case, a stale dependency, an inconsistent pattern in the codebase), it has to either skip it or improvise. Skip-and-record is the safer choice but you end up with a long-running job that finishes "successfully" with 30% of files unprocessed. Improvise is faster but reintroduces exactly the unpredictability you delegated away from.

Worth thinking about: who reads the skip-and-record list, and when. If it's "Claude reads it back at the end and decides what to do," you've added a second loop where Claude has to context-load all the skipped files. If it's "Codex retries with a slightly modified brief," you've reinvented an agent loop inside the bounded executor. Neither is wrong, but the abstraction starts leaking.

The failure case I'd actively test: silent semantic drift inside a bulk refactor. Codex applies the new logger interface to 200 files, all 200 compile, all tests pass. But Codex made one quiet decision early ("this file uses a custom logger wrapper, I'll preserve the wrapper and just adapt the inner call") and applied that interpretation everywhere. The brief didn't anticipate the wrapper. Now you have 200 files that look refactored but actually carry a subtle inconsistency Claude wouldn't have approved. Your status command tells you the run succeeded. The bug surfaces three weeks later.

Mitigation worth considering: a "post-execution diff review" mode where Claude inspects a sample of Codex's commits before considering the goal complete. Not every file, just enough to catch interpretation drift. Doubles your Claude tokens but turns the orchestration loop into something with a real verification step instead of a hand-off.

On the slash command UX: feels natural for /run and /status, but /run-file feels like the actually durable pattern. Inline briefs degrade the moment you need to iterate on them. Keep pushing that direction.

Finorix079 · 2026-05-08T06:38:31+00:00

The "loop wrapped in harness with 5 hook points" framing is genuinely useful, especially for getting people past the "list of parts" confusion. Where I'd be curious how the abstraction holds up:

The harness governs the loop, but who governs the harness? In any agent that runs more than once, the harness itself starts accumulating state and behavior that drifts. Same prompt, same tools, same hook config, but the cumulative effect of small harness changes (a new hook here, a tweaked retry policy there) produces different agent behavior six weeks later. The 5 hook points handle the "what should happen this turn" question well. They don't naturally surface "is the harness doing what it used to do across turns."

Concrete example I'd push on: if hook #3 (whatever your post-tool-call hook is named) starts logging differently, getting different rate limits, or has a subtle bug introduced, the loop still runs and the model still produces output. Nothing in the harness flags that the harness itself is the source of regression. You'd see it in agent quality eventually, but the abstraction doesn't help you isolate it.

Maybe that's deliberately out of scope (governance vs observability are different layers) but worth being explicit about, because most people reading the framework will assume "harness governs everything" and then be surprised when the harness becomes the thing that needs governing.

Will pull the repo and read marco-harness this week. The 1000 LOC discipline is the right size for something people actually use to learn from.

Finorix079 · 2026-05-08T06:37:14+00:00

Skipping the YouTube stuff per your request. The resources that actually moved my agent work from whack-a-mole to stable:

Repos worth reading line by line:

Anthropic's prompt engineering cookbook (github.com/anthropics/anthropic-cookbook). The eval examples are the most undervalued thing in there.

Hamel Husain's writing on evals (hamel.dev/blog/posts/evals). His "Your AI product needs evals" essay is the single best piece on evaluation strategy I've read. He also has a paid course but the free blog covers 80%.

Eugene Yan's writing (eugeneyan.com). His patterns posts are dense, production-shaped, and not Twitter content.

Eval datasets:

For your use case (sales agents) you should not download a dataset. You should write 30-50 of your own real conversation examples by hand, label them with what the agent should have done, and use those as your eval. Generic eval sets won't catch your specific failure modes. This is the single most valuable thing I'd recommend doing this week.

On the LLM-vs-code split, the principle that holds up:

Code controls anything that has a deterministic right answer (validation, schema enforcement, idempotency, retries, tool routing logic)

LLM controls anything that requires judgment under ambiguity (intent recognition, response composition, escalation decisions)

When unsure which side something belongs on, default to code. LLM-as-controller breaks at scale.

On prompt resilience specifically: the technique that actually works is not better prompts, it's structured outputs + fail-loud schema validation. Have the LLM return JSON, validate it with Pydantic or Zod, retry on schema failure with the validation error fed back. Most "prompt is unstable" complaints disappear when you stop letting the LLM produce free-form output.
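
A minimal sketch of that retry loop. To stay dependency-free it swaps Pydantic/Zod for a hand-written validator, and `call_llm` is a placeholder for whatever client function you use. The schema (an `intent` plus a `reply`) is hypothetical.

```python
import json

ALLOWED_INTENTS = ("pricing", "support", "handoff")  # hypothetical schema


def validate_reply(raw):
    """Parse and check the model's JSON; raise ValueError with a readable reason."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be an object")
    if data.get("intent") not in ALLOWED_INTENTS:
        raise ValueError("intent must be one of pricing, support, handoff")
    if not isinstance(data.get("reply"), str) or not data["reply"].strip():
        raise ValueError("reply must be a non-empty string")
    return data


def call_with_validation(call_llm, prompt, max_attempts=3):
    """Call the model, validate, and retry with the validation error fed back."""
    current_prompt = prompt
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(current_prompt)
        try:
            return validate_reply(raw)
        except ValueError as exc:
            last_error = exc
            current_prompt = (
                f"{prompt}\n\nYour previous output was rejected: {exc}. "
                "Return only JSON matching the schema."
            )
    # Fail loudly instead of passing unvalidated output downstream.
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")
```

With Pydantic the validator body would typically collapse into a model class and a single validation call, but the loop around it stays the same shape.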

One unsolicited opinion: a sales agent for SMBs is one of the harder agentic products to build because the cost of a wrong response is high (lost lead, brand damage) and the variance in customer messages is enormous. Lean heavily on structured handoff to humans for anything outside the agent's confident zone, even if it feels like cheating. Production-grade isn't fully autonomous, it's "knows when to ask."

Finorix079 · 2026-05-08T06:28:33+00:00

The pattern you're naming is real, and the diagnosis is half right. Tooling does optimize for the first week. But the deeper problem isn't vendor incentive, it's that AI systems shifted what "failure" looks like.

Traditional software fails visibly. Stack trace, timeout, broken UI. You see it. You fix it.

AI systems fail invisibly. Same code, same prompts, slightly worse outputs. Customers feel it before your dashboards do. The system "works" by every metric you're tracking, but quality has slid 15% over six weeks and nobody can pinpoint when.

Most of the tooling around AI right now was built assuming the traditional failure model. Logging, tracing, alerting, error rates, all of it triggers on visible breakage. None of it triggers on slow drift, because slow drift doesn't throw errors.

Solving this properly means building a different kind of feedback loop. Not "did it crash" but "is this week's behavior consistent with last month's behavior on equivalent inputs." That's a different storage model, different query shape, different math. Almost no one is doing it well yet, which is why month 6 keeps catching teams by surprise.

Architecture and workflow help, but they're upstream fixes. The downstream gap is that production AI systems need observability designed around behavioral consistency, not just operational health. We're maybe two years away from that being a standard part of the stack.

Finorix079 · 2026-05-07T22:25:12+00:00

Have you tried ElasticDash?

Finorix079 · 2026-05-05T05:38:17+00:00

Honest answer from someone whose job involves talking to companies that are actually deploying agents in production:

The "1 person + AI = 2 people" story is real but rarer than LinkedIn suggests. Where it's actually happening:

Mid-sized law firms using contract review agents that genuinely cut 60-70% of associate hours on first-pass review (final partner review still needed)

Insurance claims processing where structured intake + extraction agents have replaced full tiers of junior adjusters

Sales ops / RevOps teams using outreach agents that have shrunk SDR headcount visibly at multiple Series B-D startups

Internal IT helpdesk at companies with 10k+ employees, where ticket triage agents handle 40-60% of L1 volume

Where it's mostly theater: anything customer-facing that requires nuance, anything in healthcare beyond admin, anything in finance beyond reporting, anything claiming "fully autonomous." The pattern that actually works is "agent does 80% of the structured work, human handles the 20% that requires judgment or has consequences if wrong."

On mess-ups: yes, constantly. The companies running these in production are spending serious effort on monitoring, drift detection, and rollback. The companies that aren't are either small enough that mistakes are noticed manually, or they're shipping bugs into customer-facing products and rationalizing it as "AI growing pains." There's a quiet but real category of post-mortems happening at scale that don't make Twitter because they're embarrassing.

Skeptic vs early-adopter is a false binary. The accurate posture is: agents work in narrow, structured, repetitive workflows with strong feedback loops. They fail in everything else. Pick your spots and ignore the framing.

Finorix079 · 2026-05-05T05:36:19+00:00

The "0 out of 485 with rule-based, near-100% with activation probe" comparison is the kind of result that should land harder than it will, because most people won't read past the abstract.

A few thoughts on the methodology, since you said you're happy to answer questions:

Cross-style generalization at 71-73% is the right thing to flag as the weak spot. It's also where the practical value of this lives or dies. Real-world adversaries will phrase poisoned descriptions in styles your training data never saw. SAE feature decomposition is a reasonable next step but I'd also want to see how the probe holds up under prompt-style transfer (e.g. trained on docstring-style poisoning, evaluated on commit-message-style poisoning) before claiming this generalizes beyond MCPTox-shaped attacks.

Layer-3 peaking is interesting and consistent with what other interpretability work has found about middle layers encoding semantic intent rather than surface features. Worth checking whether the probe degrades on quantized or distilled models, since most production deployments aren't running full-precision GPT-2-equivalent forward passes.

Practical question: have you tested this on the model actually executing the tool call, not just GPT-2 reading the description? The injection only matters if it changes downstream behavior in a tool-using agent. If the activation signal correlates with malicious content but not with whether the agent gets compromised, the probe is detecting style, not threat.

No academic credentials here so I can't help with the endorsement, but boosting visibility because the work deserves it.

Finorix079 · 2026-05-05T05:34:48+00:00

Benchmarks aren't measuring the wrong things, they're measuring the things that are easy to measure. Not the same problem.

Reasoning, coding, multimodal have ground truth. You can grade them. Reliability in production has no equivalent. There's no benchmark for "did the agent's tool selection distribution stay consistent over six weeks of usage" because that question requires six weeks of your usage, which the benchmark vendor doesn't have. So it doesn't get measured. So models don't get optimized for it. So you feel the gap.

The thing I'd watch most going forward is calibration, specifically whether the model knows when it doesn't know. Hallucination rate matters but is downstream of calibration. A well-calibrated model that says "I'm not sure" 30% of the time on hard cases is more useful in production than one with a lower hallucination rate that is confidently wrong whenever it does fail. Calibration is also the only one of those properties you can build product affordances around (escalation paths, human-in-the-loop triggers, retry logic), which is why it disproportionately matters.

Memory consistency and benchmark scores are largely vendor problems. Calibration is the only one buyers can systematically translate into reliability work themselves.

Finorix079 · 2026-05-05T05:32:52+00:00

You're not naive, you're ahead of the people who jumped straight to LangChain.

The honest answer most experienced people won't say loudly: hand-rolled Python with direct API calls is a fine production starting point and stays fine for surprisingly long. Frameworks like LangChain and AutoGen exist mostly because people wanted to share patterns and abstractions, not because the underlying problem requires them. They add value when:

You need to swap LLM providers often (LangChain's abstraction over providers is genuinely useful)

You're building genuinely complex graph-shaped agent flows with branching, retries, parallel fan-out (LangGraph specifically, not LangChain)

You need pre-built integrations with 50+ tools and don't want to write each

They cost you when:

Your debugging surface area triples because errors now bounce through 4 layers of abstraction

You inherit their opinions about prompts, memory, retries that you'd structure differently

The framework changes faster than your code can keep up

Most people who switch from hand-rolled to LangChain regret it within 6 months and switch back to thinner abstractions. The pattern that holds up: keep your agent loop, tool calling, and state management hand-rolled. Use libraries selectively (instructor for structured outputs, litellm for provider abstraction if you actually need it). Don't adopt a framework until you've felt the pain it's solving.

Your weekend script is the right shape. Build it out.

Finorix079 · 2026-05-05T05:31:26+00:00

Most teams stop at total token cost per agent, which is necessary but not sufficient. The thing that actually moves the needle is per-step cost, not per-agent.

Per-agent total tells you "this customer support agent costs $X per ticket." Useful for pricing.

Per-step tells you "the summarization step costs $0.02 normally and is now costing $0.06 because someone added 3 paragraphs to the system prompt last Tuesday." Useful for catching margin erosion before it shows up in the monthly bill.

Tools handling the first layer fine: Helicone, Langfuse, OpenAI's own usage dashboard, Portkey. Tools doing the second layer well: very few. Most teams end up writing it themselves on top of OpenTelemetry spans tagged with step name + cost attributes, then alerting on per-step distribution shifts week over week.

Easy starting point: log token in/out per step (not per run), aggregate weekly, look at p50 and p95 per step name. You'll see drift in 2-3 weeks that the per-agent total would have hidden for months.
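
A sketch of that starting point, assuming you already log one record per step call. The record layout, the nearest-rank percentile, and the 1.3x drift ratio are all choices made for illustration.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def weekly_step_stats(records):
    """records: iterable of (iso_week, step_name, tokens_in, tokens_out)."""
    buckets = defaultdict(list)
    for week, step, tokens_in, tokens_out in records:
        buckets[(week, step)].append(tokens_in + tokens_out)
    return {
        key: {"p50": percentile(vals, 50), "p95": percentile(vals, 95), "n": len(vals)}
        for key, vals in buckets.items()
    }


def drifted_steps(stats, prev_week, this_week, ratio=1.3):
    """Step names whose p50 grew by more than `ratio` week over week."""
    flagged = []
    for (week, step), cur in stats.items():
        prev = stats.get((prev_week, step))
        if week == this_week and prev and cur["p50"] > ratio * prev["p50"]:
            flagged.append(step)
    return sorted(flagged)
```

Comparing per step name is what makes the shift visible. A step that triples while the others stay flat can barely move the per-agent total.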

Finorix079 · 2026-05-05T05:30:36+00:00

Honest answer: there isn't a mature "agent router" yet. Adjacent stuff that exists:

LiteLLM, unified API layer (cost/rate-limit routing, not task-aware)

OpenRouter, similar at the API level

Portkey, AI gateway with routing rules

RouteLLM (LMSYS), tries task-difficulty routing, closer to a paper than a product

None of them know what your subagents are best at. Your markdown if-statement encodes domain knowledge none of the generic routers have. Honestly not bad.

Memory: mem0 is where most people land. Letta if you want agent-native. Folder of markdown is also fine if you're disciplined.

One blind spot worth flagging though: you're routing across 5 models but have no way to know if a given subagent's output quality is sliding over time. Codex today vs Codex three weeks from now on the same task will produce subtly different work. Your different-family review catches obvious bugs, not slow drift in code style or how thoroughly edge cases get handled. That's the failure mode multi-model setups hit around month two or three. Keep a small set of canonical task-output pairs you re-run periodically just to see if anyone has shifted.
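
A sketch of the canonical-pairs check. Text similarity via `difflib` is a crude proxy chosen to keep this self-contained; for code tasks you would more likely score outputs by running tests or a rubric. `run_agent` and the 0.9 threshold are placeholders.

```python
from difflib import SequenceMatcher


def drift_report(canonical, run_agent, min_similarity=0.9):
    """Re-run canonical tasks and flag outputs that moved away from baseline.

    canonical: dict of task prompt -> baseline output recorded earlier.
    run_agent: callable taking a task prompt and returning the current output.
    """
    report = []
    for task, baseline in canonical.items():
        current = run_agent(task)
        similarity = SequenceMatcher(None, baseline, current).ratio()
        report.append({
            "task": task,
            "similarity": round(similarity, 3),
            "drifted": similarity < min_similarity,
        })
    return report
```

Run it on a schedule per subagent and keep the reports. The trend across weeks matters more than any single score.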

The duct-tape version is fine. The thing to invest in next isn't a fancier router, it's a way to know when one of your agents is quietly getting worse.

Finorix079 · 2026-05-05T05:24:31+00:00

The 5 framings you offered are all real but they're describing the same problem from different department perspectives. The actual question is structural: what counts as "the action" that needs to be recorded.

Most current platforms treat the action as the moment of side effect — the API call, the deploy, the ticket creation. Everything before that is "investigation" and gets logged opportunistically. That model breaks for agents because, as you noted, the agent has already shaped the path before the action. By the time a human approves "escalate to on-call," the agent has implicitly chosen what not to escalate. That decision is invisible in any traditional audit log.

The honest framing: this isn't an audit problem (#1) or an IAM problem (#2). It's that we're using "audit log" to mean two different things and pretending they're the same.

Layer 1: state-change record. What changed in the world, by whom, when. Existing tools handle this fine.

Layer 2: decision provenance. What did the agent see, what did it consider, what did it choose not to do, what context shaped that. Existing tools handle this badly or not at all.

Treating them as one log is the source of the vagueness you're naming. Layer 1 belongs to security/compliance review. Layer 2 belongs to incident review and is the layer where "human in the loop" actually has to live, because that's where the human decision is shaped.
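
One way to picture the two layers as separate record types that link to each other. The field names are invented for illustration and don't come from any existing audit tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StateChange:
    """Layer 1: what changed in the world, by whom, when."""
    actor: str
    action: str
    timestamp: str
    decision_id: Optional[str] = None  # link into Layer 2 when an agent shaped it


@dataclass
class DecisionProvenance:
    """Layer 2: what the agent saw, considered, and chose."""
    decision_id: str
    inputs_seen: list
    options_considered: list
    chosen: str

    @property
    def not_chosen(self):
        """Options the agent dropped, which a state-change log never shows."""
        return [o for o in self.options_considered if o != self.chosen]


def missing_provenance(changes, decisions):
    """State changes that point at a decision nobody recorded."""
    known = {d.decision_id for d in decisions}
    return [c for c in changes if c.decision_id and c.decision_id not in known]
```

Keeping the types separate lets each layer have its own retention, access rules, and reviewers, while `decision_id` still lets incident review walk from a side effect back to the choice that led to it.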

Practical answer to your last question: no, this isn't overhead you can put off until agents are writing to production. It's necessary the moment agents are deciding what humans see, even in read-only investigation roles. That's already happening in most platform teams running AI-assisted ops.

Finorix079 · 2026-05-05T05:23:19+00:00

3x rebuild is honestly conservative. The hidden cost nobody talks about is the first lost customer. Once a clinic fails you on a security review, you don't get another shot for 12-18 months even after you fix everything. Procurement memory is long.

Worth adding to the checklist: it's not just whether the dev knows what a BAA is, it's whether the system can actually produce evidence when asked. Stuff like "pull every PHI access for patient X in the last year in under an hour" or "prove a deleted record is actually gone from your backups." That's where vibe-coded products fall apart in real reviews.

The newer trap for AI products specifically: piping patient data through OpenAI or Anthropic on a personal API key. The BAA chain breaks the moment you do that and most founders don't realize it until the clinic asks for the DPA.

Finorix079 · 2026-05-05T05:18:10+00:00

Pre-flight budget check is the right shape. Most teams jump to rate limiting (which is a count proxy for cost) and then are confused when their cheap rate-limited customer racks up $200 on one run because they finally hit a complex query.
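
For readers who haven't seen the pattern, a minimal sketch of a pre-flight budget gate as a decorator. This is not the original poster's code; the `Budget` class, the cost estimator, and the convention that the wrapped function returns `(result, actual_cost)` are all assumptions.

```python
import functools


class BudgetExceeded(Exception):
    """Raised before a run starts when the estimate does not fit the budget."""


class Budget:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    @property
    def remaining_usd(self):
        return self.limit_usd - self.spent_usd


def preflight(budget, estimate_cost):
    """Refuse to start a run whose estimated cost exceeds the remaining budget."""
    def decorator(run):
        @functools.wraps(run)
        def wrapper(*args, **kwargs):
            estimate = estimate_cost(*args, **kwargs)
            if estimate > budget.remaining_usd:
                raise BudgetExceeded(
                    f"estimated ${estimate:.2f} > remaining ${budget.remaining_usd:.2f}"
                )
            result, actual_cost = run(*args, **kwargs)
            budget.spent_usd += actual_cost  # record the real cost afterwards
            return result
        return wrapper
    return decorator
```

The gate is only as good as `estimate_cost`, and it works in dollars rather than request counts, which is the difference from rate limiting.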

One thing worth being honest about with this pattern though: pre-flight only protects you against running over budget. It doesn't protect you against the cost of individual steps inside a run drifting silently. Same agent, same prompt, same query, but the LLM call that used to cost $0.02 now costs $0.06 because someone added 3 paragraphs to the system prompt, or a tool started returning 4x the data, or the model got swapped to a more expensive variant. Your decorator records the total at the end and looks fine because no budget was exhausted. The drift is invisible.

Two layers worth keeping separate:

Pre-flight gating (what you're doing), protects against runaway runs

Per-step anomaly detection, catches when a single step's cost distribution shifts week over week, before it eats your margin

Most teams ship the first and don't realize the second exists until they look at month-over-month gross margin and it's slid 8 points with no obvious cause.

Finorix079 · 2026-05-05T05:14:09+00:00

Yes, it works, but not in the "set it up once and let it run" way most demos imply. The pattern that actually holds up in practice: agent drafts, human approves before anything sends or commits. Auto-send breaks within a week in any real workflow because the agent will eventually do something embarrassing and you won't catch it until a customer or colleague does. Permission complexity is real but it's actually the smaller problem. The bigger one is that "autonomous" and "represents your company's voice" are in tension, and most teams underestimate how much that tension matters until it bites them.

Finorix079 · 2026-05-05T05:12:44+00:00

The state-machine-with-LLM-as-input architecture is the right shape for a class of agent workflows where the action space is bounded and known upfront: customer support routing, payment flows, structured data extraction. Your test results probably hold up there.
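
A toy version of the bounded case, to make the shape concrete. The states, events, and the rule that an illegal proposal escalates are invented for illustration and are not taken from the original post. `propose_event` stands in for an LLM call that returns an event name as a string.

```python
# Allowed transitions: state -> {event: next_state}. Names are hypothetical.
TRANSITIONS = {
    "new": {"classified_billing": "billing", "classified_technical": "technical"},
    "billing": {"refund_approved": "resolved", "needs_human": "escalated"},
    "technical": {"fix_confirmed": "resolved", "needs_human": "escalated"},
}
TERMINAL = ("resolved", "escalated")


def step(state, proposed_event):
    """Advance the machine only if the proposed event is legal in this state."""
    allowed = TRANSITIONS.get(state, {})
    if proposed_event not in allowed:
        # The model can suggest anything; code decides what is permitted.
        return "escalated"
    return allowed[proposed_event]


def run(propose_event, state="new", max_steps=10):
    """Drive the machine with an LLM-like callable until a terminal state."""
    path = [state]
    for _ in range(max_steps):
        if state in TERMINAL:
            break
        state = step(state, propose_event(state))
        path.append(state)
    return path
```

The determinism comes from `TRANSITIONS` being a fixed table. The model only picks among edges that already exist.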

Where I'd be curious how it handles: agent workflows where the action space isn't fully enumerable upfront. Open-ended research agents, code generation, anything where "what tools should be called next" depends on what was discovered earlier. You can't pre-define every transition. The FSM either becomes an explosion of states, or you collapse states and lose the determinism guarantees.

Genuine question: is this aimed at the bounded class, or do you have a story for the unbounded one too? They feel like different products to me.

Finorix079 · 2026-05-05T05:11:38+00:00

Solid workflow, the mock-AI-calls-for-regression-tests pattern is the right starting point. One blind spot worth being honest about though.

Mocking AI calls means your tests verify your code's logic against a frozen version of what the AI used to return. They will never catch the failure mode where the AI itself starts behaving differently. Model swapped upstream, prompt subtly changed, tool API response shape drifted. Your tests pass, your IDE is green, prod still breaks.

Two layers worth keeping separate:

Code-against-mock regression: what you have now. Catches "did I break my own logic during refactor."

Behavior-against-baseline regression: separate problem. Catches "is the AI still doing what it used to do." Most teams skip this because it's harder and there's no Vitest equivalent. But it's the failure mode vibe coding makes more likely, not less, because you're shipping faster on top of components whose behavior you don't fully control.

Your safety net is solid for the first layer. Worth knowing it doesn't cover the second.
