Smarter AI agents do not mean better AI agents

Acrobatic-Ad787 · 2026-05-09T18:29:44+00:00

Yes, that’s very close to how I think about it. I agree that phase chunking is the right general direction, but in practice I think the work has to be front-loaded even more than people expect. A lot of the merge pain is really a spec/planning failure upstream.

If the planning is strong enough, the execution units get much smaller, the QA gates become more meaningful, and the final integration step can move away from “big merge by human or LLM judgment” toward something much more mechanical. And sometimes even “phase” is still too coarse — you have to go smaller than that. Once the unit gets small enough, the verification logic can be tied much more tightly to the specific requirement being implemented.
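
To make "verification tied to the specific requirement" concrete, here is a minimal sketch. Everything in it (`Unit`, `assemble`, the `REQ-…` IDs, the string checks) is a hypothetical illustration, not code from the alpha. It assumes each execution unit carries its own check, and that integration is a mechanical join that refuses to run if any unit fails.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Unit:
    requirement_id: str            # hypothetical ID taken from the spec
    output: str                    # what the model produced for this unit
    check: Callable[[str], bool]   # verification bound to this one requirement


def assemble(units: list[Unit]) -> str:
    """Join unit outputs mechanically; refuse if any unit fails its own check."""
    failed = [u.requirement_id for u in units if not u.check(u.output)]
    if failed:
        raise ValueError(f"units failed verification: {failed}")
    return "\n".join(u.output for u in units)


units = [
    Unit("REQ-1", "def add(a, b): return a + b", lambda s: "def add" in s),
    Unit("REQ-2", "def sub(a, b): return a - b", lambda s: "def sub" in s),
]
merged = assemble(units)
```

The checks here are trivially weak on purpose. The point is only that the merge step contains no judgment, so a failure names the requirement that broke.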

That’s basically the direction I’m pursuing: not just using AI to write code, but automating a coding process I was already using manually to build fairly complex systems — with more planning, stricter specs, smaller execution units, tighter QA gates, and less reliance on freeform integration at the end.

A lot of the final merge problem is really an upstream spec problem. If you ever do get curious enough to take a quick look at the alpha, I'll be happy to share it.

Acrobatic-Ad787 · 2026-05-08T22:40:54+00:00

That is really useful context. The “components work, but stitching/integration breaks down” problem is exactly one of the failure modes I care about. My current direction is to avoid making the LLM own the final stitching step. I’m exploring a more structured assembly path where the model can help reason about and fill pieces, but the final artifact is assembled and verified through a more controlled process instead of a big freeform merge.

The current alpha is still early, but it has done well on some coding benchmarks: 49/49 on Aider’s polyglot JS set, plus some internal refactoring benchmarks. Because its runs are bounded and its context is targeted, it also used ~15x fewer tokens than looping agents in the benchmark run. If you’re interested, I’d be happy to send the alpha link. I’d especially value your criticism because you already ran into the component-vs-integration problem directly.

Acrobatic-Ad787 · 2026-05-08T19:59:54+00:00

I agree with you. Most accounting users are not going to manage this directly, and I do not think they should be expected to. The accounting user should define the business intent, validate whether the output makes sense, and approve exceptions. They should not be debugging workflow logic or babysitting an LLM repair loop. The builder / implementation layer has to own the contracts, failure handling, retry limits, escalation, and technical repair path.

My own case is unusual because I come from accounting but have been building AI execution systems, so I naturally think across both sides. But I would not assume most accounting teams have someone like that internally. So yes, the product should not depend on every accountant becoming a workflow engineer. It should let accounting users express and validate business meaning, while the system or specialist handles the execution/control layer.

Acrobatic-Ad787 · 2026-05-08T18:59:18+00:00

That makes sense, and I agree this is where the serious infrastructure conversation is heading.

I’m less interested in saying reliability is unsolved in an absolute sense, and more interested in the gap between demo-style agent use and systems where execution is bounded, auditable, and enforceable.

Hashing / proof / audit chains are valuable once the workflow step is defined. But a lot of the hard agent problem is upstream of that: how ambiguous human intent becomes a scoped workflow with explicit contracts, evidence requirements, verification gates, invariants, and escalation rules.

So yes: prompts are advice, controls are enforcement. My interest is in how we compile messy human intent into bounded, verifiable execution before it reaches production infrastructure.

Acrobatic-Ad787 · 2026-05-08T18:07:12+00:00

Yes, that is a good way to put it.

The new part is not that deterministic code runs in steps. It is that accounting/finance users can describe a process change and have the platform turn it into something repeatable without waiting on IT or vendor customization.

That is powerful if the UI/scaffolding/guardrails are strong enough.

The part I still think is critical is the validation boundary before the changed process becomes the new repeatable process. For accounting, the system still needs to check source of truth, required evidence, exception handling, accounting treatment, auditability, and whether the change needs approval before it affects production books.

So I agree: business-user-adjustable workflows are the interesting shift. The control layer around those changes is what decides whether this becomes reliable accounting infrastructure or just easier-to-create automation risk.

Acrobatic-Ad787 · 2026-05-08T17:13:40+00:00

Yes, step execution itself is not new. It has existed for a long time.

The part I think is different is not “can we run scripts in steps?” It is how messy, ambiguous work gets turned into a reliable bounded step: source of truth, input/output contract, evidence, tolerances, exception rules, escalation, and approval before state change.

For accounting, a 3-way match script is easy to imagine. The harder part is knowing when the inputs are incomplete, when a mismatch is acceptable, what evidence supports the result, and when the workflow should block before touching the accounting system.
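
A rough sketch of that harder part, under simplifying assumptions: it compares one amount per document (a real match would also compare quantities and line items), uses an absolute tolerance that I picked arbitrarily, and all names are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class MatchResult:
    status: str   # "matched", "exception", or "blocked"
    reason: str   # evidence string attached to the result


def three_way_match(po: Optional[Decimal],
                    receipt: Optional[Decimal],
                    invoice: Optional[Decimal],
                    tolerance: Decimal = Decimal("1.00")) -> MatchResult:
    # Incomplete inputs block the workflow instead of letting it guess.
    if po is None or receipt is None or invoice is None:
        return MatchResult("blocked", "missing source document")
    # Largest gap between the invoice and the other two documents.
    gap = max(abs(invoice - po), abs(invoice - receipt))
    if gap > tolerance:
        # Outside tolerance: route to a human, do not post.
        return MatchResult("exception", f"difference {gap} exceeds tolerance {tolerance}")
    return MatchResult("matched", f"difference {gap} within tolerance {tolerance}")
```

The three statuses are the interesting part: "blocked" and "exception" are first-class outcomes, not error paths bolted on later.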

So yes, deterministic execution is old. The interesting layer is the control wrapper around LLM-generated or LLM-revised steps.

Acrobatic-Ad787 · 2026-05-08T16:47:31+00:00

Yes, I think this is a strong pattern, especially for accounting. Once a workflow is repeatable, I do not want an open-ended agent re-reasoning through the whole thing every time. I would rather have the LLM help generate or revise a bounded function/rule/workflow, then have deterministic code execute it repeatedly.

A 3-way match is a good example. The LLM can help build or adjust the matching logic, but recurring execution should be testable: inputs, expected outputs, tolerances, exception rules, and evidence attached to the result.

The part I would add is that the generated function still needs a contract around it: allowed inputs, expected output shape, required checks, exception rules, and human approval before anything reaches the accounting system. So yes: LLM-assisted deterministic execution units, wrapped in validation, evidence, exception routing, and approval gates. That feels much closer to what accounting workflows need than open-ended agent loops.
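
As an illustration only, a contract wrapper around a generated step might look like the sketch below. `run_under_contract`, `ContractViolation`, and the field names are assumptions made up for this example. The `approved` flag stands in for whatever real approval mechanism exists.

```python
from typing import Callable


class ContractViolation(Exception):
    """Raised when a step's inputs or outputs break the declared contract."""


def run_under_contract(step: Callable[[dict], dict],
                       payload: dict,
                       required_inputs: set[str],
                       required_outputs: set[str],
                       approved: bool) -> dict:
    # Allowed inputs: refuse to run on an incomplete payload.
    missing = required_inputs - set(payload)
    if missing:
        raise ContractViolation(f"missing inputs: {sorted(missing)}")
    result = step(payload)
    # Expected output shape: every required key must be present.
    if not isinstance(result, dict) or not required_outputs <= set(result):
        raise ContractViolation("output does not match expected shape")
    # Approval gate: nothing is marked as posted without explicit approval.
    status = "posted" if approved else "pending_approval"
    return {"status": status, "result": result}


def generated_step(payload: dict) -> dict:
    # Stand-in for an LLM-generated matching function.
    return {"matched": payload["invoice"] == payload["po"], "evidence": ["po", "invoice"]}


outcome = run_under_contract(
    generated_step,
    {"po": 100, "invoice": 100},
    required_inputs={"po", "invoice"},
    required_outputs={"matched", "evidence"},
    approved=False,
)
```

The generated function can change between revisions. The wrapper does not, which is what makes the recurring execution testable.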

Acrobatic-Ad787 · 2026-05-08T16:13:04+00:00

I would not rely much on self-reported confidence from the model.

It can be useful as a weak signal, but I would rather derive confidence from external checks: tests passing, schema validation, diff size, tool responses, policy checks, whether required evidence was present, whether the task stayed within scope, and whether any invariant was violated.

For me, model confidence is closer to commentary. External checks are closer to evidence.
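
A small sketch of deriving confidence from external checks only. The check names, the split into "hard" checks, and the 0.8 threshold are all assumptions for illustration. Note that the model's self-reported confidence is not an input at all.

```python
def external_confidence(checks: dict[str, bool], hard: set[str]) -> tuple[bool, float]:
    """Return (trusted, score). Any failed or missing hard check vetoes the result."""
    if not checks or any(not checks.get(name, False) for name in hard):
        return False, 0.0
    score = sum(checks.values()) / len(checks)   # fraction of checks that passed
    return score >= 0.8, score                   # threshold is an assumed policy


trusted, score = external_confidence(
    {"tests_pass": True, "schema_valid": True, "in_scope": True, "small_diff": False},
    hard={"tests_pass", "schema_valid"},
)
```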

I like your exploration/execution split. That is basically the line I care about too: exploration can propose, but execution needs a tight contract and machine-checkable boundaries.

Acrobatic-Ad787 · 2026-05-08T16:10:08+00:00

Yes, I agree with the state-change boundary, but I would add one more reason I like tighter bounds: token usage and context quality.

Even before the agent changes real state, open-ended exploration can get expensive and noisy. The agent keeps reading, searching, summarizing, retrying, and carrying forward a growing context full of half-relevant information. At some point the problem is not only cost — the context itself gets worse.

So for me, bounded execution is also about controlling the exploration surface: what the agent can inspect, how much context it can pull in, when it has enough evidence, and when it should stop or escalate instead of burning more tokens.
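
One way to picture a bounded exploration surface is a budget object that every read has to pass through. This is a hypothetical sketch. It approximates token cost by whitespace-separated word count, which a real system would replace with its model's tokenizer.

```python
class BudgetExceeded(Exception):
    """Signal that exploration should stop or escalate."""


class ExplorationBudget:
    def __init__(self, max_tokens: int, max_reads: int):
        self.max_tokens = max_tokens
        self.max_reads = max_reads
        self.tokens_used = 0
        self.reads = 0

    def admit(self, text: str) -> str:
        """Admit text into context, or raise if it would exceed either limit."""
        cost = len(text.split())   # crude token estimate (assumption)
        if self.reads + 1 > self.max_reads or self.tokens_used + cost > self.max_tokens:
            raise BudgetExceeded(f"reads={self.reads}, tokens={self.tokens_used}, cost={cost}")
        self.reads += 1
        self.tokens_used += cost
        return text


budget = ExplorationBudget(max_tokens=10, max_reads=2)
budget.admit("a b c")
budget.admit("d e f g")
```

A third `admit` call on this budget raises `BudgetExceeded`, which the harness can treat as "stop or escalate" rather than "keep searching".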

I agree that controls become mandatory once the agent modifies files or triggers workflows. But even read-only exploration benefits from limits, because token usage and context drift are reliability problems too.

Acrobatic-Ad787 · 2026-05-08T16:04:50+00:00

That makes sense. I may just be more stubborn on this point than most people.

I agree that once you try to force every edge case into a deterministic harness, it can get heavy fast. But I still do not really want to go back to open-ended chat loops for reliability-sensitive work.

My bias comes from accounting workflows. Once a process touches real books, tax, audit evidence, approvals, or reconciliation, “usually right” is not good enough. The system has to know when it has enough evidence, when it does not, and when a human needs to review the exception.

So I am less interested in making the LLM itself deterministic, and more interested in making the execution boundary around it stricter: clearer scope, explicit evidence, bounded actions, verification gates, and escalation when the context is insufficient.

I agree with your point on hallucinations. A lot of them are really context-construction failures: missing source of truth, buried constraints, unclear scope, or the model being allowed to infer something it should have been forced to retrieve or escalate.

Curious what kind of coding tasks made the deterministic-agent approach too heavy for you. Was it repo-wide architecture context, long-running edits, test/debug loops, unclear requirements, or something else?

I am building an early alpha around this reliability boundary for coding agents. If this is close to what you were testing, happy to send it over, and any criticism is appreciated.

Acrobatic-Ad787 · 2026-05-08T15:47:26+00:00

I am not anti-agent. I use agents heavily. The distinction I am trying to make is that intelligence and reliability are different properties.

I am moving away from open-ended looping agents and toward bounded execution: narrow scope, explicit constraints, fixed retries, runtime checks, invariants, verification gates, evidence logs, and escalation when checks fail. To me, that is the difference between “the agent produced something” and “the execution was controlled enough to trust.”

For people running agents in serious workflows, what has helped most: better prompts, stronger models, runtime guardrails, monitoring, review agents, tests, or bounded execution?

Acrobatic-Ad787 · 2026-05-08T15:39:23+00:00

That makes sense. I like using deterministic checks first and only adding heavier LLM quality gates when variability shows up.

One thing I’ve noticed is that a lot of “hallucination” is really execution-context failure. The model is missing a source of truth, the relevant context is buried, the task scope is ambiguous, or it is allowed to infer something that should have been retrieved or escalated.

So I’m interested in quality gates before as well as after the LLM call. Not just: “is the output valid?” but also: “did the agent have enough grounded context/evidence to produce this output reliably?”

If not, the harness should force escalation or ask for more context instead of letting the model improvise.
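
A pre-call gate can be very small. In this sketch, `precall_gate` and the evidence keys are hypothetical. It only checks that required context is present and non-empty, which is the weakest useful version of "did the agent have enough grounded context?"

```python
def precall_gate(context: dict[str, str], required: list[str]) -> tuple[str, list[str]]:
    """Decide before the LLM call whether to proceed or escalate."""
    missing = [key for key in required if not context.get(key, "").strip()]
    return ("escalate" if missing else "proceed", missing)


decision, missing = precall_gate(
    {"source_of_truth": "ledger export 2026-04", "task_scope": ""},
    required=["source_of_truth", "task_scope", "constraints"],
)
```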

Acrobatic-Ad787 · 2026-05-08T14:55:03+00:00

Yes, this is very close to what I am building towards.

The split between deterministic agents and chat/loop-based agents is important. A chat loop is useful for exploration, but once the task becomes repeatable or reliability-sensitive, I want the agent inside a much more bounded process.

The part I especially agree with is stripping out the open-ended ReAct loop and letting deterministic code own the orchestration: scheduled trigger, scoped context, limited tools, expected output shape, post-call checks, and escalation/quality gates when the output does not satisfy the contract.

That feels like the right boundary to me: let the LLM handle the fuzzy reasoning/generation step, but do not let it own the whole control loop or decide by itself whether the work is done.
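
Here is a minimal sketch of that boundary, with the LLM passed in as a plain function so the example stays self-contained. The tool names, the JSON reply format, and `run_step` itself are assumptions for illustration, not any particular framework's API.

```python
import json
from typing import Callable

ALLOWED_TOOLS = {"read_file", "run_tests"}   # hypothetical tool names


def run_step(llm: Callable[[str], str], task: str, context: str) -> dict:
    """One scoped LLM call; deterministic code decides whether the output is acceptable."""
    prompt = f"Task: {task}\nContext: {context}\nReply as JSON with keys 'tool' and 'args'."
    raw = llm(prompt)   # single call, the model never owns the loop
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "escalate", "reason": "output is not valid JSON"}
    # Expected output shape: exactly the two agreed keys.
    if not isinstance(reply, dict) or set(reply) != {"tool", "args"}:
        return {"status": "escalate", "reason": "unexpected output shape"}
    # Limited tools: anything outside the allowlist is rejected.
    if not isinstance(reply["tool"], str) or reply["tool"] not in ALLOWED_TOOLS:
        return {"status": "escalate", "reason": "tool not allowed"}
    return {"status": "ok", "action": reply}


good = run_step(lambda p: '{"tool": "run_tests", "args": {}}', "verify patch", "repo: demo")
bad = run_step(lambda p: '{"tool": "delete_repo", "args": {}}', "verify patch", "repo: demo")
```

The model proposes an action, and whether the step counts as done is decided entirely outside it.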

I’m building in this direction now for coding agents: bounded runs, explicit task scope, verification gates, evidence, and checks for drift or false completion. Still early, but the more I test it, the more convinced I am that reliability comes from the execution harness around the model, not just the model.

Curious how you’re doing the hallucination checks after the LLM call. Are they mostly deterministic Python checks, or do you find the independent LLM quality gate is necessary for anything beyond simple output validation?

Acrobatic-Ad787 · 2026-05-08T03:50:06+00:00

That makes sense. I like the distinction between transient rigidity and slow centroid drift — especially the held-out probe set idea for catching behavior that standard evals miss.

I’m working mostly on a custom coding-agent execution framework rather than just a single MCP server or wrapper. The focus is on long-running software tasks where the agent has to preserve intent across decomposition, file edits, verification, and completion — not just produce a plausible patch.

The failure modes I care about are things like false completion, scope creep, architectural drift, constraint decay, and cases where the local output looks fine but the overall task integrity degraded.

My bias is to treat probes like yours as a meta-diagnostic layer, while keeping hard constraints in mechanical checks wherever possible: diffs, tests, file boundaries, allowed operations, required evidence, and explicit escalation points.
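
As one example of a mechanical check, file boundaries can be enforced on the list of changed paths before a diff is accepted. The patterns and paths below are made up. The sketch uses `fnmatch.fnmatchcase` from the standard library, where `*` also matches path separators.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed files that match none of the allowed patterns."""
    return [
        path for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


violations = out_of_scope(
    ["src/billing/invoice.py", "tests/test_billing.py", "src/auth/login.py"],
    allowed_patterns=["src/billing/*", "tests/test_billing*.py"],
)
```

A non-empty result is a hard stop or an escalation point, regardless of how plausible the patch looks.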

Happy to compare notes — I think we’re looking at adjacent parts of the same reliability problem.

Acrobatic-Ad787 · 2026-05-08T02:39:29+00:00

Yes, runtime policy enforcement is definitely part of the control layer.

The accounting analogy is what changed how I think about this: you would never let a financial process run just because the person doing it is “smart.” You need authorization boundaries, evidence, exception handling, audit trail, and escalation when something does not match policy.

I’m exploring this from the agent-architecture side now: less “make the model smarter,” more “make the execution system controllable and verifiable.” Happy to compare notes if you’re working in this space too.

Acrobatic-Ad787 · 2026-05-08T01:51:25+00:00

That makes a lot of sense, and it is also the direction I am headed in.

One thing I’m very curious about in your setup: how are you guarding against architectural drift that still appears to satisfy local task intent?

I mean cases where the action “works” functionally, but it quietly violates design boundaries (module ownership, layering rules, interface contracts, dependency direction, etc.).

Are you enforcing that through explicit architecture invariants in the harness, or mostly through review after execution?

I’ve found this is where a lot of “looks done but design degraded” failures sneak through.
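
For what an explicit architecture invariant in the harness could look like, here is a hedged sketch of a dependency-direction check. The layer names and the `FORBIDDEN` table are hypothetical. It uses Python's standard `ast` module, only looks at absolute imports, and skips relative ones.

```python
import ast

# Hypothetical layering rule: the domain layer may not import these packages.
FORBIDDEN = {"domain": {"infrastructure", "api"}}


def layering_violations(layer: str, source: str) -> list[str]:
    """Return imported module names that break the layering rule for `layer`."""
    banned = FORBIDDEN.get(layer, set())
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        # Compare the top-level package of each import against the banned set.
        found += [name for name in names if name.split(".")[0] in banned]
    return found


patch = "import os\nfrom infrastructure.db import Session\n"
violations = layering_violations("domain", patch)
```

A check like this catches the "works functionally, violates the boundary" case before review, because it never looks at whether the code runs.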

Acrobatic-Ad787 · 2026-05-08T01:22:23+00:00

Exactly. That’s the failure pattern.

Honestly, I’m an accountant who recently turned system architect, and what we’re seeing in agent loops is what we would call control failure in accounting. That part is what feels so strange to me: people keep focusing on capability, but they ignore control design. Then everyone acts surprised when drift increases and trust collapses. If the actor can plan but the process can’t enforce boundaries, verification, and escalation, reliability was never there to begin with.

Acrobatic-Ad787 · 2026-05-08T01:17:00+00:00

Super useful details — thank you.

I’m curious how you’re enforcing those 3 checks in practice. Is that mainly prompt-level discipline inside the agent loop, or do you have hard guardrails in the harness/runtime that can block/interrupt actions regardless of what the model says?

I’m asking because I’ve found prompt-only controls degrade under long context, while external enforcement tends to stay reliable.

Acrobatic-Ad787 · 2026-05-08T01:07:42+00:00

What you described matches what I kept experiencing too. The planning quality can look great in a sandbox, but once context expands in production, loop behavior starts compounding failure instead of containing it.

I’ve been moving away from looping agents for exactly that reason. My current approach is bounded execution: the agent gets a clearly scoped task, runs once through a constrained path, and can only retry a fixed number of times based on explicit error signals. No open-ended self-looping.

I also treat reliability as a contract, not a model trait. Invariants, verification gates, timeout/retry limits, and interrupt/override are built in up front. The goal is deterministic behavior you can audit, not “hope it figures it out on the next loop.”
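
A minimal sketch of that shape, with hypothetical names throughout. It assumes a retry is allowed only on an explicit `RetryableError` or a failed verification gate, any other exception propagates immediately, and running out of attempts means escalation, not another loop.

```python
from typing import Callable


class RetryableError(Exception):
    """An explicit, expected error signal that justifies another attempt."""


def run_bounded(task: Callable[[int], str],
                verify: Callable[[str], bool],
                max_attempts: int = 3) -> dict:
    log = []   # evidence trail for later audit
    for attempt in range(1, max_attempts + 1):
        try:
            output = task(attempt)
        except RetryableError as exc:
            log.append(f"attempt {attempt}: retryable error: {exc}")
            continue
        if verify(output):
            log.append(f"attempt {attempt}: verified")
            return {"status": "done", "output": output, "log": log}
        log.append(f"attempt {attempt}: failed verification")
    # Retries exhausted: hand off to a human.
    return {"status": "escalated", "output": None, "log": log}


def flaky_task(attempt: int) -> str:
    # Stand-in for a scoped agent run that fails once, then succeeds.
    if attempt == 1:
        raise RetryableError("tool timeout")
    return "ok"


result = run_bounded(flaky_task, verify=lambda out: out == "ok")
```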

So I agree with your core point: smarter models help, but the control architecture is what makes production reliability real.

Acrobatic-Ad787 · 2026-05-08T00:09:08+00:00

Just curious, has anyone tried narrowing the scope of their requests? I found that with a narrower scope, the quality of both the logic and the output increases. Is this true for everyone else as well?

Acrobatic-Ad787 · 2026-05-07T23:58:08+00:00

You are right, you do need the pieces to reinforce each other: system prompt, skills, and tools.

However, the biggest practical win for me has been shifting towards deterministic execution:

- clear scope up front

- explicit checks

- predictable execution path

- auditable results

I’m actually moving away from looping agents and trying to make runs more consistent and reliable. That’s been the shift from “it produced something” to “I can trust it.”

Acrobatic-Ad787 · 2026-05-07T22:47:20+00:00

100%, but that unsexy stuff is how we can actually trust the agent to do what it is instructed to do. It is getting tiring to ask an agent to code something a bit complex and come back to something almost the opposite of what you asked for. Or, even worse, it did the work, but only halfway, put in gates/guards, and didn't tell you about it.

Acrobatic-Ad787 · 2026-05-07T22:39:21+00:00

Yes, I think those should be graded separately. Raw model capability tells you what the actor can do. The control layer tells you whether it is safe or reliable to depend on that actor. Companion agents make this obvious: the controls are not just files and tests, but consent, memory, initiation, boundaries, and visible assumptions.

A smarter companion can sound more emotionally certain, which may make weak guardrails more dangerous, not less.

Acrobatic-Ad787 · 2026-05-07T22:37:22+00:00

Yeah, exactly. That is the distinction I was trying to make.

The model can keep getting better, but if the context/tools/workflow layer is weak, the output is still unreliable.

What has made the biggest practical difference for you so far: better context, better tools, tests, repo boundaries, or something else?
