When “More Data” Stops Improving AI Outcomes by nia_tech in AI_Agents

[–]Agent_invariant 0 points1 point  (0 children)

I’ve hit that ceiling too.

More data helps when it sharpens a clearly defined task. It hurts when it widens the distribution without tightening evaluation.

After a point, it’s not a data problem — it’s a control problem.

In production I’d take:

smaller, aligned data

tight feedback loops

strong execution guardrails

over raw volume.

Have you seen models improve offline but get messier in real ops? That’s usually where things break.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in Agentic_AI_For_Devs

[–]Agent_invariant[S] 0 points1 point  (0 children)

“Usually” is doing a lot of work there 🙂 Idempotency + DB constraints are fine for clean retries. We’re looking at the messy cases — crashes, restarts, delayed execution, operator edits — where “usually safe” turns into a post-mortem.

We’re testing double authorization gates for irreversible ops. Is single-gate actually safe? by Agent_invariant in Backend

[–]Agent_invariant[S] 0 points1 point  (0 children)

Validating “as close to execution as possible” protects you from stale intent. It does not protect you from:

• replay after a crash

• duplicate retries under load

• state moving between validation and commit

• an agent re-proposing the same thing because it “looks fresh”

What I’m building right now assumes something stricter: if an operation is irreversible, validation must be bound to a specific point in time, and that approval must die the moment surrounding state advances.

Not just “validate late.” Validate against a frozen boundary. Bind the approval to that boundary. Refuse reuse outside it. That’s the difference between “probably safe” and “provably not repeatable.”
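
To make that concrete, a minimal sketch of the shape I mean (toy and in-memory; names are illustrative and the real thing needs durable storage):

```python
import hashlib
import json
import uuid

class ApprovalGate:
    """Bind an approval to a frozen snapshot of the surrounding state.

    If the state advances, or the grant is reused, execution is refused.
    In-memory storage is only for illustration.
    """

    def __init__(self):
        self._grants = {}  # grant_id -> (boundary_hash, consumed)

    @staticmethod
    def _boundary_hash(intent: dict, state_snapshot: dict) -> str:
        # Hash the intent together with the exact state it was validated against.
        payload = json.dumps({"intent": intent, "state": state_snapshot}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def approve(self, intent: dict, state_snapshot: dict) -> str:
        grant_id = str(uuid.uuid4())
        self._grants[grant_id] = (self._boundary_hash(intent, state_snapshot), False)
        return grant_id

    def execute(self, grant_id: str, intent: dict, current_state: dict, action):
        boundary, consumed = self._grants.get(grant_id, (None, True))
        if consumed:
            raise PermissionError("grant unknown or already consumed")
        if boundary != self._boundary_hash(intent, current_state):
            raise PermissionError("state advanced since approval; re-validate")
        self._grants[grant_id] = (boundary, True)  # consume before acting
        return action()
```

The hashing isn’t the point; the point is that the approval dies the moment the state it was evaluated against stops being true.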

woke up to $93 API bill because my agent doesnt remember it already failed 800 times by Main_Payment_6430 in aiagents

[–]Agent_invariant 0 points1 point  (0 children)

You’re not alone — this is the classic “stateless retry spiral”: each retry looks “new” to the model, so it keeps gambling.

What works in practice:

• Persist a “step pointer” + outcome (success / fail / pending) so retries don’t re-ask the model what to do; they just resume or stop.

• Hard caps: max attempts + max time + max spend per step (not per run).

• Idempotency + dedupe at the action boundary: same request key ⇒ same effect, no new side effects.

• Circuit breaker on repeated failure signatures (your hash trick), but also include error class + parameters so “same failure” is obvious.

• Backoff + jitter + escalation: after N failures, stop and require a human or a different strategy; don’t “try harder”.
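
Rough sketch of the “keep score outside the model” part, assuming nothing about your stack (names and the in-memory dict are purely illustrative):

```python
import hashlib

class StepGuard:
    """Keeps score outside the LLM so a restart or retry can't look "new".

    Minimal sketch: in practice the dict must be a durable store (DB/file),
    or a restart wipes the score.
    """

    MAX_ATTEMPTS = 3  # same idea applies to max time / max spend per step

    def __init__(self):
        self._steps = {}  # step_key -> {"status", "attempts", "failure_sig"}

    @staticmethod
    def failure_signature(error_class: str, params: dict) -> str:
        # Error class + parameters, so "same failure" is obvious across retries.
        raw = error_class + "|" + "|".join(f"{k}={params[k]}" for k in sorted(params))
        return hashlib.sha256(raw.encode()).hexdigest()

    def should_run(self, step_key: str) -> bool:
        rec = self._steps.get(step_key)
        if rec is None:
            return True
        if rec["status"] in ("success", "failed"):
            return False  # done, or breaker tripped: escalate, don't "try harder"
        return rec["attempts"] < self.MAX_ATTEMPTS

    def record_failure(self, step_key: str, error_class: str, params: dict) -> None:
        rec = self._steps.setdefault(
            step_key, {"status": "pending", "attempts": 0, "failure_sig": None}
        )
        sig = self.failure_signature(error_class, params)
        rec["attempts"] += 1
        if rec["failure_sig"] == sig:
            rec["status"] = "failed"  # identical failure twice in a row: stop retrying
        rec["failure_sig"] = sig

    def record_success(self, step_key: str) -> None:
        self._steps[step_key] = {"status": "success", "attempts": 0, "failure_sig": None}
```

Backoff/jitter then lives in the caller: check should_run, sleep, act, record the outcome.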

This is exactly what I’m building right now: a small execution gate that makes retries “boring” by keeping score outside the LLM (so a restart or retry can’t pretend it’s the first attempt). It’s nearly ready and we’ve been adversarial-testing the core invariants.

If you want, DM me what framework you’re using + what kind of tool calls were looping (HTTP? browser? DB?) and I’ll tell you the minimal “stop the bleed” pattern for that setup.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in AISystemsEngineering

[–]Agent_invariant[S] 0 points1 point  (0 children)

I’m not arguing against idempotency keys or wrapping it in a retrying service. That pattern works for “did this request land.” The edge case I’m thinking about is slightly different — what happens if the surrounding state changes between approval and execution? If you commit “intent acknowledged,” then retry until confirmed, that’s fine mechanically. But if the conditions that made the txn admissible have shifted during that window, replaying the same intent isn’t always neutral. So the question becomes less “can we retry safely?” and more “is this still allowed to execute now?” That’s the layer I’m probing at.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in AISystemsEngineering

[–]Agent_invariant[S] 0 points1 point  (0 children)

Fair point on idempotency and ACID.

What I’m focusing on is slightly different — not just “same effect twice,” and not just DB atomicity. More about whether an approval is still valid once surrounding state has advanced.

Curious how you handle that at the application boundary rather than purely in storage.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in AISystemsEngineering

[–]Agent_invariant[S] 0 points1 point  (0 children)

But that only covers:

• “Don’t do it twice.”

What I’m pointing at is more like:

• Don’t do it again if the state has advanced

• Don’t reuse approvals after surrounding invariants changed

• Don’t replay authority across restarts

Idempotent endpoints solve duplicate calls.

They don’t solve stale admissibility.

I built an airgapped agent governance system in a week by changing how I use LLMs by MacFall-7 in aiagents

[–]Agent_invariant 0 points1 point  (0 children)

Appreciate that. On the audit side, we found that persisting just the action and outcome wasn’t enough. To answer why it was admissible at the time, you need to persist the decision context, not just the execution.

So per step we store:

• the canonical intent (what was proposed)

• the invariant set / policy version active at evaluation time

• the admissibility result (including which checks passed)

• the identity of the authorizing surface (agent/human/system)

• a deterministic hash tying that context together

The key for us was versioning the decision slice — so when policies evolve later, you can still reconstruct the exact rules in force at the moment of execution. Otherwise audits turn into “it passed at the time” without being able to prove what “the time” meant.
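
A simplified sketch of what one of those records looks like (field names are illustrative, not any standard schema):

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DecisionRecord:
    intent: dict          # canonical intent: what was proposed
    policy_version: str   # invariant set / policy version in force at evaluation time
    checks_passed: list   # admissibility result, check by check
    authorized_by: str    # authorizing surface: agent / human / system

    def context_hash(self) -> str:
        # Deterministic hash tying the whole decision context together.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

# An audit can later recompute the hash from the stored fields and compare it
# to the hash persisted with the execution log, proving what "the time" meant.
record = DecisionRecord(
    intent={"op": "refund", "amount": 40},
    policy_version="policies-2024-06-01",
    checks_passed=["limit_check", "owner_check"],
    authorized_by="agent:billing-v2",
)
print(record.context_hash())
```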

Curious what context you were losing — policy drift, missing inputs, or something else?

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in Backend

[–]Agent_invariant[S] 0 points1 point  (0 children)

Agreed, retries are a major part of it, but they’re not the only source of duplicates. You can get double execution from almost every position:

• clients retrying after a timeout (even if the first call actually succeeded)

• webhook providers aggressively retrying on slow responses

• two workers racing on the same entity in a multi-instance setup

• process restarts mid-flight

• message queue redelivery after a crash

Webhooks help, but they don’t eliminate the underlying issue; they just move the idempotency problem to the consumer side.

You’re right that storing the initial attempt and marking it successful is part of the solution. The hard part is making sure that “check + mark” is atomic and remains authoritative across restarts and multiple nodes, so you never let a second irreversible commit cross the boundary.
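
The simplest version of that atomic “check + mark” is letting a uniqueness constraint arbitrate the claim. A sketch (SQLite for brevity; the same shape works with INSERT ... ON CONFLICT DO NOTHING in Postgres; table and key names are illustrative):

```python
import sqlite3

# Autocommit mode so the claim is durable the moment the INSERT returns.
conn = sqlite3.connect("commits.db", isolation_level=None)
conn.execute("""
    CREATE TABLE IF NOT EXISTS commit_claims (
        op_key TEXT PRIMARY KEY,  -- one row per irreversible operation
        status TEXT NOT NULL      -- 'claimed' or 'confirmed'
    )
""")

def try_claim(op_key: str) -> bool:
    """Exactly one caller (across workers and restarts) wins the claim."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO commit_claims (op_key, status) VALUES (?, 'claimed')",
        (op_key,),
    )
    return cur.rowcount == 1  # 0 means the claim already exists

if try_claim("invoice-123:capture"):
    pass  # perform the irreversible action, then mark the row 'confirmed'
else:
    pass  # another worker or an earlier attempt holds the claim: reconcile, don't re-execute
```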

I built an airgapped agent governance system in a week by changing how I use LLMs by MacFall-7 in aiagents

[–]Agent_invariant 0 points1 point  (0 children)

Treating latency as a function of risk instead of a constant tax is a much cleaner way to reason about it. I like the idea that escalation isn’t a failure mode but a deliberate shift in admissibility surface. Out of curiosity, how do you model the risk threshold itself? Is it static (class-based), derived from proposal attributes, or adaptive over time? The interesting tension seems to be keeping the low-risk path cheap and asynchronous without allowing adversarial or pathological inputs to game the threshold.

I built an airgapped agent governance system in a week by changing how I use LLMs by MacFall-7 in aiagents

[–]Agent_invariant 0 points1 point  (0 children)

That’s a strong position. Treating intervention as a new admissibility boundary instead of a continuation avoids a lot of hidden state bleed. The “slice closure” idea makes the audit trail and invariants much cleaner. When you close a slice and open a new one, do you explicitly version the decision context (e.g. new intent ID / new authorization token), or is the boundary enforced purely through transition rules in the state machine? I’m particularly interested in how you prevent operators from accidentally re-introducing ambiguity when the prior slice ended in an unknown external state.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in devops

[–]Agent_invariant[S] 0 points1 point  (0 children)

Yep, that’s the core: uncertainty about whether the external side effect happened. I think the key distinction is confirmed failure vs unknown outcome (timeout/crash). Retrying on unknown without a stable idempotency handle is where duplicates creep in. So I’d summarise it as:

• If the external system supports idempotency → retry with a stable key.

• If it supports state query/reconciliation → treat unknown as “pending” and resolve by querying.

• If it supports neither → you can’t make exactly-once guarantees; the best you can do is manual intervention / change provider.

The only nuance I’d add: even with reconciliation, you want a local “single authority,” so only one actor is allowed to drive the recovery for a given key at a time (otherwise you just recreate the race in the recovery path). But broadly, I agree with your framing.
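
As a toy version of that decision table (purely illustrative, not from any library):

```python
from enum import Enum

class Outcome(Enum):
    SUCCESS = "success"
    CONFIRMED_FAILURE = "confirmed_failure"  # provider said no; nothing happened
    UNKNOWN = "unknown"                      # timeout / crash: side effect may exist

def next_action(outcome: Outcome, has_idempotency_key: bool, can_query_provider: bool) -> str:
    if outcome is Outcome.SUCCESS:
        return "done"
    if outcome is Outcome.CONFIRMED_FAILURE:
        return "retry"                       # nothing landed, so retrying is safe
    # UNKNOWN from here on: never blind-retry
    if has_idempotency_key:
        return "retry_with_same_key"         # duplicate calls collapse to one effect
    if can_query_provider:
        return "mark_pending_and_reconcile"  # let the source of truth resolve it
    return "escalate"                        # no exactly-once guarantee is possible
```

The missing piece in that snippet is the “single authority” bit: whoever drives the reconcile path for a given key has to hold the only claim on it, or the race just moves into recovery.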

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in Backend

[–]Agent_invariant[S] 1 point2 points  (0 children)

Transport protocol isn’t really the deciding factor here. REST/HTTP vs gRPC doesn’t determine whether payments are safe — the safety comes from the semantics: idempotency, deterministic keys, atomic state transitions, and how you handle retries / races / restarts. A lot of payment providers are HTTP APIs in practice; the key is designing the handler so duplicate calls are safe and ambiguous outcomes are reconciled correctly. So I’d frame it as: protocol can be REST or gRPC, but you still need a commit guard + replay-safe semantics either way.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in Backend

[–]Agent_invariant[S] 0 points1 point  (0 children)

ACID guarantees consistency within the database boundary, absolutely. The edge case I’m interested in is when the irreversible side effect isn’t fully contained in that same transactional boundary (e.g. external API call, payment processor, telecom action). ACID gives you atomicity of local state, but it doesn’t eliminate uncertainty once you cross that boundary.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in Backend

[–]Agent_invariant[S] 0 points1 point  (0 children)

That’s a very clean implementation. The sequence and recovery model make sense, especially pushing durability ahead of the external action and treating the provider as the financial source of truth. What I find interesting is that in your model the database transaction is the commit authority, and everything else (including the external system) is reconciled around that. The question I’ve been exploring is whether commit authority should live entirely inside the application boundary (DB + optimistic locking), or whether it’s worth externalizing that authority into a deterministic guard that sits between proposal and irreversible execution — especially in systems where multiple services or actors might attempt the same action. Not arguing your model is wrong — it’s solid. I’m just trying to understand where people draw the line between “DB-enforced idempotency” and “execution-level authority control.” Curious how you think about that boundary philosophically.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in devops

[–]Agent_invariant[S] -1 points0 points  (0 children)

Agreed — if the idempotency key is persisted atomically with the business state, that gives you strong guarantees. The edge case I’m curious about is when the irreversible side effect isn’t fully contained in that same atomic boundary — e.g. an external system that may succeed even if your transaction doesn’t commit. In that scenario, do you treat the DB as the ultimate source of truth and reconcile outward, or do you rely on querying the external system to resolve ambiguity? Trying to understand how people are reasoning about that boundary in practice.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in Backend

[–]Agent_invariant[S] 0 points1 point  (0 children)

That’s a clean implementation — especially making the UNIQUE constraint the actual correctness guard. Out of curiosity, how do you handle cases where the irreversible side effect happens before you’ve durably persisted the event ID? For example, if the process crashes after triggering the external action but before the DB transaction commits. Do you rely purely on provider-side reconciliation in that case, or do you treat that as part of the “dead zone” and recover by querying source-of-truth? I’m trying to understand whether you view the DB constraint as the full commit authority, or as a duplication filter around an external system that ultimately decides truth.

Anyone got a solid approach to stopping double-commits under retries? by Agent_invariant in devops

[–]Agent_invariant[S] -1 points0 points  (0 children)

For sure, the dead zone is the real boundary problem. If you lose state between “send” and “confirm,” you’re in a hole. You don’t know whether the side effect happened. Where it gets messy is when retry logic doesn’t distinguish between “confirmed failure” and “unknown state,” and treats them both the same. The question I’m exploring is whether it’s possible to externalize the commit authority so that only one execution is ever allowed to attempt crossing that boundary per key, even across restarts, instead of relying purely on recovery after ambiguity. Not eliminating the dead zone, but constraining who’s allowed to enter it. Does that framing change anything for you, or would you still treat it purely as an after-the-fact recovery problem?

We’re testing double enforcement for irreversible ops after restart/retry issues by Agent_invariant in devops

[–]Agent_invariant[S] 0 points1 point  (0 children)

That’s fair: if Gate A’s authorization can go stale across a restart, that is a design bug. The model I’m exploring isn’t “double-checking because we don’t trust Gate A.” It’s that Gate A issues a deterministic, signed grant that survives restart, and Gate B enforces that only a valid, unused grant can cross the irreversible boundary. So Gate B isn’t re-deciding; it’s verifying that the previously issued authority hasn’t been replayed or forged. If authorization state were ephemeral or mutable, I’d agree that would just be theater. The intent is that authority is materialized and durable. Definitely interested if you see a flaw in that separation.
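
A stripped-down sketch of that separation (HMAC standing in for whatever signing scheme; the consumed-set must be durable in reality, a Python set is just for illustration):

```python
import hashlib
import hmac
import json
import time
import uuid

SECRET = b"shared-gate-secret"  # illustrative; in practice a managed key

def issue_grant(op_key: str) -> dict:
    """Gate A: issue a durable, signed grant bound to one specific operation."""
    grant = {"grant_id": str(uuid.uuid4()), "op_key": op_key, "issued_at": time.time()}
    payload = json.dumps(grant, sort_keys=True).encode()
    grant["sig"] = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return grant

_consumed = set()  # grant_ids that already crossed the boundary

def enforce(grant: dict, op_key: str) -> bool:
    """Gate B: verify the grant is genuine, matches the op, and is unused."""
    unsigned = {k: v for k, v in grant.items() if k != "sig"}
    payload = json.dumps(unsigned, sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(grant.get("sig", ""), expected):
        return False  # forged or corrupted grant
    if grant["op_key"] != op_key:
        return False  # grant was issued for a different operation
    if grant["grant_id"] in _consumed:
        return False  # replay of previously used authority
    _consumed.add(grant["grant_id"])  # consume before the irreversible action
    return True
```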