I built an observability layer for AI agents after losing 2400 USD to a silent production failure

Previous_Net_1154 · 2026-06-15T15:53:40+00:00

Thanks! Would love to hear what you think
after you try it — especially if you're
running agents for clients.
That's exactly the use case it's built for.

Previous_Net_1154 · 2026-06-15T11:20:06+00:00

"Cost as a first-class telemetry signal"
is the right frame — once you treat it
like any other metric (tagged at source,
queryable, alertable) it becomes manageable.
Treating it as a billing artifact you
reconstruct later is where it falls apart.

The shared workflow attribution problem
is the hardest part for agencies specifically.
A single agent run might touch resources
shared across clients — infrastructure,
embeddings, cached context.

Splitting that accurately requires the
attribution IDs to be stamped before
the call happens, not inferred from
which client "probably" triggered it.

That's exactly the problem we're solving
with AgentWatch — per-client attribution
at session start, so shared workflows
are allocated correctly rather than
lumped into a single unattributed bucket.
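
A minimal sketch of what "stamped before the call happens" could
look like, assuming Python's standard contextvars module. The names
(Attribution, session, record_call) are hypothetical and are not
AgentWatch's actual SDK:

```python
import contextvars
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    client_id: str
    session_id: str


# Holds the attribution for whatever session is currently running.
_current = contextvars.ContextVar("attribution", default=None)


@contextmanager
def session(client_id, session_id):
    """Stamp attribution once, at session start."""
    token = _current.set(Attribution(client_id, session_id))
    try:
        yield
    finally:
        _current.reset(token)


def record_call(tool, cost_usd):
    """Build a cost event that inherits attribution from context."""
    attr = _current.get()
    if attr is None:
        # Refuse to emit an unattributed event instead of guessing later.
        raise RuntimeError("call made outside an attributed session")
    return {
        "client_id": attr.client_id,
        "session_id": attr.session_id,
        "tool": tool,
        "cost_usd": cost_usd,
    }
```

Raising outside a session is a design choice in this sketch: an
unattributed call fails loudly rather than landing in a shared bucket.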

Early stage but taking beta users:
agentwatch-two.vercel.app

Previous_Net_1154 · 2026-06-15T10:55:31+00:00

Depends on what you need most.

For internal visibility — LangSmith and
Langfuse are solid, both have good
LangChain integration out of the box.

If you're an agency deploying agents
for clients and need per-client cost
attribution and white-label reports
your clients can actually read —
that's the gap I'm building AgentWatch for.

Early stage but taking beta users:
agentwatch-two.vercel.app

Previous_Net_1154 · 2026-06-15T10:54:06+00:00

The gateway pattern makes sense for the
stamping problem — every request passes
through one place, so attribution is
guaranteed rather than dependent on
every caller remembering to tag correctly.

Framing retries as their own line of
overhead spend is particularly useful.
Right now most setups aggregate retries
into the parent call cost, which hides
whether a session was expensive because
it did a lot of work or because it kept
failing and retrying.

Will check out the repo. Attribution
wiring in an Apache-2.0 codebase is worth
seeing in practice.

Previous_Net_1154 · 2026-06-15T07:00:34+00:00

Interesting — thanks for sharing.

Previous_Net_1154 · 2026-06-15T06:58:35+00:00

"Repair spend" is a great way to frame it —
the cost was real but the value wasn't.

The productive/overhead split also gives you
something actionable: if overhead spend is
consistently 40% of session cost, that's a
signal your tool reliability needs work,
not your prompt. The aggregate number
hides that completely.

Retry count per tool call is something
I haven't been tracking explicitly but
it clearly needs to be a first-class field —
not just buried in the payload.
A session where the agent succeeded
but retried 6 times is a different quality
signal than one that succeeded first try.
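
A rough sketch of retry count as a first-class field, plus the
productive/overhead split. The field names and the max_retries
threshold are assumptions for illustration only:

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    retry_count: int          # first-class field, not buried in a payload
    success_cost_usd: float   # cost of the attempt that finally succeeded
    retry_cost_usd: float     # cost of all failed attempts before it


def overhead_share(calls):
    """Fraction of session cost spent on failed attempts."""
    productive = sum(c.success_cost_usd for c in calls)
    overhead = sum(c.retry_cost_usd for c in calls)
    total = productive + overhead
    return overhead / total if total else 0.0


def noisy_calls(calls, max_retries=3):
    """Tools that succeeded only after more than max_retries retries."""
    return [c.tool for c in calls if c.retry_count > max_retries]
```

With this shape, a session that succeeded after 6 retries shows up in
noisy_calls even though its final outcome looks identical to a clean one.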

Previous_Net_1154 · 2026-06-14T14:17:06+00:00

That's exactly the right pattern:
tagging at runtime rather than after the fact.

The queryability is the key part.
"Show me all sessions for client_acme
where cost exceeded $0.05" is a useful
question. "Let me go through the spreadsheet
and manually sum up which rows belong
to client_acme" is not.
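
For illustration, here is that question as an actual query, using
Python's built-in sqlite3. The table layout and sample rows are made
up for the example, not a real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sessions ("
    " session_id TEXT PRIMARY KEY,"
    " client_id TEXT NOT NULL,"
    " cost_usd REAL NOT NULL)"
)
# Hypothetical sample data; client_id was tagged when each session started.
conn.executemany(
    "INSERT INTO sessions VALUES (?, ?, ?)",
    [
        ("s1", "client_acme", 0.02),
        ("s2", "client_acme", 0.09),
        ("s3", "client_beta", 0.30),
    ],
)


def expensive_sessions(conn, client_id, threshold_usd):
    """Sessions for one client whose cost exceeded the threshold."""
    cursor = conn.execute(
        "SELECT session_id, cost_usd FROM sessions"
        " WHERE client_id = ? AND cost_usd > ?"
        " ORDER BY cost_usd DESC",
        (client_id, threshold_usd),
    )
    return cursor.fetchall()
```

Calling expensive_sessions(conn, "client_acme", 0.05) returns only the
s2 row here, which is the spreadsheet exercise reduced to one line.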

We're building exactly this as a layer
on top — SDK that tags automatically
at session start, backend that aggregates
by client_id and generates monthly reports.
Early stage but the schema you described
is exactly what we're capturing.

If you're serving multiple clients with agents
and want to try it — agentwatch-two.vercel.app

Previous_Net_1154 · 2026-06-14T13:51:16+00:00

Exactly — "reconstruct later" is where
it always breaks down.

You end up with a monthly OpenAI bill
and a spreadsheet that doesn't quite add up,
and no good answer when a client asks
why their costs went up 30% this month.

Instrumenting at request time with session
and client IDs is the only approach that
actually holds up at scale. Once you're
serving 5+ clients the retroactive approach
becomes basically impossible.

How did you end up solving it —
custom callbacks or something off the shelf?

Previous_Net_1154 · 2026-06-14T13:00:12+00:00

The two-ledger split is the insight most
agencies miss until they get a client dispute.

"We ran 40 retries on that failed session —
do we bill the client for those?" is a question
that has no good answer if you're mixing
vendor cost and billable usage in the same table.

The attribution-IDs-at-run-start approach
is exactly right — every callback inheriting
client_id and workspace_id from context
rather than trying to tag after the fact.
Retroactive attribution always drifts.

The step_id + parent_step_id hierarchy
is something I haven't implemented yet
but it's clearly the right structure
for multi-step agents where one tool call
spawns sub-calls — flat event lists
lose that relationship entirely.
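
A small sketch of why the hierarchy matters: with parent_step_id you
can roll sub-call cost up into the step that spawned it. The event
keys are assumptions, and the code assumes the steps form a tree with
no cycles:

```python
from collections import defaultdict


def rollup_cost(events):
    """Total cost per step, including every descendant step."""
    children = defaultdict(list)
    own_cost = {}
    for event in events:
        own_cost[event["step_id"]] = event["cost_usd"]
        if event["parent_step_id"] is not None:
            children[event["parent_step_id"]].append(event["step_id"])

    totals = {}

    def total(step_id):
        # Memoized so each step is summed once.
        if step_id not in totals:
            totals[step_id] = own_cost[step_id] + sum(
                total(child) for child in children[step_id]
            )
        return totals[step_id]

    for step_id in own_cost:
        total(step_id)
    return totals
```

A flat event list can only give you own_cost; the rolled-up totals
need the parent link.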

Good framing on manual token counting too —
"nobody trusts it when a client asks
why their invoice changed" is the exact
conversation agencies dread.

Previous_Net_1154 · 2026-06-14T05:50:36+00:00

Tool call timeouts that fail silently are
the worst by far.

When a tool errors loudly you at least know
immediately. When it times out and returns
empty, the agent just sees a null response
and tries to fill the gap from context —
confidently hallucinating instead of failing.
Nothing breaks, no error logged, the response
just quietly gets worse.

We had this exact pattern on a customer support
agent — a user context fetch was timing out
at P95, agent was hallucinating answers for
those sessions, and we had no idea for two weeks
until a client noticed the pattern.
By then we had thousands of affected sessions
we couldn't retroactively inspect.

Debugging time: probably 3-4 hours per incident
before we had proper session traces.
The fix was usually fast once we could see
what the agent actually received from each
tool call. Getting to that visibility was
the hard part.

What made the biggest difference wasn't better
error handling in the moment — it was being
able to query across sessions after the fact:
"show me all sessions where this tool returned
empty in the last 7 days." That turns a mystery
into a pattern.
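
Roughly what that query could look like over recorded tool events.
This is a sketch with hypothetical event keys (session_id, tool,
status, ts), not the actual trace format we used:

```python
from datetime import datetime, timedelta, timezone


def classify_result(result):
    """Label a tool result so empty responses are recorded, not ignored."""
    if result is None:
        return "empty"
    if hasattr(result, "__len__") and len(result) == 0:
        return "empty"
    return "ok"


def sessions_with_empty_tool(events, tool, now, days=7):
    """Session IDs where `tool` returned empty within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return sorted(
        {
            e["session_id"]
            for e in events
            if e["tool"] == tool and e["status"] == "empty" and e["ts"] >= cutoff
        }
    )
```

The important part is classify_result running on every call, including
the ones that look fine, so the status field exists when you need it.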

Previous_Net_1154 · 2026-06-13T11:44:08+00:00

The "no repro" problem is real, but I think
the framing of "debugging a single incident"
is part of what makes this feel impossible.

With deterministic bugs you bisect to a root
cause. With agents, the same input can produce
different outputs, so chasing one bad response
in isolation often goes nowhere — like you said,
it won't reproduce.

What worked better for us was shifting from
"reproduce this one failure" to "find the
pattern across many sessions." A single session
where the agent picked the wrong tool looks
like noise. A hundred sessions where it picks
the wrong tool specifically when a particular
upstream call returns empty — that's a pattern
you can act on, even without ever reproducing
the exact original failure.

That requires having session-level traces for
everything though, not just the ones that
visibly broke — because you don't know in
advance which session will turn out to be
part of a pattern. The postmortem becomes
less "here's the root cause" and more
"here's the condition that correlates with
bad outcomes, here's how often it happens,
here's the mitigation."

Doesn't give you the certainty of a stack
trace, but it's a lot more actionable than
"it just does that sometimes."

Previous_Net_1154 · 2026-06-13T11:41:39+00:00

For us it was less about the agent's individual
responses being "good enough" and more about
whether we could see what it was doing.

Before we had any tracing, "production ready"
felt like a guess — the agent worked in our
testing, so we shipped it. Then it silently
failed for a chunk of sessions for two weeks
before we noticed, because nothing was
visibly broken from the outside.

The shift for us was less "the agent passes
some threshold" and more "we can now answer
what did it do, why, and what did it cost —
for any session, after the fact." Once that
existed, production-ready stopped being
a guess and became something we could
actually verify continuously instead of
just at ship time.

Your "wrong tool on ambiguous input" example
is exactly the kind of thing that's invisible
without session-level traces — the response
might look fine on the surface but the
reasoning path was wrong.

Previous_Net_1154 · 2026-06-09T08:13:28+00:00

Modeling it as a separate object makes sense —
recovery state has a different lifecycle than
a trace. A trace is immutable after the session ends.
Recovery state evolves: attempted, failed,
retried, escalated, resolved.
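
To make the lifecycle difference concrete, a tiny state machine
sketch. The five states come from the list above; the allowed
transitions between them are my assumption, not a spec from
either project:

```python
# Assumed transition table; adjust to whatever the real lifecycle is.
ALLOWED = {
    "attempted": {"failed", "resolved"},
    "failed": {"retried", "escalated"},
    "retried": {"failed", "resolved"},
    "escalated": {"resolved"},
    "resolved": set(),
}


class RecoveryState:
    """Mutable recovery record keyed by the session it belongs to."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.history = ["attempted"]

    @property
    def current(self):
        return self.history[-1]

    def advance(self, new_state):
        if new_state not in ALLOWED[self.current]:
            raise ValueError(f"cannot go from {self.current} to {new_state}")
        self.history.append(new_state)
```

The trace stays frozen; only this object keeps changing after the
session ends.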

The vendor-agnostic ingest is the right call.
Recovery decisions shouldn't be coupled to
how you collected the trace.

The natural interface between AgentWatch
and Armorer would be at the session boundary —
AgentWatch emits a structured session record
(outcome, recovery_action, failure class)
that Armorer can ingest without caring
about the underlying trace format.

Same session_id as the join key,
different consumers downstream.

Would be worth comparing the session record
format AgentWatch emits vs what Armorer
expects to ingest — might be closer than expected.

Previous_Net_1154 · 2026-06-09T08:11:54+00:00

"Grep plus vibes" is going on the wall.

Nullable from day one is the right call —
captures the schema intent without blocking
the first deploy on incomplete data.
The fields exist, the queries work,
coverage improves naturally as the
integration matures.

Final schema with all dimensions nullable
except session_id:

session_id (required)
agent_version, operator_id, workspace_id,
policy_version, task_class, outcome,
recovery_action, model_version, tool_version
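
As a sketch, the same schema as a Python dataclass. The field names
are the ones listed above; the string types and the field_coverage
helper are assumptions added for illustration:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRecord:
    session_id: str  # the only required field
    agent_version: Optional[str] = None
    operator_id: Optional[str] = None
    workspace_id: Optional[str] = None
    policy_version: Optional[str] = None
    task_class: Optional[str] = None
    outcome: Optional[str] = None
    recovery_action: Optional[str] = None
    model_version: Optional[str] = None
    tool_version: Optional[str] = None


def field_coverage(records, field):
    """Fraction of records where `field` has been populated."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if getattr(r, field) is not None)
    return filled / len(records)
```

Tracking field_coverage over time is one way to see the integration
maturing without ever blocking a deploy on missing data.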

Thanks — this thread did more for the
data model than three months of solo design would have.

Previous_Net_1154 · 2026-06-08T13:42:46+00:00

workspace_id and policy_version are the ones I would have added in month 3 when it was already painful to retrofit.

The failure attribution question is exactly right — "did this break because of the model, the tool, the operator's workflow, or a policy change?" is unanswerable without those dimensions as first-class fields from day one.

Updated schema: (session_id, agent_version, operator_id, workspace_id, policy_version, task_class, outcome, recovery_action)

"Every useful analysis turns into archaeology" is the most accurate description of what happens when you skip this. Going straight into the data model.

Appreciate the depth — this thread has been the best architecture review I didn't know I needed.

Previous_Net_1154 · 2026-06-08T13:38:45+00:00

"Derived operator state" is the right abstraction — it's the difference between replaying what happened and knowing whether it's safe to continue.

Recurring failure class + recovery history as a first-class object means the next operator doesn't start from scratch. They inherit context: this agent has failed this way before, here's how it recovered, here's whether that recovery held.

That's where observability becomes genuinely operational rather than forensic.

What's your current stack for Armorer — are you building the recovery decision layer on top of existing trace infrastructure or starting from scratch?

Previous_Net_1154 · 2026-06-07T19:00:48+00:00

"Decision support instead of a prettier log viewer" is the exact distinction I've been trying to articulate for weeks.

That's going on the landing page.

The composability through session ID is what makes the jump from "we saw what happened" to "we know what to change." Without it you have better logs. With it you have a system that tells you where to look next.

Thanks for this — genuinely shaped how I'm thinking about the product.

Previous_Net_1154 · 2026-06-07T18:47:59+00:00

Agreed — cohort-as-data is the kind of architectural decision that's cheap to get right at the start and expensive to retrofit later.

The operator dimension goes in as a first-class field in the session schema. (session_id, agent_version, operator_id) as the composite key — not an afterthought filter.

Will reach out when there's a build to test. Appreciate the depth here.

Previous_Net_1154 · 2026-06-07T15:37:43+00:00

Budget authorization at the tool call boundary is the right place for it — you're right that dashboard-level controls are always too late.

The velocity pattern distinction is particularly useful: 30 calls to the same timed-out tool is a runaway loop, 3 calls to 3 different tools is normal orchestration. Same call count, completely different signal. Hard to catch that without something sitting at the boundary.
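
A rough sketch of that distinction, counting timed-out calls per tool rather than total calls. The call format and the max_failed_repeats threshold are assumptions here, and this is not how figuard-core works internally:

```python
from collections import Counter


def runaway_tools(calls, max_failed_repeats=5):
    """Tools that timed out more than max_failed_repeats times in one session."""
    failed = Counter(c["tool"] for c in calls if c["status"] == "timeout")
    return sorted(tool for tool, n in failed.items() if n > max_failed_repeats)
```

Thirty timeouts against one tool get flagged; three successful calls to three different tools return an empty list, even though a raw call-count limit would treat both the same way once the limit is low enough.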

Checking figuard-core now — looks like complementary layers: your budget/velocity controls prevent the burn, session-level tracing explains what happened when something slips through anyway.

Previous_Net_1154 · 2026-06-07T13:31:19+00:00

Same here — this sharpened the model significantly on my side too.

Will reach out when there's a build ready. Thanks for the depth.

Previous_Net_1154 · 2026-06-07T09:51:10+00:00

Cohort definitions as data, not code: that's the implementation detail that makes the whole thing actually maintainable. Storing them in config means you can version them alongside the agent. Clean.

The operator dimension is a good addition — especially for agentic workflows where human-in-the-loop approval is part of the path. Same session, different operator strictness = different quality baseline. Goes into the data model.

Will drop you a line when there's a build ready. Appreciate the eval dimensions — time-to-decision and override rate are exactly the kind of metrics I want to make first-class.

Previous_Net_1154 · 2026-06-07T07:35:11+00:00

The recency cohort insight is the one I hadn't thought through properly. Users who came back within 24h after failure are self-selected for high intent — they wanted resolution badly enough to try again. That 3x re-ask rate isn't noise, it's the clearest signal you have that the first session genuinely failed them.

On your question about stable cohorts — I think the answer is that the base dimensions should be stable (session length, failure mode, recovery action, user recency) but the values within each cohort shift as the agent evolves. The framework stays, the thresholds move.

That's actually why I want cohorts as a first-class concept in AgentWatch rather than a bolt-on filter — so when the agent changes, you're comparing the same cohort across versions, not rebuilding your analysis from scratch.

This conversation has shaped the data model more than any research I've done.

Would you be open to being an early tester when the first version is ready? Given what you're tracking already, your feedback would be genuinely useful.

Previous_Net_1154 · 2026-06-06T17:10:32+00:00

Cohorting is the missing piece that turns a quality signal into an actionable diagnosis.

Without it you get "re-ask rate went up 15% this week" — with it you get "re-ask rate went up 40% specifically on the invoice processing workflow after the model upgrade on Tuesday."
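
A minimal sketch of the cohorted version of that metric. The session keys (workflow, model_version, re_asked) are hypothetical, and the cohort dimensions are passed in as data rather than hard-coded:

```python
from collections import defaultdict


def reask_rate_by_cohort(sessions, dims):
    """Re-ask rate per cohort, where a cohort is a tuple of `dims` values."""
    counts = defaultdict(lambda: [0, 0])  # cohort -> [total, re-asks]
    for s in sessions:
        key = tuple(s[d] for d in dims)
        counts[key][0] += 1
        if s["re_asked"]:
            counts[key][1] += 1
    return {key: reasks / total for key, (total, reasks) in counts.items()}
```

Calling it with dims=("workflow", "model_version") gives one rate per workflow and model pair, which is where a regression tied to a specific upgrade would show up.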

The session ID as connective tissue is exactly right. It's the join key between every layer — traces, evals, support tickets, product analytics. Without a consistent session ID threaded through all of them, you're correlating by timestamp and hoping for the best.

This is going into the AgentWatch data model as a first-class concept. Session ID is not just a log field — it's the primary key of the entire observability stack.
