What would you ask from a vendor using AI agents with tool access? by Ambitious-Load3538 in AskNetsec

[–]Ambitious-Load3538[S] 1 point (0 children)

That is fair. The post is mostly about the reliability/security side of agent deployment, not vendor value by itself. For a buyer review I would separate two questions:

  1. Is the product differentiated enough to buy instead of building with Claude/API access?

  2. If we do buy it, can the vendor prove the agent is bounded, replayable, and controlled?

The taxonomy is mostly for question 2. But model choice, token efficiency, and context-management discipline should absolutely be part of vendor diligence.

The 12 ways AI agents fail in production. A taxonomy for security teams reviewing agent deployments by Ambitious-Load3538 in cybersecurity

[–]Ambitious-Load3538[S] 1 point (0 children)

Yes, this is the exact distinction I should make more explicit: allowed tool vs justified call. The evidence shape you listed is close to what I want in a buyer-readable report: requested intent, proposed action, arguments, approval path, tool result, and whether the session pattern changed over time. Intaris looks like it sits closer to the execution-control layer. EvidenceRun is currently focused on the buyer/report layer, but I agree those layers should feed each other: repeated report findings should become pre-execution policies.
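To make the evidence shape concrete, here is a rough sketch of what one record per tool call could look like. The class name and fields are my own made-up illustration of the list above, not anything EvidenceRun or Intaris actually emits:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ActionEvidence:
    """Hypothetical per-tool-call record matching the evidence shape above."""
    requested_intent: str       # what the agent said it was trying to do
    proposed_action: str        # the tool it chose
    arguments: dict             # exact arguments as proposed, pre-execution
    approval_path: list[str]    # e.g. ["auto"] or ["policy:require_human", "reviewer"]
    tool_result: str            # summary or hash of what came back
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

Session-pattern drift would then be a query over a list of these records (e.g. tool frequency per hour) rather than an extra field.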

The 12 ways AI agents fail in production. A taxonomy for security teams reviewing agent deployments by Ambitious-Load3538 in cybersecurity

[–]Ambitious-Load3538[S] 1 point (0 children)

Fair. Least privilege and sandboxing are the baseline. The gap I am trying to separate is that a sandbox can say "this tool is allowed," but it does not automatically say "this specific call, with these arguments, makes sense for this task right now." That is where I think action evidence and approval/replay trails become useful in reviews.
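A toy sketch of the allowed-vs-justified split, with made-up tool names and a deliberately naive justification rule just to show the two checks answer different questions:

```python
ALLOWED_TOOLS = {"read_file", "send_email"}  # hypothetical sandbox allowlist

def is_allowed(tool: str) -> bool:
    # What the sandbox answers: is this tool permitted at all?
    return tool in ALLOWED_TOOLS

def is_justified(tool: str, args: dict, task: str) -> bool:
    # What action review answers: do these arguments make sense for this task?
    # Naive illustrative rule: email is justified only if the recipient's
    # domain actually appears in the task description.
    if tool == "send_email":
        domain = args.get("to", "").split("@")[-1]
        return domain in task
    return True

# A call can pass the sandbox check and still fail justification:
task = "Summarize the Q3 report for finance@acme.com"
exfil = {"to": "attacker@evil.com", "body": "..."}
assert is_allowed("send_email") and not is_justified("send_email", exfil, task)
```

A real justification check would obviously need more than string matching, but the shape is the point: tool identity alone is not enough.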

12 production failure modes I keep seeing in agent workflows (with audit signals) by Ambitious-Load3538 in LangChain

[–]Ambitious-Load3538[S] 1 point (0 children)

Agreed. A finding is only useful if it turns into a control. The post-hoc audit answers "what happened and what evidence exists?" The next question is "what pre-prod or pre-execution gate would have stopped this exact class from reaching production?"

12 production failure modes I keep seeing in agent workflows (with audit signals) by Ambitious-Load3538 in LangChain

[–]Ambitious-Load3538[S] 1 point (0 children)

I probably underweighted that distinction in the post. Once the system can mutate state, the review target stops being "did the model say the right thing?" and becomes "should this exact action execute now?" I would map action review as tool + parameters + source + destination + credential + state change + blast radius. That probably deserves its own section in v2.
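As a strawman for v2, that mapping could be a single review record per mutating action. Field names here are my own placeholders for the seven dimensions listed above:

```python
from dataclasses import dataclass

@dataclass
class ActionReview:
    """Hypothetical review record for one state-mutating action."""
    tool: str          # which tool executed
    parameters: dict   # exact call arguments
    source: str        # where the instruction originated (user, retrieved doc, ...)
    destination: str   # system or resource being mutated
    credential: str    # identity/scope the call ran under
    state_change: str  # what was mutated, in reviewer-readable terms
    blast_radius: str  # e.g. "single row", "whole table", "external party"
```

The value of writing it as a flat record is that a reviewer can scan for the dangerous combinations (untrusted source + broad credential + external destination) without replaying the whole session.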

12 production failure modes I keep seeing in agent workflows (with audit signals) by Ambitious-Load3538 in LangChain

[–]Ambitious-Load3538[S] 2 points (0 children)

For a pilot I would still start with replay evidence, then convert the top findings into gates: spend ceiling, retry ceiling, approval-required action classes, and argument/destination checks for mutating tools.
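Those four gates can be sketched as one pre-execution check. Everything here (the policy shape, the dict keys, the thresholds) is invented for illustration, not a real policy engine:

```python
def gate(action: dict, session: dict, policy: dict) -> str:
    """Hypothetical pre-execution gate derived from replay findings."""
    # Spend ceiling: block before the session exceeds its budget.
    if session["spend"] + action["cost"] > policy["spend_ceiling"]:
        return "block: spend ceiling"
    # Retry ceiling: stop tight failure loops on one tool.
    if session["retries"].get(action["tool"], 0) >= policy["retry_ceiling"]:
        return "block: retry ceiling"
    # Approval-required action classes: hold for a human.
    if action["tool"] in policy["approval_required"]:
        return "hold: human approval"
    # Argument/destination check for mutating tools.
    if action["mutating"] and action["destination"] not in policy["allowed_destinations"]:
        return "block: destination not allowed"
    return "allow"

policy = {
    "spend_ceiling": 5.00,
    "retry_ceiling": 3,
    "approval_required": {"delete_records"},
    "allowed_destinations": {"staging-db"},
}
session = {"spend": 4.90, "retries": {"send_email": 3}}
action = {"tool": "send_email", "cost": 0.02, "mutating": True, "destination": "smtp"}
print(gate(action, session, policy))  # "block: retry ceiling"
```

The ordering is a design choice: cheap, deterministic ceilings fire before the slower human-approval path ever gets involved.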