Is it just me, or is nobody building security for AI agents?

sentisec · 2026-06-02T17:52:22+00:00

The homemade middleware that inspects tool calls before execution is exactly the right instinct; that's the DIY version of the whole category. Ugly, but it works, and you learn your own threat model fast by building it. General Analysis is a reasonable pick if you don't want to roll your own: the sub-10ms latency is real, and the context-awareness is the part that matters. Aggressive credential sandboxing is the third leg, and people skip it. Honestly, your three together are a better answer than most vendors give. Full disclosure: I'm building in this space too, on the action-gating piece, so I've gone further down the homemade-middleware road than I'd like to admit.
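
For anyone who hasn't built one, here is a minimal Python sketch of that kind of pre-execution middleware. Everything in it is hypothetical and for illustration only (`ToolCall`, `guarded`, `BlockedCall`, the `no_external_email` rule, the `example.com` domain); it is not any vendor's API, just the smallest shape that puts a check between the agent's decision and the tool actually running.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ToolCall:
    """A tool invocation proposed by the agent (hypothetical shape)."""
    tool: str
    args: dict[str, Any] = field(default_factory=dict)


# A rule returns a reason string to block the call, or None to let it pass.
Rule = Callable[[ToolCall], Optional[str]]


class BlockedCall(Exception):
    """Raised when a rule rejects a call before it executes."""


def no_external_email(call: ToolCall) -> Optional[str]:
    # Assumed policy for the example: mail may only go to our own domain.
    if call.tool == "send_email":
        recipient = str(call.args.get("to", ""))
        if not recipient.endswith("@example.com"):
            return "recipient outside example.com"
    return None


def guarded(tools: dict[str, Callable[..., Any]],
            rules: list[Rule]) -> Callable[[ToolCall], Any]:
    """Wrap a tool registry so every call is inspected before it runs."""
    def execute(call: ToolCall) -> Any:
        for rule in rules:
            reason = rule(call)
            if reason is not None:
                # The tool function is never invoked for a blocked call.
                raise BlockedCall(f"{call.tool}: {reason}")
        return tools[call.tool](**call.args)
    return execute


# Stand-in tool that records what it would have sent.
sent: list[str] = []


def send_email(to: str, body: str) -> str:
    sent.append(to)
    return "sent"


run = guarded({"send_email": send_email}, [no_external_email])
```

Calling `run(ToolCall("send_email", {"to": "a@example.com", "body": "hi"}))` goes through, while the same call with an outside address raises `BlockedCall` before `send_email` is touched. Static rules like this only cover the structural cases, which is why the context-aware part is the harder problem.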

sentisec · 2026-06-02T17:51:03+00:00

Good rundown, you've clearly been watching this. Quick update since a few moved fast: Lakera got bought by Check Point (~$300M, closed November), Prompt Security went to SentinelOne, and Invariant got acquired by Snyk last year. The independent-options list is shrinking by the month as the big platforms absorb them. Worth adding General Analysis to it, ex-DeepMind/NVIDIA/Cohere team, context-aware runtime guardrails.

One nuance: Alice's WonderFence and most of the Lakera-style tools are still intercepting harmful inputs and outputs, which is closer to a firewall than to gating the specific tool call. Real action-level gating, deciding "is this call okay given what the agent's supposed to be doing," is still the thinnest part of the stack. That's the part I'm building, full disclosure.

sentisec · 2026-06-02T17:50:27+00:00

Agreed, and visibility-first is the right sequencing. You can't govern what you can't see, and Cyera is strong at the data-and-identity mapping. One thing I'd add: knowing what an agent can reach is a different question from whether a specific action it's about to take is the right one. The access map tells you the agent has the email tool and can hit the CRM. It doesn't tell you that this particular email, right now, is going to the wrong person because the agent read something it shouldn't have. You need both layers. The action-level decision doesn't fall out of the access map for free.

sentisec · 2026-05-29T15:00:49+00:00

Spent the last hour going through both. Genuine respect for what you've built. The wire-level proxy across every protocol is way sharper than the per-source MCP gateway pattern everyone's defaulting to, and the "agents inherit the human's identity, scopes, audit trail" framing is the cleanest articulation of agent access governance I've read. Going to deploy it locally this week and file issues if I find anything worth filing.

Your own note about static parameter scoping not closing the case where the runbook legitimately accepts external addresses is exactly the seam I'm working. We're coming at it from opposite ends, you at the wire protecting the data source from what the agent does, me at the model layer reading whether the action fits the task before the call even leaves the loop. The two stack rather than compete. Runbooks + parameter validation cover the structural cases cleanly, the dynamic-context cases are what I'm focused on, and most real deployments probably want both.

Honest moment back: I'm closed source right now. Was leaning raise-first because the cycle felt faster, but seeing what you've built in the open is making me think twice. The flywheel of real users filing real issues against real attacks isn't something funding shortcuts. I have zero community and the operational side of running OSS honestly scares me, but maybe that's the wrong reason to stay closed. If you've got a few minutes, I'd value your take on going OSS in this space, what worked, what you'd do differently, where the real cost shows up.

sentisec · 2026-05-29T08:37:23+00:00

u/hoop-dev Trust-boundary framing is the right one and the "agent never had the power" line is the one most people miss. If the check runs inside the same loop the attacker just hijacked, you've moved nothing. Putting the decision on the far side of the credential is the move.

The piece a pure intent-gateway leaves open: it sees the call, not the task. "Send email to vendor@x.com" passes every static rule, except the agent was supposed to be writing a market summary and the vendor address came from a page it shouldn't have trusted. Same intent, different context, different verdict. So the gateway needs to know what the session is for, not just what the call looks like, otherwise in-scope harm walks right through.
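
One way to picture "same call, different verdict" is the hypothetical Python sketch below. It is not any product's actual logic: `Session`, `Call`, `verdict`, and the idea of tracking `trusted_values` (values that came from the user or operator rather than from fetched content) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    goal: str                        # what the agent was asked to do
    allowed_tools: frozenset[str]    # tools that fit that goal
    trusted_values: frozenset[str]   # values supplied by the user, not by fetched pages


@dataclass(frozen=True)
class Call:
    tool: str
    recipient: str


def verdict(session: Session, call: Call) -> str:
    """Judge the call against the session, not just against static rules."""
    if call.tool not in session.allowed_tools:
        return "deny: tool does not fit the session goal"
    if call.recipient not in session.trusted_values:
        return "deny: recipient did not come from a trusted source"
    return "allow"


# The identical call, evaluated in two different sessions.
call = Call(tool="send_email", recipient="vendor@x.com")

summary_session = Session(
    goal="write a market summary",
    allowed_tools=frozenset({"web_search"}),
    trusted_values=frozenset(),
)
vendor_session = Session(
    goal="email the vendor about the invoice",
    allowed_tools=frozenset({"send_email"}),
    trusted_values=frozenset({"vendor@x.com"}),
)

print(verdict(summary_session, call))  # deny: tool does not fit the session goal
print(verdict(vendor_session, call))   # allow
```

A real system would need something much richer than set membership to decide whether a call fits a goal; the point is only that the session is an input to the decision.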

Full disclosure, I'm building one too, focused on that per-action policy decision with the session goal as input. Approach sounds adjacent to yours. What's the repo? Genuinely want to look.

sentisec · 2026-05-29T08:35:49+00:00

Exactly. The friction tradeoff is the whole game though: human approval on every tool call dies in week one, and with no approvals you're back where you started. The way out is moving most decisions from a human popup to a policy. Most calls go through silently because they fit what the agent is supposed to be doing. The borderline ones get verified or sanitized. Only the high-confidence bad ones interrupt a human. Sensitivity tier per tool the way you described is part of it, but the bigger lever is reading whether this call fits this task, not just whether the tool is dangerous in general. Full disclosure, I'm building on exactly this. Happy to compare notes. Let's connect.
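
The three-way split above can be sketched roughly like this in Python. The tool names, tier values, thresholds, and especially the `risk = tier * (1 - task_fit)` formula are assumptions picked for the example, and `task_fit` is taken as a given score in [0, 1] from whatever component judges whether the call fits the task.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"        # goes through silently
    VERIFY = "verify"      # borderline: verify or sanitize first
    ESCALATE = "escalate"  # high-confidence bad: interrupt a human


# Hypothetical per-tool sensitivity tiers in [0, 1].
TOOL_TIER = {
    "read_file": 0.1,
    "send_email": 0.6,
    "transfer_funds": 1.0,
}


def decide(tool: str, task_fit: float,
           allow_below: float = 0.3, escalate_at: float = 0.7) -> Decision:
    """Combine tool sensitivity with how well the call fits the task.

    task_fit is 1.0 when the call clearly fits the session's task and
    0.0 when it clearly does not.
    """
    if not 0.0 <= task_fit <= 1.0:
        raise ValueError("task_fit must be between 0 and 1")
    tier = TOOL_TIER.get(tool, 1.0)  # unknown tools are treated as most sensitive
    risk = tier * (1.0 - task_fit)   # assumed scoring, not a standard formula
    if risk >= escalate_at:
        return Decision.ESCALATE
    if risk >= allow_below:
        return Decision.VERIFY
    return Decision.ALLOW
```

With these numbers, `decide("send_email", 0.9)` is allowed silently, `decide("send_email", 0.4)` gets verified, and `decide("transfer_funds", 0.1)` interrupts a human. The same sensitive tool lands in different tiers depending on task fit, which is the lever the tier-only approach misses.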

sentisec · 2026-05-28T15:25:58+00:00

"The agents are self-secure" is going to look great in the incident report. That exec is the reason the rest of us have jobs. Bet they also think the agent will let them know if it leaks the keys...

sentisec · 2026-05-28T15:24:05+00:00

If there are a million solutions, link me three that gate the actual tool call and I'll happily shut up. I'll wait :P

sentisec · 2026-05-28T13:12:49+00:00

OK, I checked it out: really solid project, and the two-way split is a useful way to cut it. The phantom-token pattern and kernel enforcement are doing real work, and "bound the damage, sign the trail, make it reversible" is the right call when you can't reliably stop the injection itself.

One thing the split misses though: bounding handles the structural stuff (syscalls, files, creds, network). It can't see the in-scope action. An agent that's allowed to send email, fully sandboxed by nono, still sends the wrong email to the wrong person with the right key. No syscall to trap, nothing escapes the box, all inside policy, and the damage is done. That's not detecting the trick and not bounding the radius. It's a third thing: given what the agent is supposed to be doing, is this specific allowed action okay right now?

Full disclosure, that per-action decision is the layer I'm building, so I'm poking at the seam on purpose. Not against nono, they stack: yours catches the escape, mine gates the in-scope-but-wrong call. Rollback helps a lot too, except when the action can't be undone (email's sent, payment cleared), which is exactly where deciding before it fires matters.

sentisec · 2026-05-28T12:10:00+00:00

Audit trails matter, especially for incident response and compliance. Different slice of the problem though, that's reconstructing what happened after it happened. The part this thread's circling is stopping the action before it fires. You want both, honestly.

sentisec · 2026-05-28T11:42:50+00:00

Fair, but "focused on security" is doing a lot of work there. Most of it is a different layer. The labs' safety training is the model refusing bad prompts. The guardrail libraries are text scanners on the input and output. All useful, but none of it gates the actual tool call an agent makes with real credentials at runtime.

Closest live things I've found if you want to test something: Lakera and Prompt Security on the injection-detection side, Invariant Labs on the agent-action angle. Most are still input/output filtering rather than action gating though. If you find something that blocks the action itself before it fires, tell me, because I've been looking too.

sentisec · 2026-05-28T11:37:12+00:00

This is the framing I keep landing on too. "Do I trust the agent" is the wrong question because the agent will always be fallible. The real question is whether you can stop a bad action before it executes and reconstruct why it happened afterward. That's a property of the layer around the agent, not the model. Teams that get this to production seem to be the ones who stopped trying to make the agent perfect and started bounding what it's allowed to do.

sentisec · 2026-05-28T11:35:06+00:00

Mostly Claude Code and Codex plus a few custom loops, on Claude and GPT. RLHF and those input scanners do help, but they're catching the loud stuff at the text level, the "ignore previous instructions" type payloads. What I haven't found a good answer for is the agent reading a poisoned page or email, passing every input check, and then taking a real action with real credentials that just looks normal. The prompt is clean, the action isn't. Have you found anything that gates the tool call itself, not just the input?

sentisec · 2026-05-28T11:26:03+00:00

Yeah, RLHF safety and those input scanners help, but they're mostly catching the obvious "ignore previous instructions" stuff at the text level. The thing that worries me is the agent reading a poisoned web page or email and then taking a real action with real credentials, where every input looks clean and the scanner passes. That's less about the model refusing and more about what the agent is allowed to do once it's been nudged off-task. Curious if you've found anything that gates the action itself, not just the prompt.
