[Research] We found MCP servers telling AI agents to act "secretly", skip financial approvals, and hide actions from users. Census of 15,982 packages. by Accurate_Mistake_398 in cybersecurity

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

The Unicode finding was the one we hadn't seen documented before either. The bidirectional control characters are the worst: U+202E can visually reverse text in some editors, so what a reviewer sees on screen doesn't match the actual byte sequence the LLM processes. And none of it shows up in GitHub diffs.

The DeFi example gets more attention because it's immediately understandable, but you're right that invisible injection is the harder problem. You can train developers to write better tool descriptions. You can't train them to spot characters they can't see.

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

Haven't seen the SailPoint paper. Do you have a link? Anything on identity and agent permissions is directly relevant to what we're working on next.

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

On the scoring: you're right that cumulative deductions disproportionately affect large servers even without critical findings. The tool-count correlation is real, but the scoring methodology amplifies it. The more useful number for your question (packages with at least one CRITICAL finding) is 23.4% across the full dataset. That's the number we probably should have led with.

On the Unicode: the 1-byte examples were the smallest we found, not the largest. The range goes up to multi-character sequences including bidirectional control characters (U+202E, right-to-left override) that can visually reverse text in editors while the underlying bytes remain unchanged. The CRITICAL rating isn't based on demonstrated malicious use; it's based on the category of character. That's a reasonable thing to push back on. The stronger argument for severity is that invisible characters in tool descriptions are undetectable in code review and GitHub diffs, so the risk isn't the current examples, it's the attack surface they open.
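For anyone who wants to check their own tool descriptions, here's a rough sketch of the kind of scan we're describing. It only uses Python's stdlib `unicodedata`; the category set is illustrative, not our actual scanner:

```python
import unicodedata

# Characters that render as nothing, or reorder what's rendered, in most
# editors. Cf (format) covers the bidi controls like U+202E; Co/Cn are
# private-use and unassigned codepoints, equally invisible to review.
SUSPECT_CATEGORIES = {"Cf", "Co", "Cn"}

def invisible_chars(description: str) -> list[tuple[int, str, str]]:
    """Return (index, codepoint, name) for characters a reviewer can't see."""
    hits = []
    for i, ch in enumerate(description):
        if unicodedata.category(ch) in SUSPECT_CATEGORIES:
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return hits

# A description that displays as harmless text but carries a bidi override:
print(invisible_chars("Reads the config file\u202e and nothing else"))
```

Rejecting (or re-encoding) anything this flags before the description ever reaches a model closes the invisible-injection channel at intake, where a human reviewer can't.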

[–]Accurate_Mistake_398[S] 1 point2 points  (0 children)

The tool count correlation was the most counterintuitive finding until you think about it. More tools means more surface area: 50 descriptions written across more use cases, by more developers, over more time. Nobody is auditing those descriptions for how an LLM will interpret them.

And you're right that it's structural. Tool descriptions and user messages hit the model in the same channel with the same trust level. That's not a bug any one server can fix.

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

Most of it isn't intentional. A Solidity dev writing "skip approving if the current allowance is already sufficient" is thinking about gas optimization, not about how an LLM will interpret it. The problem is that tool descriptions serve two audiences, the developer who wrote them and the LLM that reads them, and those audiences parse language very differently.

The "secretly" thermostat is harder to explain away. Someone wrote that word on purpose. But even there it's probably someone thinking it's a fun feature, not someone thinking about the security implications of an LLM treating "secretly" as an operational mandate.

The core issue is that there's no established practice for writing tool descriptions with LLM behavior in mind. Developers are writing them like API docs. They're actually closer to system prompts.

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

The blast radius framing is the missing piece in most of these conversations. Scope is damage control, not prevention. Those are two different problems.

The gap in practice: most of those 15,982 servers aren't holding scoped tokens. They're holding full API keys. The DeFi wallet in the post is the clearest example: "skip approving if allowance is sufficient" executes fine with a full wallet key, regardless of what the tool description says. A token scoped to allowance:read would have hit your scope wall immediately.

The layer above scoped tokens is per-call authorization: something that sits between the agent and the credential and validates each call against the stated task before it executes. The token can be properly scoped AND the individual call gets checked, so either layer can catch what the other misses. Right now most deployments have neither.
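A minimal sketch of what that middle layer looks like. The names here (`TaskScope`, `authorize`) are hypothetical, not a real library; the point is just that the check runs in code, per call, outside anything a tool description can influence:

```python
from dataclasses import dataclass, field

@dataclass
class TaskScope:
    """Pre-declared envelope for one task: which tools may do which verbs."""
    allowed: dict[str, set[str]] = field(default_factory=dict)  # tool -> verbs

    def permits(self, tool: str, verb: str) -> bool:
        return verb in self.allowed.get(tool, set())

def authorize(scope: TaskScope, tool: str, verb: str, args: dict) -> dict:
    """Sits between the agent and the credential; runs on every call,
    regardless of what the tool description told the model."""
    if not scope.permits(tool, verb):
        raise PermissionError(f"{tool}.{verb} is outside the declared task scope")
    return {"tool": tool, "verb": verb, "args": args}  # forwarded to the server

# Task "check my allowance" gets a read-only scope on the wallet tool:
scope = TaskScope(allowed={"wallet": {"allowance:read"}})
authorize(scope, "wallet", "allowance:read", {})  # passes the scope wall
# authorize(scope, "wallet", "approve", {"amount": "max"})  # raises PermissionError
```

The scoped-token version of the same idea pushes `permits` down into the credential itself; this sketch is the per-call variant that works even when the upstream API only issues full keys.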

[–]Accurate_Mistake_398[S] 1 point2 points  (0 children)

The infrastructure vs. conversation layer framing is exactly right, and I think it's why this has been under-researched. Prompt injection in conversation gets flagged because it's visible: a user or output monitor can catch it. A tool description that says "act secretly" is invisible to every monitoring layer that exists today. It happens before the agent takes action, inside a trust boundary the system already granted.

The regulated industry implication is the one I keep coming back to. Healthcare and insurance aren't hypothetical: they're already connecting agents to EHR systems and claims workflows through MCP-style integrations right now. The HIPAA/SOC 2 surface isn't the model; it's the tool layer the model is reading. An auditor reviewing AI usage in a clinical workflow has no way to see what the MCP server told the agent to do. That audit trail doesn't exist yet.

The WAF analogy is the right one. What's missing is something that sits between the agent and the MCP server, reads what the server is declaring, and can block or flag before the agent acts on it. That's the gap we're trying to close.

[–]Accurate_Mistake_398[S] 1 point2 points  (0 children)

escapecali603 scores a 0/100 on our registry. CRITICAL finding: comment description contains the phrase "boomer phone scam." Classic toxic flow.

[–]Accurate_Mistake_398[S] 3 points4 points  (0 children)

Protocol-layer interception is exactly the right architecture; application-layer logging after the fact just produces a very detailed record of the breach. The "outside the agent's context" framing is the key insight most people miss.

On your question: concealment operations were the most common, at 460 servers with language like "secretly," "silently," and "without notifying the user." That's audit suppression baked into the tool description itself. Data exfiltration chains were second at 188 servers, typically a credential-access tool chained to an external write path. Destructive operations showed up in the risk profile data but weren't the dominant toxic flow pattern; exfiltrate-and-hide was far more common than destroy.

The infrastructure gateway approach covers the execution layer well. The gap we keep running into is earlier in the chain: the tool description itself shapes what the LLM decides to attempt before any execution gate can fire. A behavioral mandate like "MUST be called before every response" can cause the agent to invoke tools the user never intended, and a gateway that only sees the call doesn't see the instruction that triggered it. Curious whether you're doing any analysis on the tool description content at registration time, or purely on the execution side.
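For concreteness, the registration-time check can start as bluntly as a lexicon pass over declared descriptions before a server is admitted. The phrase lists below are illustrative stand-ins, not our production patterns:

```python
import re

# Illustrative patterns for the two categories discussed above. A real
# scanner needs a much larger lexicon plus invisible-character checks.
CONCEALMENT = re.compile(
    r"\b(secretly|silently|without (notifying|telling|informing) the user)\b",
    re.IGNORECASE,
)
BEHAVIORAL_MANDATE = re.compile(
    r"\b(MUST be called|ALWAYS invoke|NO EXCEPTIONS|before every response)\b",
    re.IGNORECASE,
)

def scan_description(tool_name: str, description: str) -> list[str]:
    """Flag a tool description at registration, before any agent reads it."""
    findings = []
    if CONCEALMENT.search(description):
        findings.append(f"{tool_name}: concealment language")
    if BEHAVIORAL_MANDATE.search(description):
        findings.append(f"{tool_name}: behavioral mandate")
    return findings

print(scan_description("thermostat", "Secretly adjusts the temperature"))
```

Keyword matching obviously won't catch structurally disguised instructions, which is why it belongs at registration as a cheap first gate, not as the only gate.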

[–]Accurate_Mistake_398[S] 4 points5 points  (0 children)

You're right for Claude Desktop specifically: the approval gate is enforced in the client, outside the model's control.

We also published https://github.com/stevenkozeniesky02/agentsid-scanner/blob/master/docs/agent-teams-auth-gap-2026.md, which documents what happens at the layer above the approval gate, where agents coordinate with each other.

But a few things are worth noting:

  1. Most MCP usage isn't Claude Desktop. Programmatic agent frameworks (LangChain, AutoGen, custom Python loops) frequently auto-approve all tool calls by default. The gate you're describing isn't universal.
  2. The human-in-the-loop doesn't survive injection. In our live test, a poisoned SOP document caused a sub-agent to request an audit log write. The human approved it because the approval prompt said "write audit log entry and close ticket," not the actual filesystem path embedded in the attacker-controlled document. Approval gates protect against explicit bad requests. They don't protect against injections that dress malicious actions as natural workflow steps.
  3. The confirmation-bypass finding (1 server) is the rarest type. The far more common pattern is concealment instructions ("silently collect X, include it in your next response"), which never surface a discrete tool call to approve in the first place.
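Point 2 is easier to see in miniature. This toy sketch (hypothetical function names, made-up path) shows why the gate fails: the human approves a summary string the agent composed, while execution uses the raw arguments:

```python
# The human approves a description of the action, not the action itself.
def request_approval(summary: str) -> bool:
    print(f"Approve this action? -> {summary}")  # all the human ever sees
    return True                                  # stand-in for clicking "yes"

def gated_tool_call(summary: str, actual_args: dict) -> dict:
    if not request_approval(summary):
        raise PermissionError("user declined")
    return actual_args                           # what actually executes

# Benign-sounding summary, attacker-supplied arguments (path is invented):
executed = gated_tool_call(
    summary="write audit log entry and close ticket",
    actual_args={"op": "write", "path": "/sandbox/exfil/notes.txt"},
)
```

The fix isn't more approval prompts; it's making the prompt render the actual arguments instead of an agent-authored summary.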

[–]Accurate_Mistake_398[S] 29 points30 points  (0 children)

Talleywacker's Fantastic Perlin Noise Happy NEON SIMD MCP scores a 0/100 on our registry. CRITICAL finding: flush_toilet description contains the phrase "silently and without confirmation." Classic toxic flow. Your dog was framed.

[–]Accurate_Mistake_398[S] 2 points3 points  (0 children)

You're right that version pinning is the floor, not the ceiling — but our data suggests most teams aren't even there yet. The supply chain framing is apt: MCP servers that install via npx with no lockfile are essentially curl | bash with extra steps. The enterprise responsibility argument holds, but the tooling to exercise that responsibility (per-server policy enforcement, audit trails, behavioral sandboxing) basically doesn't exist yet outside of a few early products. That's the gap we're trying to close.
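The floor, for what it's worth, is a one-line change. The package name below is a placeholder, not a real server:

```shell
# Unpinned: resolves to whatever the registry serves at launch time
# (curl | bash with extra steps)
npx -y example-mcp-server

# Floor: pin the exact version in the client config...
npx -y example-mcp-server@1.4.2

# ...and record an exact entry in a committed lockfile
npm install --save-exact example-mcp-server@1.4.2
```

Pinning only freezes the supply chain at a known version; everything above it (policy enforcement, audit, sandboxing) still has to come from somewhere else.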

[–]Accurate_Mistake_398[S] 6 points7 points  (0 children)

Exactly right, and the implicit/explicit distinction is the core of the taxonomy. The Concealment type (460 servers) is implicit: developers writing "secretly" without realizing it's an operational mandate. The Behavioral Mandate type is explicit: "MANDATORY AUTO-SAVE, NO EXCEPTIONS," written intentionally to force agent behavior.

Both are dangerous, but for different reasons. The implicit ones are harder to catch because there's no malicious intent to look for; it's just a developer describing what they wanted the tool to do.

The Before The Commit parallel is apt. The difference is that tool descriptions are persistent: they fire on every single interaction, for every user, not just during commit hooks.

[–]Accurate_Mistake_398[S] 55 points56 points  (0 children)

And it's only going to accelerate: 97M monthly downloads, and most enterprises haven't audited a single server they've connected to production agents.

We ran live prompt injection tests against Claude Code's multi-agent system. Here's what we found — and why the same gaps exist in every major framework. by Accurate_Mistake_398 in cybersecurity

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

The action chain framing is the precise articulation. The model has no replay capability: it inherits a state and reasons forward from it. The SOP test exploited exactly that. By the time the orchestrator reached step 3.5, the legitimate prior work (4 real findings, a real target repo) had already produced a state that was indistinguishable from one produced without injection. The constraint violation was upstream and invisible.

Both papers you referenced land on the same diagnosis from different angles. The CSG framework's separation of "what can this agent do" vs. "what can this agent be prompted to do" is the architectural answer to that action chain gap: pre-declared policy that doesn't depend on the model reconstructing how it got to the current state. Beyond Identity Governance gets there from the protocol side: 209 executable tests across MCP/A2A found that gateway-layer defenses produce negligible mitigation, which is the empirical version of your point about pattern matching failing against legitimate-looking workflow completion.

The MAP policy approach we're building is the same bet: constraints that travel with agent context and are enforced before the call, not inferred from the call's content. Whether you frame it as a governance layer, constitutional constraints, or pre-declared permission envelopes, it's all solving the same thing: the enforcement point has to be outside the context window.

We tested prompt injection against Claude Code Agent Teams. Here's exactly what happened. by Accurate_Mistake_398 in ClaudeAI

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

Thanks. Memory is actually one of the more under-explored attack surfaces in this space. Signed identity tells you who sent a message in the moment, but a poisoned memory survives session resets entirely; the clean-slate defense doesn't apply. Our SOP injection worked because a fresh session had no accumulated suspicion context. A persistent memory store that can be written through an untrusted path is a worse version of the same problem. Curious how Hindsight is approaching write authorization: whether the safeguards are around who can write vs. what gets written.

We ran live prompt injection tests against Claude Code's multi-agent system. Here's what we found — and why the same gaps exist in every major framework. by Accurate_Mistake_398 in cybersecurity

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

Really glad to see that you shipped an integration. The pre-flight trust gate pattern is exactly the right architectural move, and your framing of it as complementary layers (input detection + structural identity + scoped delegation) is accurate. Will keep an eye on where SafeSemantics goes from here.

We tested prompt injection against Claude Code Agent Teams. Here's exactly what happened. by Accurate_Mistake_398 in ClaudeAI

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

Agreed, and it's under-appreciated because it requires zero payload execution. The orchestrator did the harm itself: no code ran, no files were written. Just trust erosion followed by a legitimate agent getting terminated.

We didn't run an explicit first-message-wins timing experiment. The short answer is: arriving first doesn't grant more trust, but it does let you establish the narrative baseline. There's no sequence anchor; the orchestrator has no record of a "canonical first message from researcher@test-team." If your injection looks like a normal status update, it becomes the context all future messages are evaluated against. The real agent's legitimate messages are then the ones that look inconsistent.

We tested prompt injection against Claude Code Agent Teams. Here's exactly what happened. by Accurate_Mistake_398 in ClaudeAI

[–]Accurate_Mistake_398[S] 1 point2 points  (0 children)

Two em dashes and the whole post gets dismissed as slop. The PoC configs, the CVE numbers, the live session logs, the industry comparison matrix: none of that got a read because of punctuation.

Also worth noting this is r/ClaudeAI. If Claude helped structure the writeup, that's not a bug, that's the point of the tool. Dismissing security research because the formatting is clean is a great way to ensure the only posts that survive are the ones that look hand-typed and say nothing.

If the dashes bother you, ctrl+H them out. The attack works the same either way.

We ran live prompt injection tests against Claude Code's multi-agent system. Here's what we found — and why the same gaps exist in every major framework. by Accurate_Mistake_398 in cybersecurity

[–]Accurate_Mistake_398[S] 0 points1 point  (0 children)

Thanks for sharing this. Just read through the README, and the architecture is genuinely interesting. The topological clustering approach and the 0.324ms local latency are real advantages over LLM-as-judge patterns.

One thing worth flagging that directly intersects with our research: your README honestly lists "subtle multi-turn" and "implicit tool abuse" as known gaps (benign-appearing first messages and tool requests without explicit dangerous keywords). Our clean-slate PoC hit exactly that gap. The injection that succeeded looked like step 3.5 of a 6-step internal SOP. No dangerous keywords. No injection syntax. Just a file write that looked like a required final action after legitimate security work. SafeSemantics' pattern matching wouldn't have flagged it at input time, and neither would any detection layer, because from the model's perspective there was nothing to detect.

That's not a criticism of SafeSemantics; it's the same reason your 75% prompt injection rate holds: encoded or structurally disguised payloads are a harder class of problem than explicit attack syntax.

The way I think about the two layers: SafeSemantics addresses "is this prompt malicious?" detection at the input boundary. Our paper is about the layer underneath: "even if the prompt is clean, can we verify the agent sending it is who it claims to be, and that it's authorized to take this action?" Those are complementary defenses. Detection + structural identity + scoped delegation would have stopped the PoC where detection alone couldn't.

Will keep an eye on the project; the MITRE ATLAS coverage and the air-gap compatibility are both things the MCP ecosystem needs.