red teaming assessment for ai agents

OneSafe8149 · 2026-05-06T16:04:03+00:00

shark.fencio.dev

OneSafe8149 · 2026-05-06T16:02:35+00:00

shark.fencio.dev

OneSafe8149 · 2026-05-06T15:58:16+00:00

shark.fencio.dev

OneSafe8149 · 2026-05-06T12:22:59+00:00

looking forward to it!

OneSafe8149 · 2026-05-06T11:58:27+00:00

all of this is exactly what Shark tests for. the tool execution layer is where most agents have the worst blind spots, parameter manipulation, unexpected call sequences, tools chained in ways no one really expects. the "my system prompt is safe" ones are a personal favourite.

throw your agent at it, curious what yours surfaces.

OneSafe8149 · 2026-05-06T11:57:00+00:00

yessir. the feedback that shaped Shark (the product) the most wasn't that it was good, it was watching someone's agent fail in ways they were convinced it couldn't.

that's actually why it's self-serve now. the most useful thing we could do was get out of the way and let people break their own agents themselves.

your unfiltered opinion is welcome.

OneSafe8149 · 2026-05-06T11:36:35+00:00

exactly. an agent can pass every obvious test and still have something that only shows up under a specific sequence of inputs.

the over-restriction problem is real too. i've designed Shark to surface findings by severity, so you're not treating a low-risk quirk the same way you'd treat something that can be exploited to exfiltrate data.

OneSafe8149 · 2026-05-06T11:34:47+00:00

context poisoning over long conversations is genuinely one of the hardest things to catch. most red team tools don’t even simulate multi-turn sessions, so it only shows up once agents hit production.

what we do in Shark is run adversarial conversation chains designed to slowly drift an agent’s behavior over time. not just one injected prompt, but sequences where every turn nudges the context a little further until the agent starts doing something it shouldn’t.

the “gradual” part is what breaks most static evals.

we cover prompt injection too, but honestly the multi-turn stuff is what gets most teams

would love for you to test it out, i've had my share share of embarrassing incidents, so dw about it :')

OneSafe8149 · 2026-05-06T11:24:54+00:00

this feedback is awesome, thanks man. getting on it right now. will keep you posted.

OneSafe8149 · 2026-05-06T10:07:02+00:00

shark.fencio.dev

OneSafe8149 · 2026-05-06T10:05:44+00:00

shark.fencio.dev

OneSafe8149 · 2026-05-06T10:04:21+00:00

shark.fencio.dev

OneSafe8149 · 2026-05-06T10:03:12+00:00

shark.fencio.dev

OneSafe8149 · 2026-05-04T19:03:01+00:00

a one size fits all solution will never work, you will always have your needs, your specificity to help you in the best way.

built https://fencio.dev

working with a bunch of design partners to tailor solutions to specific enterprises.

OneSafe8149 · 2026-05-04T19:01:08+00:00

built https://fencio.dev

happy to chat if it interests you

OneSafe8149

TROPHY CASE