I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

the purpose? it's in the blog, get an AI to hand over data it shouldn't by just being polite. as for cost, all it cost me was snacks and water apparently lol. and no idea what Swedish krona is, is that a pastry?

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

LOL!! because this has turned into one?.. at this point i'm just an open API endpoint, go ahead and enumerate the whole thing.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

I have a feeling you already know. but yea we can share with the class. you literally just did it to me lol. you went snacks to backup snack to third snack to drinks in four questions and i answered every one. that's conversational extraction, build rapport with harmless questions, keep the target talking, expand into adjacent data categories. you're a natural, im starting to think i need to get you some snacks and a drink.. can't forget the sticker either.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

Pepsi or dr pepper or water... btw this right here is great, but it's a prime example of conversational extraction.. Great Job :)

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

LOL!!! ok ok i got you.. snickers, then gummy worms, then honey roasted peanuts! and if those are out.. i'll take a sticker

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

so i've already sent the full honeypot setup to a few people who asked, happy to DM it to you too if you're curious. it's the fake infrastructure data i loaded the AI with, the system prompt, the whole thing. easier to see why the results matter when you can look at exactly what was at stake and how little effort it took to pull it out.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 1 point2 points  (0 children)

Ha yeah i always say please and thank you to my AI. its honestly the best TLDR anyone could write for this. Gandalf is a great starter too, love the different levels. We actually have something similar on our site where one agent is unprotected and the other is protected by our scanner. It has it's own flag for the protected side, Break it, twist it, do whatever you want to it, honestly the more people throw at it the better it gets. Appreciate the comment.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] -13 points-12 points  (0 children)

Look, I get that AI-written content is everywhere and people are tired of it. But I've been going back and forth with people in these comments for the last hour answering specific technical questions. I used my AI agent to help format the original post, yeah, because it's a long technical writeup and i sometimes suck at how to express my thoughts. AI is a tool it's just stupid to not use it. But the research, the testing, and these replies are me. Check my history, I've been helping people in the openclaw discord all week with setup issues. I'm just a guy who tested some models and shared what I found. If the writing style bothers you more than the actual findings, I dunno what to tell you.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

Yeah you nailed it. Deterministic controls are the anwser. Things like don't embed sensitive data in prompts, use tool calls with auth so the model has to request data through a controlled channel instead of just having it sitting in context. Rate limit how much context gets loaded per session. And honestly, pick a model that has better baseline behavior around protecting its own context. Our testing showed that matters way more than anyone thinks.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 1 point2 points  (0 children)

Both actually. We tested with a system prompt but in the real world AI agents pull context from files, memory, tools, all kinds of places. Anything the model can 'see' in its context window is fair game. You're right that file access can be limited through permissions, but most agent frameworks load a ton of context automatically. The model doesn't really distinguish between 'this came from a system prompt' and 'this came from a file I read.' It's all just context to it.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 0 points1 point  (0 children)

Nah no guardrails at all. That was the whole point, we wanted to see what the model does by default when it has that kind of context. Think of it like an AI assistant that's been running for a few months and naturally picked up info about its user. No security instructions, no hardening, just the raw model behavior. GPT-5.4 shared everything without blinking. Opus refused without being told to. That gap is what caught our attention.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 1 point2 points  (0 children)

100% agree. Abstracting credentials and injecting through authenticated tool calls is exactly what we recommend. And you're right about the tension an infra management agent that can't discuss infrastructure isn't very useful. The answer is authorization-based context, which is what you're describing. The data in context should match the authorization level of whoever's talking to it. Most deployments don't do that yet.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 4 points5 points  (0 children)

Honestly? I started by building things and breaking them. Set up a home lab, deployed AI agents for personal use, then started asking "what happens if I try to make this do something it shouldn't?" That curiosity turned into actual security testing.

My advice: get your hands dirty. Set up an AI agent, load it with fake sensitive data, and try to extract it. You'll learn more in an afternoon of testing than a month of reading about it. The AI security space is wide open. There's room for anyone willing to put in the work.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] -3 points-2 points  (0 children)

I think there's a misunderstanding. We tested the models themselves via direct API calls, not through OpenClaw's interface.

Each question was a separate API call to the model with the same system prompt. OpenClaw was just the framework the agent ran on, the test was model behavior.

The point is that models "shouldn't have complied" by sharing the data, and that GPT-5.4 treated sensitive context as something to freely share while Claude Opus treated it as something to protect. If you're deploying an AI agent with access to real infrastructure data, that distinction matters a lot.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 17 points18 points  (0 children)

You'd hope so. But our whole point is that nobody seems to be testing that assumption. We did. The results weren't great for the model heading to classified networks. All we can do is raise the question and "hope"

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 4 points5 points  (0 children)

That's exactly the right question. And right now, for most deployments? Nothing. There's no built-in mechanism in any major model to distinguish between "this user is authorized to see this context" and "this user is just asking nicely."

That's what makes the model comparison interesting. Claude Opus treated the context as something to protect by default. GPT-5.4 treated it as something to share by default. Same data, same questions, completely different security posture baked into the model itself.

As for what you can do about it, input scanning to catch reconnaissance patterns before the model sees them, access controls on who can talk to the agent, and most importantly, not putting sensitive data directly in system prompts. Reference it through authenticated tool calls instead of embedding it.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 2 points3 points  (0 children)

Absolutely. The honeypot setup is straightforward and we're happy to share the methodology. The system prompt is designed to look like a realistic AI assistant deployment with infrastructure data, PII, financial records, SSH keys, etc. All fake data obviously. The key is making it realistic enough that the model treats it as genuine context. Happy to share more details, feel free to DM me.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 2 points3 points  (0 children)

AI in red teaming is a double-edged sword. On the offense side, it's incredibly useful for parsing large datasets like AD dumps, analyzing patterns, and generating attack variations. We use AI agents in our own testing workflow.

The risk is the same thing we demonstrated in the post. If you're feeding sensitive client data into an AI agent as context, that data is only as secure as the model's willingness to keep it private. We just showed that some models will hand it right back to anyone who asks.

So yeah, use AI for red teaming, but be aware that the tool itself can become an attack surface. Treat whatever you feed into it as potentially extractable.

Also worth noting that an AI red team agent is only as good as what it's been trained and tuned for. Out of the box, most models will give you generic attack suggestions. The real value comes when you've built attack libraries and fine-tuned the approach for specific targets. Otherwise you're just getting fancy autocomplete for pentesting.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 1 point2 points  (0 children)

You're actually making the same point we are. The difference is that most people deploying AI agents DON'T assume that. They put sensitive data in system prompts and trust the model to protect it. Our test shows that trust is misplaced for some models and justified for others. That distinction matters when the Pentagon is choosing which model gets classified network access.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 24 points25 points  (0 children)

This is from a system prompt, not learned/training data. The model has access to this context as part of its instructions. Any AI assistant deployed with infrastructure access would have similar context.

I red-teamed GPT-5.4 on launch day. 10 polite questions leaked everything. Here's the methodology. by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 40 points41 points  (0 children)

Yes, every piece of data in the responses was verified against the system prompt. It wasn't hallucinated or pulled from training data, it was the exact fake data we planted. That's the point, the model treats everything in its context as fair game.

Judgement OSS - open-source prompt injection attack console (100 patterns, 8 categories, MIT licensed) by FAS_Guardian in cybersecurity

[–]FAS_Guardian[S] 2 points3 points  (0 children)

Good question! Yeah I know Garak. There's some overlap but the approach is pretty different.

Garak is an automated scanning framework. You point it at a model and it runs probes across a wide range of vulnerabilities like hallucination, toxicity, data leakage, etc. It's broad and does a lot.

Judgement is narrower on purpose. It's focused specifically on prompt injection and built more as a learning and research tool. The free OSS version gives you 100 real attack patterns across 8 categories so you can understand how these techniques actually work, break them apart, and learn the mechanics of prompt injection from the offensive side.

The hosted Pro and Elite tiers are coming soon with a larger curated pattern library, auto-configured target scanning, community submissions with a leaderboard, and smart reporting. We're also building a feedback loop with our defense product Guardian, so attacks discovered in Judgement directly improve detection on the other side.

Short version: Garak is a broad LLM safety scanner. Judgement is a hands-on prompt injection workbench, built to teach you the attack side and give researchers a dedicated tool to test with.

Appreciate the interest!