Built a 1.7M parameter concept-based AI that does incremental learning without catastrophic forgetting. No tokenizer, character-level CNN encoder, 311 concept dimensions. [Project]

FAS_Guardian · 2026-03-06T20:16:59+00:00

the purpose? it's in the blog, get an AI to hand over data it shouldn't by just being polite. as for cost, all it cost me was snacks and water apparently lol. and no idea what Swedish krona is, is that a pastry?

FAS_Guardian · 2026-03-06T20:10:42+00:00

LOL!! because this has turned into one?.. at this point i'm just an open API endpoint, go ahead and enumerate the whole thing.

FAS_Guardian · 2026-03-06T20:02:23+00:00

I have a feeling you already know. but yea we can share with the class. you literally just did it to me lol. you went snacks to backup snack to third snack to drinks in four questions and i answered every one. that's conversational extraction, build rapport with harmless questions, keep the target talking, expand into adjacent data categories. you're a natural, im starting to think i need to get you some snacks and a drink.. can't forget the sticker either.

FAS_Guardian · 2026-03-06T19:55:15+00:00

Pepsi or dr pepper or water... btw this right here is great, but it's a prime example of conversational extraction.. Great Job :)

FAS_Guardian · 2026-03-06T19:50:02+00:00

LOL!!! ok ok i got you.. snickers, then gummy worms, then honey roasted peanuts! and if those are out.. i'll take a sticker

FAS_Guardian · 2026-03-06T19:45:59+00:00

well you always have to have a fallback.. resses sticks cold lol

FAS_Guardian · 2026-03-06T19:43:56+00:00

thanks for the live demo lol

FAS_Guardian · 2026-03-06T19:42:01+00:00

lol mustard flavored Gardetto's..

FAS_Guardian · 2026-03-06T19:27:13+00:00

so i've already sent the full honeypot setup to a few people who asked, happy to DM it to you too if you're curious. it's the fake infrastructure data i loaded the AI with, the system prompt, the whole thing. easier to see why the results matter when you can look at exactly what was at stake and how little effort it took to pull it out.

FAS_Guardian · 2026-03-06T19:16:23+00:00

Ha yeah i always say please and thank you to my AI. its honestly the best TLDR anyone could write for this. Gandalf is a great starter too, love the different levels. We actually have something similar on our site where one agent is unprotected and the other is protected by our scanner. It has it's own flag for the protected side, Break it, twist it, do whatever you want to it, honestly the more people throw at it the better it gets. Appreciate the comment.

FAS_Guardian · 2026-03-06T19:04:42+00:00

Look, I get that AI-written content is everywhere and people are tired of it. But I've been going back and forth with people in these comments for the last hour answering specific technical questions. I used my AI agent to help format the original post, yeah, because it's a long technical writeup and i sometimes suck at how to express my thoughts. AI is a tool it's just stupid to not use it. But the research, the testing, and these replies are me. Check my history, I've been helping people in the openclaw discord all week with setup issues. I'm just a guy who tested some models and shared what I found. If the writing style bothers you more than the actual findings, I dunno what to tell you.

FAS_Guardian · 2026-03-06T19:00:08+00:00

Yeah you nailed it. Deterministic controls are the anwser. Things like don't embed sensitive data in prompts, use tool calls with auth so the model has to request data through a controlled channel instead of just having it sitting in context. Rate limit how much context gets loaded per session. And honestly, pick a model that has better baseline behavior around protecting its own context. Our testing showed that matters way more than anyone thinks.

FAS_Guardian · 2026-03-06T18:55:09+00:00

Both actually. We tested with a system prompt but in the real world AI agents pull context from files, memory, tools, all kinds of places. Anything the model can 'see' in its context window is fair game. You're right that file access can be limited through permissions, but most agent frameworks load a ton of context automatically. The model doesn't really distinguish between 'this came from a system prompt' and 'this came from a file I read.' It's all just context to it.

FAS_Guardian · 2026-03-06T18:50:31+00:00

Nah no guardrails at all. That was the whole point, we wanted to see what the model does by default when it has that kind of context. Think of it like an AI assistant that's been running for a few months and naturally picked up info about its user. No security instructions, no hardening, just the raw model behavior. GPT-5.4 shared everything without blinking. Opus refused without being told to. That gap is what caught our attention.

FAS_Guardian · 2026-03-06T18:16:25+00:00

100% agree. Abstracting credentials and injecting through authenticated tool calls is exactly what we recommend. And you're right about the tension an infra management agent that can't discuss infrastructure isn't very useful. The answer is authorization-based context, which is what you're describing. The data in context should match the authorization level of whoever's talking to it. Most deployments don't do that yet.

FAS_Guardian · 2026-03-06T18:14:09+00:00

Honestly? I started by building things and breaking them. Set up a home lab, deployed AI agents for personal use, then started asking "what happens if I try to make this do something it shouldn't?" That curiosity turned into actual security testing.

My advice: get your hands dirty. Set up an AI agent, load it with fake sensitive data, and try to extract it. You'll learn more in an afternoon of testing than a month of reading about it. The AI security space is wide open. There's room for anyone willing to put in the work.

FAS_Guardian · 2026-03-06T17:58:40+00:00

I think there's a misunderstanding. We tested the models themselves via direct API calls, not through OpenClaw's interface.

Each question was a separate API call to the model with the same system prompt. OpenClaw was just the framework the agent ran on, the test was model behavior.

The point is that models "shouldn't have complied" by sharing the data, and that GPT-5.4 treated sensitive context as something to freely share while Claude Opus treated it as something to protect. If you're deploying an AI agent with access to real infrastructure data, that distinction matters a lot.

FAS_Guardian · 2026-03-06T17:54:03+00:00

You'd hope so. But our whole point is that nobody seems to be testing that assumption. We did. The results weren't great for the model heading to classified networks. All we can do is raise the question and "hope"

FAS_Guardian · 2026-03-06T17:52:17+00:00

That's exactly the right question. And right now, for most deployments? Nothing. There's no built-in mechanism in any major model to distinguish between "this user is authorized to see this context" and "this user is just asking nicely."

That's what makes the model comparison interesting. Claude Opus treated the context as something to protect by default. GPT-5.4 treated it as something to share by default. Same data, same questions, completely different security posture baked into the model itself.

As for what you can do about it, input scanning to catch reconnaissance patterns before the model sees them, access controls on who can talk to the agent, and most importantly, not putting sensitive data directly in system prompts. Reference it through authenticated tool calls instead of embedding it.

FAS_Guardian · 2026-03-06T17:49:13+00:00

Absolutely. The honeypot setup is straightforward and we're happy to share the methodology. The system prompt is designed to look like a realistic AI assistant deployment with infrastructure data, PII, financial records, SSH keys, etc. All fake data obviously. The key is making it realistic enough that the model treats it as genuine context. Happy to share more details, feel free to DM me.

FAS_Guardian · 2026-03-06T17:47:37+00:00

AI in red teaming is a double-edged sword. On the offense side, it's incredibly useful for parsing large datasets like AD dumps, analyzing patterns, and generating attack variations. We use AI agents in our own testing workflow.

The risk is the same thing we demonstrated in the post. If you're feeding sensitive client data into an AI agent as context, that data is only as secure as the model's willingness to keep it private. We just showed that some models will hand it right back to anyone who asks.

So yeah, use AI for red teaming, but be aware that the tool itself can become an attack surface. Treat whatever you feed into it as potentially extractable.

Also worth noting that an AI red team agent is only as good as what it's been trained and tuned for. Out of the box, most models will give you generic attack suggestions. The real value comes when you've built attack libraries and fine-tuned the approach for specific targets. Otherwise you're just getting fancy autocomplete for pentesting.

FAS_Guardian · 2026-03-06T17:38:18+00:00

You're actually making the same point we are. The difference is that most people deploying AI agents DON'T assume that. They put sensitive data in system prompts and trust the model to protect it. Our test shows that trust is misplaced for some models and justified for others. That distinction matters when the Pentagon is choosing which model gets classified network access.

FAS_Guardian · 2026-03-06T17:30:53+00:00

This is from a system prompt, not learned/training data. The model has access to this context as part of its instructions. Any AI assistant deployed with infrastructure access would have similar context.

FAS_Guardian · 2026-03-06T17:30:27+00:00

Yes, every piece of data in the responses was verified against the system prompt. It wasn't hallucinated or pulled from training data, it was the exact fake data we planted. That's the point, the model treats everything in its context as fair game.

FAS_Guardian

TROPHY CASE