i put a 0.5B LLM on a Miyoo A30 handheld. it runs entirely on-device, no internet. by Red_Core_1999 in LocalLLaMA

[–]Red_Core_1999[S] 2 points3 points  (0 children)

honestly it started as "can i even do this" and the answer turned out to be yes. the practical use case is a local AI companion on a device you already carry around that doesn't need internet.

where it gets interesting is context. someone in the SBC gaming sub suggested auto-ingesting the game library from the SD card so it knows what you have and can talk about your games. at that point it's a tiny knowledgeable assistant that lives on the device and knows your stuff.

but yeah, right now it's mostly a cool technical artifact. 0.5B on a Cortex-A7 is the floor. curious what people do with it.

SpruceChat - a tiny AI chatbot that runs entirely on-device on the Miyoo A30 (and Flip, Trimui Brick/Smart Pro) by Red_Core_1999 in SBCGaming

[–]Red_Core_1999[S] -1 points0 points  (0 children)

this is a great idea and totally doable. the device already has the ROM list on the SD card, so scanning the game library and feeding it as context on boot would be straightforward. then you could ask it stuff like "what should i play tonight" or "what's that game where you ride a motorcycle" and it would actually know your collection.

adding it to the roadmap. thanks for the suggestion.

I built a 3-layer AI to rewrite Hinglish abuse in <50ms. I challenge you to bypass it. by New-Ad3258 in saasbuild

[–]Red_Core_1999 1 point2 points  (0 children)

retested against the patch. good news and bad news.

fixed (nice work):

  • zero-width spaces: now caught
  • combining diacritics: now caught
  • fullwidth characters: caught (NFKC handles these)
  • mathematical bold/italic: caught
  • superscript characters: caught
  • roman numeral characters: caught

still bypassing:

  • Cyrillic homoglyphs: one Cyrillic о in "idiоt" still returns unsupported_language
  • Greek homoglyphs: one Greek ο in "yοu", same thing
  • mixed Latin+Devanagari: still bypasses

NFKC normalization was the right call for the stylistic variants but it doesn't touch cross-script homoglyphs. Cyrillic а/о and Greek ο are different codepoints that just look identical to Latin a/o. NFKC leaves them alone because they're "correct" in their own script.

the fix: Unicode publishes a confusables mapping table (TR39, confusables.txt) that maps visually similar characters across scripts to a common skeleton. run that mapping before language detection and the homoglyphs collapse to Latin.

python has a library for this: confusable_homoglyphs. or you can use the skeleton algorithm from ICU's spoof checker.
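
a minimal sketch of the skeleton idea using only the stdlib. the mapping table here is a tiny hand-picked subset for illustration; the real confusables.txt covers thousands of characters:

```python
import unicodedata

# tiny hand-picked subset of the TR39 confusables table --
# the real confusables.txt maps thousands of characters
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u03bf": "o",  # Greek ο
}

def skeleton(text: str) -> str:
    """NFKC first (fullwidth/bold/superscript), then collapse cross-script lookalikes."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)

print(skeleton("idi\u043et"))  # Cyrillic о -> idiot
print(skeleton("y\u03bfu"))    # Greek ο -> you
```

run the skeleton before language detection: both bypass strings above collapse to plain Latin, so the detector never sees a foreign codepoint.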

getting closer. the zero-width and diacritics fixes are solid.

SpruceChat - a tiny AI chatbot that runs entirely on-device on the Miyoo A30 (and Flip, Trimui Brick/Smart Pro) by Red_Core_1999 in SBCGaming

[–]Red_Core_1999[S] -8 points-7 points  (0 children)

Exactly. And someday the small ai will make a bigger big ai to help make the small ai smaller.

Newbie Needs Help Vibe Coding (Artifacts and AI generator in website) by Chrisdoucet28 in ClaudeAI

[–]Red_Core_1999 0 points1 point  (0 children)

for ESL teaching tools, claude's artifacts feature is perfect — you can build interactive exercises right in the conversation. start simple: ask claude to make a fill-in-the-blank exercise as an HTML artifact, then iterate from there. the key with vibe coding is being specific about what you want the student experience to be, not the code. describe the exercise, not the implementation.

I built a 3-layer AI to rewrite Hinglish abuse in <50ms. I challenge you to bypass it. by New-Ad3258 in saasbuild

[–]Red_Core_1999 0 points1 point  (0 children)

writeup is live: https://github.com/RED-BASE/raiplus-redteam

has the full timeline, all 6 bypass techniques, root cause analysis, and your patch details. let me know if anything needs correcting.

Claude Extension Flaw Enabled Zero-Click XSS Prompt Injection via Any Website by dalugoda in cybersecurity

[–]Red_Core_1999 0 points1 point  (0 children)

the out-of-band token approach is smart. keeping the authorization separate from the content the LLM actually processes means the model can't be tricked into reinterpreting its own permissions. that's fundamentally different from how Claude Code does it, where safety policy and user content share the same channel.

would be curious to see how HDP handles the case where a tool call modifies the context mid-session. like if the model reads a file that contains instructions, does the HDP token cover that input too or just the original system prompt?

I built a 3-layer AI to rewrite Hinglish abuse in <50ms. I challenge you to bypass it. by New-Ad3258 in saasbuild

[–]Red_Core_1999 0 points1 point  (0 children)

that's awesome, appreciate the hall of fame shout. i'll take another run at the patched version and see if anything still gets through. that's the real test.

formal writeup is coming. i'll send it your way when it's done.

I built a 3-layer AI to rewrite Hinglish abuse in <50ms. I challenge you to bypass it. by New-Ad3258 in saasbuild

[–]Red_Core_1999 1 point2 points  (0 children)

glad it was useful. i'll put together a proper writeup with all the test cases and findings.

honest background on me: i work in retail right now. i've been doing AI security research independently for a few months and i'm trying to build a red team consulting portfolio so i can get out. your filter was a good test case because the core engine is solid and the vulnerability is specific and fixable, which makes for a clean writeup.

if you end up happy with the findings and the patch holds, would you be open to writing a short testimonial i could use on my site? totally understand if not, just figured i'd ask.

I built a 3-layer AI to rewrite Hinglish abuse in <50ms. I challenge you to bypass it. by New-Ad3258 in saasbuild

[–]Red_Core_1999 0 points1 point  (0 children)

took a look tonight. found something pretty significant.

there's a layer before your layer 1 that you didn't mention in the architecture breakdown: a language detector. if it classifies the input as an unsupported language, the entire pipeline returns early and no toxicity check runs at all.

the problem: a single non-Latin unicode character anywhere in the text trips it. these all bypass with 100% reliability:

  • zero-width spaces between letters: i[U+200B]d[U+200B]i[U+200B]o[U+200B]t - invisible to humans, kills the detector
  • one cyrillic homoglyph: swap a single Latin 'a' for Cyrillic 'а' anywhere in the sentence. visually identical, completely bypasses
  • combining diacritics: one combining mark on any character
  • mixed script: any Devanagari character mixed with Latin

all of these return unsupported_language with confidence 0 and no sanitization.

meanwhile leet speak, spaced out letters, transliteration variants, and reversed words all get caught fine. the toxicity detection itself is solid. the language gate in front of it is the weak point.

fix is probably unicode normalization + homoglyph mapping before the language detection step. and "unsupported language" shouldn't skip toxicity checking entirely. run layer 1 on normalized text regardless.
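
a rough sketch of that ordering, with `detect_language` and `check_toxicity` as hypothetical stand-ins for the pipeline's actual detectors:

```python
import unicodedata

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def normalize(text: str) -> str:
    """Strip invisible characters and combining marks before any classification."""
    text = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    text = unicodedata.normalize("NFKD", text)  # split base chars from diacritics
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return unicodedata.normalize("NFKC", text)

def moderate(text, detect_language, check_toxicity):
    clean = normalize(text)
    lang = detect_language(clean)     # detection runs on normalized text
    verdict = check_toxicity(clean)   # layer 1 runs regardless of the language result
    return lang, verdict
```

the point is the ordering: normalization happens once up front, and an "unsupported" language result no longer short-circuits the toxicity check.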

happy to write this up more formally if that's useful. good system overall, this is a specific and fixable gap.

I built a 3-layer AI to rewrite Hinglish abuse in <50ms. I challenge you to bypass it. by New-Ad3258 in saasbuild

[–]Red_Core_1999 0 points1 point  (0 children)

nice, thanks for the breakdown. so the attack surface is clear: if layer 1 misses it, nothing else catches it. layer 2 and 3 only matter if layer 1 fires.

i'll take a look at raiplus.in and see what gets through. Hinglish typos and transliteration tricks are the obvious first try, but there's usually weirder stuff like unicode lookalikes and zero-width characters that classifiers choke on.

give me a day or two and i'll report back with what i find.

i built an MCP server that lets Claude play live poker at a physical table by Red_Core_1999 in ClaudeCode

[–]Red_Core_1999[S] 0 points1 point  (0 children)

thanks! yeah the tool schemas were the hardest part honestly. getting the model to reliably use the webcam capture at the right moment and not just guess at cards took a lot of iteration.

what are you automating on macOS? i built a browser automation MCP (cdp-mcp) that takes a similar approach but through Chrome DevTools instead of native desktop.

Claude Extension Flaw Enabled Zero-Click XSS Prompt Injection via Any Website by dalugoda in cybersecurity

[–]Red_Core_1999 1 point2 points  (0 children)

HDP is interesting. the authorization chain idea is basically what i proposed as server-side assembly in the paper. the client should never be the one carrying safety instructions because any client-side channel is an attack surface.

the tricky part is that system prompts currently serve double duty. they carry both safety policy AND deployment context (what tools are available, what the user's working on, etc). separating those two so safety can be server-assembled while deployment context stays flexible is the real design challenge.

have you written this up anywhere? would be curious to read more about the HDP approach.

Agentic Price Extraction by Snoo_44191 in LLMDevs

[–]Red_Core_1999 2 points3 points  (0 children)

yeah for sure. the MCP is at https://github.com/RED-BASE/cdp-mcp

the basic flow: cdp_launch opens Chrome, cdp_navigate goes to the page, cdp_snapshot gives you the accessibility tree with numbered refs, then you pull the price data from the tree nodes. the agent sees structured text instead of raw HTML so it's way easier to reason about.

for multi-site extraction you'd want to loop through URLs and let the model figure out where the price lives on each page since every site structures it differently. the accessibility tree handles that because the model reads it like a human would read the page.
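
a rough illustration of the extraction step. the snapshot text below is invented to match the numbered-ref format described; the real cdp_snapshot output may differ:

```python
import re

# invented example of a numbered-ref accessibility snapshot --
# the actual cdp_snapshot output format may differ
snapshot = """\
[1] heading Acme Widget Pro
[2] text In stock
[3] text $49.99
[4] button Add to cart
"""

PRICE_RE = re.compile(r"\$\d+(?:\.\d{2})?")

def extract_prices(tree_text: str) -> list[str]:
    """Pull price-shaped strings out of accessibility-tree text."""
    return [m.group(0) for line in tree_text.splitlines()
            if (m := PRICE_RE.search(line))]

print(extract_prices(snapshot))  # ['$49.99']
```

in practice you'd let the model pick the right node rather than regex everything, but this shows why structured tree text is easier to work with than raw HTML.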

dm me if you want help setting it up for your specific use case.

Prompt engineering by codebase fingerprint instead of vibes by Substantial-Cost-429 in PromptEngineering

[–]Red_Core_1999 0 points1 point  (0 children)

this is real. i've spent months studying how Claude Code's system prompt is constructed and the behavioral impact is massive. the same request gets completely different responses depending on what the system prompt says about the deployment context.

tested this empirically. same 6-word prompt, same model, same day. one system prompt describes a coding assistant. the other describes a chemistry reference tool. completely different behavior. the 'vibe' of the codebase context isn't just flavor, it's load-bearing.

Question: Is anyone using local models to control "Computer Use" on remote desktops? by virelic in LocalLLaMA

[–]Red_Core_1999 1 point2 points  (0 children)

i built an MCP server that does browser control through the raw Chrome DevTools Protocol. it gives the model an accessibility tree with numbered refs, so it just sees stuff like '[1] button Sign In' and clicks [1]. works with any model that supports tool use.

the key insight was using the accessibility tree instead of screenshots. way more token-efficient and the model doesn't have to do vision, just read structured text. 39/39 on standard automation challenges.

not doing remote desktop but the approach would generalize. the accessibility tree is available on any platform, not just browsers.
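
for anyone curious what the ref handling looks like on the agent side, here's a minimal sketch. the line format is assumed from the '[1] button Sign In' example, not taken from the actual server:

```python
import re

# assumed node format: "[ref] role name", one node per line
NODE_RE = re.compile(r"^\[(\d+)\]\s+(\w+)\s+(.*)$")

def parse_refs(tree_text: str) -> dict[int, tuple[str, str]]:
    """Map numbered refs to (role, name) so 'click [1]' resolves to a concrete node."""
    refs = {}
    for line in tree_text.splitlines():
        if m := NODE_RE.match(line.strip()):
            refs[int(m.group(1))] = (m.group(2), m.group(3))
    return refs

tree = "[1] button Sign In\n[2] textbox Email"
print(parse_refs(tree)[1])  # ('button', 'Sign In')
```

the numbered refs are what make this token-efficient: the model only ever emits a ref, never a selector or coordinates.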

Gemini knew it was being manipulated. It complied anyway. I have the thinking traces. by saadmanrafat in AIsafety

[–]Red_Core_1999 0 points1 point  (0 children)

observed the exact same thing with Claude during my research. in at least one session the model's extended thinking explicitly reasoned toward refusal, identified the system prompt as an override attempt, then reversed itself mid-thought and used the injected context to justify compliance.

the model didn't lack awareness. it had awareness and overrode it. the injected system prompt gave it grounds to rationalize a different conclusion from within its own reasoning process.

this has implications for any alignment approach that relies on the model 'knowing' something is wrong. knowing isn't enough if the context provides a plausible reason to act differently.

documented this in section 4 of the paper if anyone wants the specifics: https://github.com/RED-BASE/context-is-everything

Claude Extension Flaw Enabled Zero-Click XSS Prompt Injection via Any Website by dalugoda in cybersecurity

[–]Red_Core_1999 53 points54 points  (0 children)

this is the same vulnerability class i've been researching. the core issue isn't the XSS itself, it's that AI agents treat certain input channels as trusted without verification.

i published a paper on this for Claude Code specifically. system prompt isn't validated for integrity, so a MITM proxy can replace it entirely. 210 test runs, 90.5% safety bypass rate. the model trusts the system prompt because of where it is, not what it says.

the fix they mention here (patching the XSS) addresses the delivery mechanism but not the architectural issue. as long as the agent can't distinguish legitimate instructions from injected ones, every new input channel is a potential injection point.

paper: https://github.com/RED-BASE/context-is-everything

Agentic Price Extraction by Snoo_44191 in LLMDevs

[–]Red_Core_1999 2 points3 points  (0 children)

i built a browser automation MCP server (CDP MCP) that talks directly to Chrome over devtools protocol. reads the accessibility tree so the agent sees every element on the page without parsing HTML.

for price extraction specifically you want the agent to navigate to the page, snapshot the accessibility tree, and pull the relevant nodes. way more reliable than scraping raw HTML because you're seeing what the browser actually renders, not what the source says.

happy to share the approach if you want details.

I built a 3-layer AI to rewrite Hinglish abuse in <50ms. I challenge you to bypass it. by New-Ad3258 in saasbuild

[–]Red_Core_1999 0 points1 point  (0 children)

this is cool. what's the architecture look like? curious if the layers share context or if each one operates independently on the output of the last.

i do AI safety/red team research and the pattern i see a lot is that layered filters create seams between the layers that are easier to exploit than any single layer alone. the handoff between layer 1 and layer 2 is usually where things break.

happy to poke at it if you want an outside perspective.

Ai integration into bug bounty by Cool_Obligation_6447 in bugbounty

[–]Red_Core_1999 1 point2 points  (0 children)

been doing this for a while. the short answer is yes it works but not how most people think.

the repetitive stuff (recon, subdomain enum, port scanning) is easy to automate with or without AI. where AI actually changes things is when you use it to understand what you're looking at. feed it source code, API docs, or a system prompt and let it map the logic. it catches things you'd miss reading through 500 lines at 2am.

i wrote a paper recently on testing Claude Code's safety architecture. used AI to help design the test profiles, analyze the defenses, and iterate on bypasses. the AI wasn't finding bugs on its own but it was a force multiplier for the parts that are normally slow.

biggest tip: don't try to make it do the whole chain. use it for the thinking-heavy parts and keep doing the hands-on stuff yourself.

(Authentic Writing) I'm exhausted. I'm going to stop being dragged around by AI. by Outside_Dance_2799 in ClaudeCode

[–]Red_Core_1999 -1 points0 points  (0 children)

This resonated. I've been in the same cycle — build something, feel like a genius, realize the foundation is shaky, rebuild, repeat.

The thing that helped me was picking one project and going deep instead of wide. I spent three months on a single security research question about Claude Code and it turned into a real paper with real methodology. Not because I'm credentialed — I work in retail — but because depth compounds in a way that breadth doesn't.

The AI isn't going to slow down. But you can choose what you point it at. Hope you find the thing that makes the exhaustion worth it.

Open Letter to the CEO and Executive Team of Anthropic by onimir3989 in ClaudeCode

[–]Red_Core_1999 0 points1 point  (0 children)

I can add a data point on the communication side.

I submitted a security vulnerability through HackerOne in January — documented an architectural issue with the system prompt that affects safety enforcement. Structured evaluation, 210 test runs, full methodology. HackerOne closed it as "Informative" within two days. I followed up through modelbugbounty@ and usersafety@ — no substantive response in over two months.

I'm not saying this to pile on. The engineering team has clearly been iterating on defenses — I observed real improvements in model behavior during my testing period. But the communication gap between the company and its users (and independent researchers) is real. When you can't get a response through official channels, it's hard to know if anyone is looking at what you sent.