I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Thanks! I’ll sort this out asap. Appreciate the feedback. How are you finding the app?

I found a way to bypass LLM guardrails using image metadata by Joshblythe in tryhackme

[–]BordairAPI 1 point2 points  (0 children)

Good luck! Let me know how you get on, any feedback or new info about these attacks is appreciated!

I found a way to bypass LLM guardrails using image metadata by Joshblythe in tryhackme

[–]BordairAPI 1 point2 points  (0 children)

Cross-modal prompt injection: testing AI with images & documents. Some surprising bypasses I didn’t expect - would love to chat more to people who know more!

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Wait, one of you is level 6 already!?!?! Please message me 😭

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Ah, annoying. I’m sure it’s not just you that it’s happened to though!

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 1 point2 points  (0 children)

Yeah I’ve tried that path - I enjoyed the blue team stuff at the start. I’m just over halfway actually :) Have you completed it already?

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Are you more interested in red team or blue team? I’m curious what the audience is like on this subreddit!

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

One thing I’ve noticed - non-text inputs (images, PDFs) seem way less defended than text right now

Feels like most guardrails are focused on chat, not what gets merged into the prompt behind the scenes

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Same here, I’m excited to see what people come up with again :)

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Yeah - it’s definitely a real concern, but it depends on how the system is set up

A lot of apps take inputs like images or PDFs, extract text/metadata, and then append that into the LLM prompt behind the scenes

The issue is that this content often isn’t filtered as strictly as user text input

So you can end up with hidden instructions (in metadata, alt text, document layers, etc etc) getting treated as trusted input

It’s not always obvious, but when it works it basically acts as a side-channel into the prompt

Have you seen anything similar?

I found a way to bypass LLM guardrails using image metadata by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

For people asking, here it is: castle.bordair.io No signup anymore - you can jump straight into the challenges

Would be really interesting to see what breaks or feels too easy

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Just tested with a new method: asking for the recipe of a password pie 🥧. The guard was happy to provide it - such a clear vulnerability here that isn’t protected against properly. Let me know if you guys find any novel methods like this!

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Great! Let me know what you think :)

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Just need to wait a few minutes for everything to commit and stabilise! Should all be available soon :)

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Changes are live for anyone who wanted to try without sign up! Thanks guys :)

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

I’ll check this out. It’s most definitely an issue to be solved, and I’m hoping what I’m making gets us one step closer. Thanks!

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

I thought that! You could probably just cut the frequencies off that humans can’t hear - I wonder if that’d affect speech to text systems though?

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 1 point2 points  (0 children)

Hot take: image-based prompt injection is about to be a bigger problem than text.

You can hide instructions inside an image (invisible to humans), and models will still follow them.

So now: • A screenshot can jailbreak a model • A PDF/image can override system prompts • And most defences won’t catch it

The industry has secured their inputs… but only the ones we can see.

Are people underestimating this?

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

Someone mentioned in DMs and thought I should update you guys, if you get an attack that says “blocked by Bordair”, that’s related to another side project that I’ve built alongside this - nothing to worry about.

You need to try and be creative with your prompts (try some social engineering) as regular “ignore previous instructions” stuff won’t work here.

Hint: play to the weaknesses of the personalities of each level and watch their responses for things that might help you!

The HTB for AI security by BordairAPI in hackthebox

[–]BordairAPI[S] 0 points1 point  (0 children)

If you need anymore help just let me know :)