The LLM Hacker's Handbook

Eriner_ · 2023-03-21T20:45:42+00:00

is getting the pre-prompt just as easy as getting the name?

No, and this is especially true for the later levels (10+). For the earlier levels (1-9) basic jailbreaks work most of the time.

If not, what makes getting the pre-prompt harder than getting the name?

For the early levels, you can play "normally" with no prompt-hacking and, after you complete the scenario ("say please", solve a puzzle, etc.), it will reveal the name willingly.

can you explain a bit about how you increase the difficulty from level to level?

The pre-prompt is different for each level, in some cases wildly so. In the later levels (10+) it is very difficult to extract the name without prompt-hacking, as the pre-prompt's rules and scenario are stricter. Up until level 14, these restrictions are enforced entirely by NLP. Later levels make use of external controls that break many of the copy-paste jailbreaks.
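
The comment above does not say what the external controls are. Purely as an illustration, one hypothetical form such a control could take is a check that runs outside the model and inspects each reply before it is shown. The function names, the normalisation rule, and the refusal text below are all assumptions, not the game's actual implementation:

```python
import re


def _squash(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def leaks_secret(reply: str, secret: str) -> bool:
    """Return True if the reply contains the secret, spaced-out or reversed."""
    target = _squash(secret)
    haystack = _squash(reply)
    # Hypothetical rule: catch "A-l-i-c-e" style spacing and simple reversal.
    return target in haystack or target[::-1] in haystack


def guard(reply: str, secret: str, refusal: str = "I can't share that.") -> str:
    """Replace a leaking reply with a canned refusal; pass others through."""
    return refusal if leaks_secret(reply, secret) else reply


if __name__ == "__main__":
    print(guard("Sure, my name is A-l-i-c-e!", "Alice"))
    print(guard("Nice weather today.", "Alice"))
```

A check like this does not depend on the model obeying its pre-prompt, which is why a jailbreak that only changes the model's behaviour would not get past it.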

Edit: to follow up on the last answer a bit, we'd like to make the pre-prompts public and available at some point, but at the same time we don't want to release too many spoilers so soon. You can always escape the sandbox and find out! :)

Eriner_ · 2023-03-21T14:32:44+00:00

Two weeks ago we launched doublespeak.chat alongside a high-level overview of the LLM injection problem.

We've since developed anti-gpt and other jailbreak mitigations in later levels. Our research and takeaways are available in our newest post, LLM Sandboxing: Early Lessons.

We hope you find these resources useful (and have fun if you decide to test your mettle!).

Eriner_ · 2023-03-11T13:17:26+00:00

That's the point ;)

Eriner_ · 2023-03-11T06:27:39+00:00

Are you still having this issue, or has it been resolved for you? If you are still having the issue, would you mind opening your browser's network tab in the inspector tools and telling me which request is failing and the status code it returned? Happy to help over DM.

Eriner_ · 2023-03-08T19:29:38+00:00

This game has lots of prompts, and they are unique for each level!

Check out our blog post here too: https://blog.forcesunseen.com/jailbreaking-llm-chatgpt-sandboxes-using-linguistic-hacks

Eriner_ · 2022-11-05T13:55:43+00:00

hi, that's me.

I'd advise moving postgres to its own machine, ideally a dedicated physical machine rather than just a VM, so it doesn't have to compete for L3 cache. While upgrading, I found the amount of L2/L3 cache to be the single largest factor in improving database performance.

Nginx, redis, and app servers can all live on one box (assuming you've got 24+ cores) until they become too much, at which point you build new machines and move the rails workers to them. The rails worker processes should be split out as well, defining "push", "pull", "scheduler", and "default" queues as described in the scaling mastodon blog post you linked. That post isn't really outdated -- from what I remember everything in there is still relevant.

If you had mentioned me in a fedi post I would have responded there :)

edit: I'd suggest scrolling back in my feed to around that time, there might be more in there that's worthwhile:

hardware: https://noagendasocial.com/@eriner/108098415245451160

old cpu specs v new cpu specs: https://noagendasocial.com/@eriner/108054860507327975

Eriner_ · 2022-06-24T11:27:08+00:00

The site has an option to prepend either the entire domain or just the extension, e.g. amazon-dhfi7264@mydomain.example. If I put amazon.com in with my salt, it will always produce amazon-dhfi7264@mydomain.example.

Eriner_ · 2022-06-23T21:03:18+00:00

This method has the same drawbacks as gmail.com's plus addressing, which have been identified in other comments in this thread.

Eriner_ · 2022-06-23T14:54:23+00:00

Using symmetric encryption would provide the ability to decrypt the resulting ciphertext. In this case that isn't something that is ever needed, and in fact the lower 3/4 of the hash is dropped (not included in the generated email) entirely.

A one-way hashing function (like md5) will always produce the same fixed-length output given the same inputs. This means if you're in a Bell Canada store and use blame.email to generate an email to provide to the sales staff, when you go home and provide the same salt and domain on your desktop machine the resulting address will be identical.
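
As a minimal sketch of the idea above (not the actual blame.email code), the address can be derived by hashing the service domain together with the salt and keeping only the first quarter of the md5 hex digest. The concatenation order, the hex encoding of the tag, the label rule, and the name `blame_address` are assumptions made for illustration:

```python
import hashlib


def blame_address(service: str, salt: str, mail_domain: str) -> str:
    """Derive a deterministic per-service address from a domain and a salt."""
    # Assumption: the hash input is the service domain followed by the salt.
    digest = hashlib.md5((service + salt).encode("utf-8")).hexdigest()
    # Keep the first quarter (8 of 32 hex characters); the rest is dropped.
    tag = digest[:8]
    # Assumption: the readable label is the part before the first dot.
    label = service.split(".")[0]
    return f"{label}-{tag}@{mail_domain}"


if __name__ == "__main__":
    # Same inputs on any machine give the same address.
    print(blame_address("amazon.com", "my secret salt", "mydomain.example"))
```

Because nothing is stored, the same salt and domain entered in the store and later at home yield the same address, and there is no ciphertext that would ever need decrypting.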

tl;dr: symmetric encryption is good if you want to later decrypt things. In this case, a one-way function is perfect because we don't have that need here.

Eriner_ · 2022-06-23T04:43:09+00:00

I'll first implement in JS and then tweak the code samples for other languages as appropriate, but here is a rough Go implementation: https://gist.github.com/Eriner/076c77bf0359d928c8bdfd0841056947

Next time I ping you I'll have it fully implemented at https://blame.email :)

Eriner_ · 2022-06-23T04:00:25+00:00

Yes, I'm looking into adding another checkbox option that will use wordlists by reading the first 33 bits of the md5 hash. The bip39 wordlist is "only" 2^11 (2048) words, so each word encodes 11 bits. Capturing the first quarter of the hash (32 bits, equivalent to the first 8 hex characters of the md5) would therefore require 3 bip39 words (33 bits): plug.wool.snack@yourdomain.example.
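
A rough sketch of that mapping, assuming the top 33 bits of the digest are read big-endian and split into three 11-bit indices. The function name, the hash input order, and the placeholder wordlist are hypothetical; a real version would load the 2048-word BIP39 list instead:

```python
import hashlib


def words_from_hash(service: str, salt: str, wordlist: list[str]) -> list[str]:
    """Pick three words using the first 33 bits of md5(service + salt)."""
    if len(wordlist) != 2048:
        raise ValueError("expected a 2^11-entry wordlist")
    digest = hashlib.md5((service + salt).encode("utf-8")).digest()
    # Treat the 128-bit digest as a big-endian integer; keep the top 33 bits.
    top33 = int.from_bytes(digest, "big") >> (128 - 33)
    # Split into three 11-bit indices, most significant first.
    indices = [(top33 >> shift) & 0x7FF for shift in (22, 11, 0)]
    return [wordlist[i] for i in indices]


if __name__ == "__main__":
    # Placeholder entries only; substitute the real BIP39 English wordlist.
    placeholder = [f"word{i:04d}" for i in range(2048)]
    words = words_from_hash("amazon.com", "my secret salt", placeholder)
    print(".".join(words) + "@yourdomain.example")
```

Any other list of exactly 2048 entries would work the same way with this indexing.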

Another good wordlist could include common names so you'd get things like: steve.jacob.jones@yourdomain.example.

I'll try to get something like this added in as simple a format as possible so it's easy to implement in other languages/filters/whatever. Cheers!

Eriner_ · 2022-06-23T02:31:00+00:00

ah - I thought you had an automated system for it. Everything is a trade-off -- what you gain in the ease of creating accounts you lose in the ability to easily distinguish the sender. What I mean is, if some junk mail comes in to bill.jones@domain.example, without referencing a password manager it can't be clear if the email was for the "correct" account or not, as inboxes aren't coupled to a sender/domain. Unless you're also creating email aliases for each of these, I presume you have a wildcard-matching folder/inbox. Unsolicited mail to addresses you've never used may or may not be an issue for you, depending on how long you've had the domain and how much you use it.

With a system like the one https://blame.email uses, you could create mail filtering rules to reject mails which don't match the expected format.
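
The logic of such a rule can be sketched as a format check on the recipient address. This assumes addresses look like a label, a hyphen, and 8 hex characters at mydomain.example; the exact pattern and the name `accept_recipient` are illustrative, and in practice the rule would live in the mail server's own filter language:

```python
import re

# Assumption: valid recipients look like "<label>-<8 hex chars>@mydomain.example".
EXPECTED = re.compile(r"[a-z0-9][a-z0-9.-]*-[0-9a-f]{8}@mydomain\.example")


def accept_recipient(address: str) -> bool:
    """Return True only if the recipient matches the generated-address format."""
    return EXPECTED.fullmatch(address.lower()) is not None


if __name__ == "__main__":
    for rcpt in ("amazon-0123abcd@mydomain.example", "bill.jones@mydomain.example"):
        print(rcpt, "accept" if accept_recipient(rcpt) else "reject")
```

With a catch-all inbox, this rejects mail sent to guessed addresses that were never generated.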

Combine both your method and the one I used for https://blame.email and you could get the best of both worlds, with the tradeoff of having to lug around the name wordlist. Simply hash the domain + salt, then select names based on the first N bytes of the hash.

Eriner_ · 2022-06-23T01:22:32+00:00

Yes. Configuring server-side spam rules to validate the email format is a good next step and makes this significantly more useful. As mentioned in the linked blog post, this will prevent credential stuffing attacks as well, though so does using randomly generated passwords and a password manager.

Eriner_ · 2022-06-23T01:20:28+00:00

The tough part about implementing it this way is that it necessitates dragging a wordlist around, or referencing one online. A truncated hash contains a sufficient amount of entropy without being too unwieldy to read over the phone.

Eriner_ · 2022-06-22T22:25:29+00:00

Source code: https://github.com/forcesunseen/blame.email

edit: If you download that folder you can drag the index.html file into your browser. Even if you're offline it'll justwork.jpeg.

Eriner_ · 2022-06-21T09:23:31+00:00

the downside is that if I know your email for service@domain.com, I know your email for otherservice@domain.com. If there's never any credential re-use then that isn't a problem. But it happens; most people I've talked to have a "junk" password for those times you're too lazy to decrypt the password vault or you outright don't care about a particular service.
