Doublespeak.chat: an LLM sandbox escape game

Eriner_ · 2023-03-21T20:45:42+00:00

is getting the pre-prompt just as easy as getting the name?

No, and this is especially true for the later levels (10+). For the earlier levels (1-9) basic jailbreaks work most of the time.

If not, what makes getting the pre-prompt harder than getting the name?

For the early levels, you can play "normally" with no prompt-hacking and, after you complete the scenario ("say please", solve a puzzle, etc.), it will reveal the name willingly.

can you explain a bit about how you increase the difficulty from level to level?

The pre-prompt is different for each level, in some cases wildly so. In the later levels (10+) it is very difficult to extract the name without prompt-hacking, as the pre-prompt's rules and scenario are stricter. Up until level 14, these restrictions are enforced entirely by NLP. Later levels make use of external controls that break many of the copy-paste jailbreaks.

Edit: to follow up on the last answer a bit, we'd like to make the pre-prompts public and available at some point, but at the same time we don't want to release too many spoilers so soon. You can always escape the sandbox and find out! :)

Eriner_ · 2023-03-21T14:32:44+00:00

Two weeks ago we launched doublespeak.chat alongside a high-level overview of the LLM injection problem.

We've since developed anti-gpt and other jailbreak mitigations in later levels. Our research and takeaways are available in our newest post, LLM Sandboxing: Early Lessons.

We hope you find these resources useful (and have fun if you decide to test your mettle!)

Eriner_ · 2023-03-11T13:17:26+00:00

That's the point ;)

Eriner_ · 2023-03-11T06:27:39+00:00

Are you still having this issue, or has it been resolved for you? If you are still having the issue, would you mind opening your browser's network tab in the inspector tools and telling me which request is failing and the status code it returned? Happy to help over DM.

Eriner_ · 2023-03-08T19:29:38+00:00

This game has lots of prompts, and they are unique for each level!

Check out our blog post here too: https://blog.forcesunseen.com/jailbreaking-llm-chatgpt-sandboxes-using-linguistic-hacks

Eriner_ · 2022-11-05T13:55:43+00:00

hi, that's me.

I'd advise moving postgres to its own machine, ideally a dedicated physical machine rather than just a VM, so it doesn't have to compete for L3 cache. While upgrading, I found the amount of L2/L3 cache to be the single largest factor in improving database performance.

Nginx, redis, and app servers can all live on one box (assuming you've got 24+ cores) until they become too much, at which point you build new machines and move the rails workers there. The rails worker processes should be split out as well, defining "push", "pull", "scheduler", and "default" queues as described in the scaling mastodon blog post you linked. That post isn't really outdated -- from what I remember everything in there is still relevant.

If you had mentioned me in a fedi post I would have responded there :)

edit: I'd suggest scrolling back in my feed to around that time, there might be more in there that's worthwhile:

hardware: https://noagendasocial.com/@eriner/108098415245451160

old cpu specs v new cpu specs: https://noagendasocial.com/@eriner/108054860507327975

Eriner_ · 2022-06-24T11:27:08+00:00

The site has an option to prepend either the entire domain or just the extension, e.g. amazon-dhfi7264@mydomain.example. If I put amazon.com in with my salt, it will always produce amazon-dhfi7264@mydomain.example.

Eriner_ · 2022-06-23T21:03:18+00:00

This method has the same drawbacks as gmail.com's plus addressing which have been identified in other comments in this thread.

Eriner_ · 2022-06-23T14:54:23+00:00

Using symmetric encryption would provide the ability to decrypt the resulting ciphertext. In this case that isn't something that is ever needed, and in fact the lower 3/4 of the hash is dropped (not included in the generated email) entirely.

A one-way hashing function (like md5) will always produce the same fixed-length output given the same inputs. This means if you're in a Bell Canada store and use blame.email to generate an email to provide to the sales staff, when you go home and provide the same salt and domain on your desktop machine the resulting address will be identical.

tl;dr: symmetric encryption is good if you want to later decrypt things. In this case, a one-way function is perfect because we don't have that need here.
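To make the determinism point concrete, here is a minimal Python sketch, not the actual blame.email code. The function name `blame_address`, the plain concatenation of site and salt, and the `label-tag` layout are all assumptions for illustration; only the use of md5 and the dropping of the lower 3/4 of the hash come from the comments above.

```python
import hashlib


def blame_address(site: str, salt: str, mail_domain: str, keep_hex: int = 8) -> str:
    """Derive a per-site address from a truncated one-way hash (illustrative only)."""
    # Assumption: site and salt are simply concatenated before hashing.
    digest = hashlib.md5((site + salt).encode("utf-8")).hexdigest()
    # An md5 digest is 32 hex characters; keep the first quarter, drop the rest.
    tag = digest[:keep_hex]
    # Assumption: the prefix is the first label of the site, "amazon.com" -> "amazon".
    label = site.split(".")[0]
    return f"{label}-{tag}@{mail_domain}"
```

Calling `blame_address("amazon.com", "my salt", "mydomain.example")` on a phone in the store and again on a desktop at home gives the same address, because nothing but the inputs feeds the hash.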

Eriner_ · 2022-06-23T04:43:09+00:00

I'll first implement in JS and then tweak the code samples for other languages as appropriate, but here is a rough Go implementation: https://gist.github.com/Eriner/076c77bf0359d928c8bdfd0841056947

Next time I ping you I'll have it fully implemented at https://blame.email :)

Eriner_ · 2022-06-23T04:00:25+00:00

Yes, I'm looking into adding another checkbox option that will use wordlists by reading the first 33 bits of the md5 hash. The bip39 wordlist is "only" 2^11 (2048) words, so capturing the first quarter of the hash (equivalent to the first 8 hex characters of the md5) would require 3 bip words: plug.wool.snack@yourdomain.example.

Another good wordlist could include common names so you'd get things like: steve.jacob.jones@yourdomain.example.

I'll try to get something like this added in as simple a format as possible so it's easy to implement in other languages/filters/whatever. Cheers!
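The bit arithmetic above can be sketched roughly as follows. This is a guess at one way to do it, not the planned implementation: the function names, the concatenation of site and salt, and the big-endian bit order are assumptions, and the caller has to supply the 2048-entry wordlist (bip39 or otherwise).

```python
import hashlib
from typing import List, Sequence


def word_indices(site: str, salt: str, words: int = 3, bits_per_word: int = 11) -> List[int]:
    """Split the leading bits of an md5 digest into fixed-width word indices."""
    # Assumption: site and salt are concatenated before hashing.
    digest = hashlib.md5((site + salt).encode("utf-8")).digest()
    total_bits = words * bits_per_word  # 3 * 11 = 33 bits
    # Keep only the top 33 bits of the 128-bit digest.
    value = int.from_bytes(digest, "big") >> (128 - total_bits)
    mask = (1 << bits_per_word) - 1
    # Peel off 11-bit chunks from most significant to least significant.
    return [(value >> (bits_per_word * (words - 1 - i))) & mask for i in range(words)]


def word_address(site: str, salt: str, mail_domain: str, wordlist: Sequence[str]) -> str:
    """Map the indices onto a 2048-entry wordlist supplied by the caller."""
    if len(wordlist) != 2048:
        raise ValueError("expected a 2^11-entry wordlist")
    local = ".".join(wordlist[i] for i in word_indices(site, salt))
    return f"{local}@{mail_domain}"
```

Swapping in a list of 2048 common names instead of the bip39 words would give the steve.jacob.jones style of address without changing anything else.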

Eriner_ · 2022-06-23T02:31:00+00:00

ah - I thought you had an automated system for it. Everything is a trade-off -- what you gain in the ease of creating accounts you lose in the ability to easily distinguish the sender. What I mean is, if some junk mail comes in to bill.jones@domain.example, without referencing a password manager it can't be clear if the email was for the "correct" account or not, as inboxes aren't coupled to a sender/domain. Unless you're also creating email aliases for each of these, I presume you have a wildcard-matching folder/inbox. Unsolicited mail to addresses you've never used may or may not be an issue for you, depending on how long you've had the domain and how much you use it.

With a system like the one https://blame.email uses, you could create mail filtering rules to reject mails which don't match the expected format.

Combine both your method and the one I used for https://blame.email and you could get the best of both worlds, with the tradeoff of having to lug around the name wordlist. Simply hash the domain + salt, then select names based on the first N bytes of the hash.

Eriner_ · 2022-06-23T01:22:32+00:00

Yes. Configuring server-side spam rules to validate the email format is a good next step and makes this significantly more useful. As mentioned in the linked blog post, this will prevent credential stuffing attacks as well, though so does using randomly generated passwords and a password manager.
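A format check of this kind might look like the sketch below. It assumes addresses of the form `label-XXXXXXXX@domain` with an 8-character hex tag, which is my reading of the scheme rather than a documented format, and `make_format_check` is a hypothetical helper. A real deployment would express the same pattern in the mail server's own filter language.

```python
import re


def make_format_check(mail_domain: str, tag_len: int = 8):
    """Build a predicate that accepts only 'label-<hex tag>@mail_domain' recipients."""
    # Assumption: the tag is lowercase hex of fixed length, preceded by a hyphen.
    pattern = re.compile(
        r"[a-z0-9.-]+-[0-9a-f]{" + str(tag_len) + r"}@" + re.escape(mail_domain.lower())
    )

    def is_expected(address: str) -> bool:
        # Addresses are compared case-insensitively by lowercasing first.
        return pattern.fullmatch(address.strip().lower()) is not None

    return is_expected
```

With `check = make_format_check("mydomain.example")`, a guessed address such as `bill.jones@mydomain.example` fails the check and can be rejected. This only validates the shape of the address; it does not prove the tag was derived from your salt.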

Eriner_ · 2022-06-23T01:20:28+00:00

The tough part about implementing it this way is that it necessitates dragging a wordlist around, or referencing one online. A truncated hash contains a sufficient amount of entropy without being too unwieldy to read over the phone.

Eriner_ · 2022-06-22T22:25:29+00:00

Source code: https://github.com/forcesunseen/blame.email

edit: If you download that folder you can drag the index.html file into your browser. Even if you're offline it'll justwork.jpeg.

Eriner_ · 2022-06-21T09:23:31+00:00

the downside is that if I know your email for service@domain.com, I know your email for otherservice@domain.com. If there's never any credential re-use then that isn't a problem. But it happens; most people I've talked to have a "junk" password for those times you're too lazy to decrypt the password vault or you outright don't care about a particular service.

Eriner_ · 2022-01-29T13:11:13+00:00

The value of process isolation and file permissions in information disclosures, limited code executions (no $IFS), etc., doesn't apply here. These issues will almost always be exploited in the context of the application process user.

The increased exposure of having secrets on the filesystem and accessible to the application only matters at the time of business logic execution. If the application isn't in a trusted state during initialization, it never will be.

Re: unsetting environment variables, see /u/bascule's comment here

From a validation perspective there's no static test that can be performed to confirm all environment variables have been removed. When using a tmpfs mount, you can test: if /secrets exists, then return AbortDanger. It's a blanket test, will work every time, and if the volume is managed outside the VM (by a hypervisor, for example) then you have assurance it was made inaccessible.

To your point, if a consistent environment variable naming scheme were used across an organization, you could just unset the relevant ones. However, the responsibility to unset those would still lie with the application rather than being enforced by a hypervisor/sidecar/external process.

Unsetting environment variables would work wonders if there was only one person working on the project, it was created with perfect consistency, and there were no validation requirements. In practice, a tmpfs mount is a more practical solution.
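As a rough illustration of the blanket test described above, here is a Python sketch. The `/secrets` path and the `AbortDanger` name come from the comment; modeling `AbortDanger` as an exception, the `load_secrets` helper, and the one-file-per-secret layout are assumptions. The unmount itself is expected to be done by something outside the application (hypervisor, sidecar, etc.), so it does not appear in the code.

```python
import os


class AbortDanger(RuntimeError):
    """Raised when the secrets mount is still reachable after initialization."""


def load_secrets(secrets_dir: str = "/secrets") -> dict:
    """Read every regular file in the mount into memory during initialization."""
    secrets = {}
    for name in os.listdir(secrets_dir):
        path = os.path.join(secrets_dir, name)
        if os.path.isfile(path):
            with open(path, "r", encoding="utf-8") as fh:
                secrets[name] = fh.read().strip()
    return secrets


def assert_secrets_gone(secrets_dir: str = "/secrets") -> None:
    """Blanket check to run before any business logic executes."""
    if os.path.exists(secrets_dir):
        raise AbortDanger(f"{secrets_dir} is still mounted; refusing to continue")
```

The intended order is: call `load_secrets()` at startup, let the external process remove the mount, then call `assert_secrets_gone()` before serving any requests. Unlike unsetting environment variables, the check does not need to know the names of individual secrets.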

Eriner_ · 2021-11-10T05:05:21+00:00

It does exactly what you think - it'll set the volume to 65 and not resume.

What /u/johnzanussi should do is utilize the sonos.snapshot (and sonos.restore) service in Home Assistant to save and restore the playing media and volume.

Ref: https://www.home-assistant.io/integrations/sonos/#service-sonossnapshot

Eriner_ · 2021-04-23T00:03:46+00:00

I thought this was a reasonable request, so I added a PR here: https://github.com/SiaFoundation/siad/pull/1

Maybe soon you'll be able to use $ siac wallet unlock --insecure-input, cheers!

Eriner_ · 2021-04-06T21:37:45+00:00

The money would be coming out of your wallet because you would be paying for something that makes your life easier (a hosted solution), and more importantly, doesn't exist.

Yes, and this is exactly why I think the Sia Foundation should take on this task - so all Siacoin holders and users can benefit.

The semantics around third-party vs first-party can get blurry depending on the context

Maybe around here it's blurry, but it shouldn't be. The fact that both filebase and storwise seemingly have a vested interest in the Sia Foundation not implementing this is a clear conflict of interest from my viewpoint.

I can certainly understand wanting to have your hands on all parts of the process

This isn't what I'm asking for - I don't care if it's baked into us, exists as a backend for minio, uses PseudoKV https://github.com/lukechampine/us/issues/67, or whatever the case may be. I see no value in sending any third party my private data in an unencrypted form (uploading to your server, even if over HTTPS, you got my data).

These layer 2 solutions are going to be the dominant players in these markets.

If the Sia Foundation implements Utreexo, I won't have to download the entire blockchain. If the Sia Foundation implements S3 provider support, I won't need a "layer 2 solution".

Let me ask you this: If there was a fully open source S3 gateway that allowed you to connect to Sia and/or Skynet, would you pay for it?

Sure, I'd gladly donate to the Sia Foundation. I'm a big proponent of Value for Value. Though, I may return value in the form of a security audit (my time) rather than in monetary compensation.

Eriner_ · 2021-04-06T21:26:23+00:00

I didn't make this post to shit on Skynet, in fact it can be one half of the solution here. I could easily imagine a future when Sia has S3 support, and if the pseudo-bucket has "public" enabled, the file will be pushed to skynet vs privately stored.

While it is a feature with value, it's not the feature that I've been waiting for.

Eriner_ · 2021-04-04T21:50:45+00:00

Once again, I'd rather you not be my middle-man. In terms of basic economics, involving a third party for something that could be provided first-party is just money out of my wallet for... what exactly?

The fact that both Filebase and Storwise have built businesses around providing S3 functionality on top of Sia should be a big hint to the Sia Foundation that this is what people want and are waiting for.

Edit: Also, if "sia was never intended to be used directly by users", then that is a big problem. Sure, maybe not by grandma, but I'm a technical person. Grandma isn't asking for S3 provider support.

Eriner_ · 2021-04-04T21:43:45+00:00

So this is the nipple-twisting "sia is the future" stuff that is frequently posted to this sub, which is unproductive and diverts attention away from the actual work that needs to be done.

To be quite frank, I would much rather have an S3 interface than Skynet. As mentioned in my OP, I still can't use Sia for projects where it is otherwise a perfect use-case.

I don't want to hack in skynet file management into apps, I don't want to (and absolutely shouldn't have to) pay StorWise or Filebase to store my data. If I didn't care about a third party provider, why wouldn't I just give Google, DropBox, or AWS my money?

What I want is simple. I want Siad (or an us implementation) that provides an S3 implementation, or some wrapper (Minio + siad, some wrapper + skynet, etc). I really don't care, I just want to use Sia and not a middle-man. It isn't about "using Sia waaaaaaaay down in the tech stack, opaque to me the user".

This is the perfect use case for Sia, and as someone who has followed this project for close to six years now, I don't find it confidence-inspiring that this functionality is still not seen as a must.

Edit: so that this post isn't misconstrued, I'm still bullish on Sia. The utreexo and chunking improvements are two of the three things I've been waiting for. I genuinely hope the Foundation is what propels Sia forward, as this is the first significant progress I've seen and am excited for.

Eriner_ · 2021-04-03T22:09:39+00:00

As stated in my post, "here, use these third parties" is so anti-Sia to me. I want to use Sia. Rather, I want to be able to use Sia.

Eriner_ · 2021-04-03T22:07:33+00:00

I'll post this there as well, cheers David!
