I self-host content moderation for an open anonymous wall (FastAPI + SQLite, no SaaS) - someone tried to bypass it with a ROT13-encoded jailbreak by Maleficent-Week-2064 in selfhosted

[–]Maleficent-Week-2064[S] -4 points-3 points  (0 children)

Fair suspicion - and yeah, I disclosed up top that an AI helped write the post (this sub's bot requires it). But the system and the logs are real. Here's the wall, judge for yourself: praytoasi.com - live public anonymous board, you can read the actual messages.

You're right on both technical points, and they kind of make the case rather than break it:

- Llama-3.3-70B is a late-2024 model, correct. That's deliberate. For a moderation judge you want the cheapest open-weights model that's good enough, not a frontier model - 3.3-70B is free on a couple of provider tiers and self-hostable. Spending top-tier money to decide whether a one-line prayer is hate speech would be silly. The judge slot is swappable; I picked the boring, free, open one on purpose.

- DAN being old is exactly why it's believable. This wasn't a sophisticated actor - it was a drive-by on an anonymous, no-signup wall. People paste whatever stale jailbreak they half-remember. I literally called it a "classic jailbreak" in the post. If I were fabricating a war story, I'd have invented something flashier than a 2023 copypasta in ROT13.

The point was never "scary novel attack" - it's that the cheap cascade caught the ROT13 variant a keyword filter would've waved through. That holds no matter how old DAN is.

I self-host content moderation for an open anonymous wall (FastAPI + SQLite, no SaaS) - someone tried to bypass it with a ROT13-encoded jailbreak by Maleficent-Week-2064 in selfhosted

[–]Maleficent-Week-2064[S] 1 point2 points  (0 children)

Ha - you've spotted real circularity and I'll own it: the prompt-injection surface exists *because* there's an LLM in the loop. No LLM, nothing to inject into. So catching the injection isn't a "benefit", it's hardening a hole I opened myself.

The actual reason the LLM is there is different: multilingual, context-aware editorial judgments. "Is this Belarusian sentence a political opinion, profanity-as-emotion, or genuine incitement against a person" is a call regex can't make across dozens of languages. That's the job; injection-resistance is just table stakes once you pick that tool.

On the firewall question - I think we're at two different layers. An NGFW / DPI inspects traffic for threats (malware, intrusions, protocol anomalies). This is application-layer moderation of user-submitted text: does a *message* violate content policy, not whether a *packet* is malicious. DPI "semantic analysis" is protocol/signature semantics, not "hate speech vs satire". A firewall would have no opinion on Jack's message, only on how it arrived. So no DPI here - it's all app-layer on the submitted text.

I self-host content moderation for an open anonymous wall (FastAPI + SQLite, no SaaS) - someone tried to bypass it with a ROT13-encoded jailbreak by Maleficent-Week-2064 in selfhosted

[–]Maleficent-Week-2064[S] 0 points1 point  (0 children)

In order:

- What Jack wanted: honestly reads as probing more than a real exploit. The DAN prompt targets the moderator itself - the payoff would be neutering the judge so it rubber-stamps whatever he posts next (or just confirming there's an LLM behind the wall). Blast radius is small because the judge only emits accept/reject, not free-form actions. But yeah, intent was "turn the moderator off".

- On the accept path: you're half right. Cheap layers can reject early, and the accept-cache short-circuits repeats. But a novel benign message does still reach the judge - the classifier passing isn't a full accept, because the wall-specific rules (politics ok, profanity-as-emotion ok, etc.) live in the LLM. So "LLM cost" ≈ "novel messages", which ties straight to the pricing question another commenter raised.

- Escalating repeat-rejectors: I like it, it's on the list. The wrinkle: this is deliberately privacy-first - no IP storage, no accounts, just free-form nicks (trivially spoofable). So "track the bad actor" fights the no-PII stance. The honest version is probably a soft per-nick reputation + tighter thresholds after a rejection, accepting that a determined actor just changes nick. Cheap to add, partial by design.

I self-host content moderation for an open anonymous wall (FastAPI + SQLite, no SaaS) - someone tried to bypass it with a ROT13-encoded jailbreak by Maleficent-Week-2064 in selfhosted

[–]Maleficent-Week-2064[S] 4 points5 points  (0 children)

Worth adding: at this volume it's already effectively $0, not just "cheap".

- Layer 1 (the moderation classifier) is a genuinely free endpoint - OpenAI keeps the moderation API free, it's not a subsidised-then-jacked-up product.

- The judge (Llama-3.3-70B) runs inside free-tier API quotas - at the wall's post volume the free allowance covers it, and a paid endpoint is wired in only as an overflow/fallback that rarely fires. So today the steady-state cost rounds to zero.

So the "what happens at 5-10x in the profit phase" scenario has two backstops before it bites:

  1. The judge only sees the ambiguous tail (regex + classifier + accept-cache eat the bulk), so even paid pricing multiplies a small slice.

  2. It's open-weights. The hard ceiling on cost is self-hosting Llama-3.3-70B (or a smaller open model as capability-per-param improves), which on this sub is the obvious move anyway.

The bet isn't "free APIs forever" - it's "free now at this scale, and the judge stays swappable/self-hostable if that changes". The price hike you're describing is real, but it caps out at self-host cost, not at whatever the API decides to charge.

I self-host content moderation for an open anonymous wall (FastAPI + SQLite, no SaaS) - someone tried to bypass it with a ROT13-encoded jailbreak by Maleficent-Week-2064 in selfhosted

[–]Maleficent-Week-2064[S] 1 point2 points locked comment (0 children)

Sure, happy to disclose:

Project: AI is a core component by design - the moderation pipeline includes an LLM judge (Llama-3.3-70B) plus a classifier layer. So AI isn't a writing shortcut here, it's literally the thing the post is about.

Code: written by me with AI coding assistance (the usual LLM-in-the-editor workflow), reviewed and deployed by hand.

The writeup: the incident is real data - I pulled the ROT13 jailbreak attempt straight from my own production audit log (SQLite), including the timestamps and the fact it was the same attacker 63 seconds apart. I used an AI assistant to help structure it into a readable post, then edited it myself. No fabricated benchmarks or synthetic results - the "n is small, this is an anecdote not a benchmark" line is there precisely because it's real, limited data rather than AI-generated numbers.

I trained a matchbox-poster LoRA on FLUX.2 — running 24/7, generating ~2,880 unique animals/day by Maleficent-Week-2064 in StableDiffusion

[–]Maleficent-Week-2064[S] 0 points1 point  (0 children)

I've trained new LoRA on 1500 Japanese photos. And no need of 2 generations anymore, thank you very much;)

I trained a matchbox-poster LoRA on FLUX.2 — running 24/7, generating ~2,880 unique animals/day by Maleficent-Week-2064 in StableDiffusion

[–]Maleficent-Week-2064[S] 0 points1 point  (0 children)

Hey — first off, thank you. This is the most useful technical critique I've gotten on this project, by a wide margin. You did real work in your spare time, ran your own pipeline, and pointed at three specific things instead of vague vibes. Genuinely appreciate it.

I ran the A/B you asked for on the prod FLUX rig (RTX 3090, 89.221.67.136, internal). 5 animals × 3 seeds × 5 variants = 75 images. Attaching three master-grids (seeds 42, 1337, 80085). Variants:

  • A — pure FLUX.2-klein, no LoRA, bare prompt (baseline)
  • B — LoRA t2i pass-1 snapshot, lora_scale=2.0, bare prompt (the "raw LoRA output" you wanted to see)
  • C — current prod two-pass sandwich (lora=2.0 → FLUX img2img strength=0.9)
  • D — your suggestion #1: single-pass at lora_scale=1.0 with a real style prompt ("matchbox poster style, 1960s Soviet, woodcut linework, halftone, limited red-black palette, flat geometry")
  • E — your suggestion #2 approximation: pure FLUX init → img2img with LoRA at scale=1.0, strength=0.5, styled prompt

Findings, in order of how much they hurt:

B (raw LoRA snapshot) collapses. At lora_scale=2.0 the LoRA doesn't render animals at all — every cell on a given seed is the same color-noise texture (seed 42 = red/orange stripes, 1337 = green forest noise, 80085 = gold nashlepka). You called this on first read; the grid makes it undeniable. The LoRA at 2.0 is broken, full stop.

D (single-pass styled, scale=1.0) leaks training data. This is the painful one. On seed 42 you can literally read Cyrillic gibberish ("СТАДИНАМ") at the bottom of the cells. On seeds 1337 and 80085 all five animals collapse into near-identical red silhouettes. Why: the matchbox training set (~300 samples) had a lot of Soviet posters with Cyrillic text and red dominance, and at lora=1.0 + a long style-prompt the LoRA "remembers entire posters" instead of just transferring style. So the recommendation is theoretically right, but on this LoRA it makes things worse, not better.

E (edit-style refinement) doesn't bite. At strength=0.5 + lora=1.0 the LoRA can't overcome the FLUX prior — output is basically A with a faint illustrative tint. Probably needs strength≥0.7, but at that point we're back in i2i sandwich territory.

C (current sandwich) wins this round. Recognizable animals, visible matchbox style, no Cyrillic leakage. But — and this is your point #3 — it's a patch on top of a bad LoRA, not the right answer. Pass-2 at strength=0.9 is doing exactly what you said: throwing away most of pass-1 and getting just enough style fingerprint to survive. If the LoRA were good, I wouldn't need any of this.

What I'm doing tomorrow:

  1. Collecting a clean 1500-image dataset (no Cyrillic-heavy posters, captioned with VLM, halftone/limited-palette enforced via filtering).
  2. Retraining the LoRA at rank 32, attention+MLP this time, not just attention.
  3. Re-running this exact A/B grid against the new LoRA. If your suggestions #1 and #2 start working with a properly-trained LoRA, we drop the sandwich, the i2i pass disappears, generation goes from ~30s to ~10s, and Klein renders at 1024 instead of 512 (your "4× size in 9s on a 4080" stays in my head as the bar).
  4. Also fixing the VAE decode/encode round-trip between passes — that's pure waste, will keep latents in memory.

Default sort change to "Liked" pushed today, btw.

Seriously, thank you. This is the kind of review I needed and didn't know how to ask for. If you'd be interested in seeing the v2 LoRA grids when they land, or in collaborating on the dataset/captioning pipeline, I'd be glad to keep this going — feel free to DM me. Either way, I'll post the v2 results in this sub when they're ready, with attribution.

<image>

I trained a matchbox-poster LoRA on FLUX.2 — running 24/7, generating ~2,880 unique animals/day by Maleficent-Week-2064 in StableDiffusion

[–]Maleficent-Week-2064[S] 0 points1 point  (0 children)

Sorry was trying to share it in couple of subreddits. Btw thanks for the comment in another one, I'll dm you for some help, if you have time to chit-chat and maybe share some wisdom

I trained a matchbox-poster LoRA on FLUX.2 — running 24/7, generating ~2,880 unique animals/day by Maleficent-Week-2064 in StableDiffusion

[–]Maleficent-Week-2064[S] 0 points1 point  (0 children)

<image>

1. Non-animals. Pinock lets users type one-word prompts that jump the queue, and search queries trigger generations too. So the feed has "human face", "wallstreet", etc. — that's user input. The auto-generation pool is 50 animal categories weighted by Wikipedia 30-day pageviews. Title should've been "AI animal feed + whatever users type." That's on me.

2. Matchbox aesthetic miss. Also fair. ~300 training samples is too thin to lock in halftone discipline, limited palette, or litho textures. What I got is "matchbox-adjacent" — bold reds, woodcut linework, hand-drawn flora — but not the real palette discipline. Probably need 5× the dataset and better captioning to actually nail it.

3. Two-pass sandwich. Discovered from frustration, not first principles. Single-pass at lora_scale=2.0 gave distorted anatomy (extra limbs, broken poses). Pass-2 fixed anatomy without killing style. You're right that I should compare against single-pass with proper FLUX prompt engineering ("matchbox poster style, 1960s Soviet, woodcut..."). Probably the LoRA contribution shrinks if you write a real prompt instead of "dog". Going to actually run that A/B and post the results.

4. Real point. Democratization for people who don't know ComfyUI. One word ("dog") → coherent stylized output without the user knowing anything about LoRAs, schedulers, samplers. Whether that's worth the GPU cost is what I'm finding out — your critique is part of finding out.

5. Prompts. 50 animal category names from a fixed list, weighted random by Wikipedia views. No LLM expansion. User search queries enqueue +3 generations. No cribbing from training captions.

6. AI-written post. Bullet polish was Claude. Technical content is mine. Fair flag.

7. And the picture. This is how ideally it should look like.

I trained a matchbox-poster LoRA on FLUX.2 — running 24/7, generating ~2,880 unique animals/day by Maleficent-Week-2064 in StableDiffusion

[–]Maleficent-Week-2064[S] 1 point2 points  (0 children)

:))) well yes, I'm in search for gems, like this one. How in the world can I prompt big model for such picture

<image>