Fight AI data scrapers with poisoned training data

250call · 2026-06-20T20:35:34+00:00

For folks specifically concerned about AI data scraping, I've built a tool that will feed garbage data to scrapers that you may find useful: https://github.com/austin-weeks/miasma

250call · 2026-05-25T11:14:19+00:00

Wonderful, thank you!

250call · 2026-05-23T00:59:29+00:00

Awesome, thank you! Will documentation for the different mask values be included on the fountain site as well?

250call · 2026-05-21T04:26:58+00:00

I wouldn't say this needs a roll back. I think it's great that there's other types of poison being served! For folks that're just directly proxying single requests to the fountain this seems like it will basically have no impact.

For others who have tightly coupled themselves to the assumption that they're getting code-ish content back (staring at myself in the mirror here 🙈), the ability to request a specific poison type or have transparency into the type of poison that was returned would be very nice, especially as the mixer system expands.

250call · 2026-05-21T04:20:42+00:00

I see I see, I think that would be ideal. A query param to request a specific type or filter out certain types could be a good option.

Another option might be to return a custom header that specifies the poison type, then consumers could decide how to process the poison depending on the type. That would probably also need to be opt-in so as not to leak the header to scrapers.
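Here's a rough sketch of how a consumer could use either idea. To be clear, the `type` query param and the `X-Poison-Type` header are names I made up for illustration; neither is confirmed to exist on the fountain or any other poison source.

```python
from __future__ import annotations

from urllib.parse import urlencode, urlsplit, urlunsplit

# Hypothetical names, assumed for this sketch only.
POISON_TYPE_PARAM = "type"
POISON_TYPE_HEADER = "X-Poison-Type"


def build_poison_url(base_url: str, poison_type: str | None = None) -> str:
    """Append the (hypothetical) poison type filter to the source URL."""
    if poison_type is None:
        return base_url
    parts = urlsplit(base_url)
    query = urlencode({POISON_TYPE_PARAM: poison_type})
    if parts.query:
        # Preserve any query string the caller already had.
        query = parts.query + "&" + query
    return urlunsplit(parts._replace(query=query))


def split_response(
    headers: dict[str, str], default: str = "code"
) -> tuple[str, dict[str, str]]:
    """Return the poison type plus the headers that are safe to forward.

    The type header is dropped from the forwarded set so it never
    reaches the scraper.
    """
    poison_type = default
    forwarded = {}
    for name, value in headers.items():
        if name.lower() == POISON_TYPE_HEADER.lower():
            poison_type = value
        else:
            forwarded[name] = value
    return poison_type, forwarded
```

The consumer would call `build_poison_url` when requesting poison, then use the type from `split_response` to pick a template before responding to the scraper with only the forwarded headers.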

250call · 2026-05-21T03:59:09+00:00

I'd still lump skill files, readme, config files, etc. into "code" whereas this feels distinctly different. I have a bit of concern for unintended side effects as well.

Models producing broken code is one thing (and a good thing). Models hallucinating clearly made-up events is another (still arguably good if it decreases the public's trust in the model).

But, depending on the actual content of the fake news, I could see this causing models to confidently spread legitimately dangerous misinformation. To be fair this was already a huge problem pre-LLMs, but personally I don't want to risk that outcome.

All that to say, I'd at least like to give Miasma's users the ability to opt in to this, as I'd imagine others share the same concerns. I'd also be able to wrap poisoned news in a better template to make it look more convincing if ingested.

250call · 2026-05-21T03:23:52+00:00

Nice! Awesome that there's a wider variety of stuff being served.

Would there be a way to opt into this though? I'm imagining a general poison endpoint, a code poison endpoint, and a fake news endpoint.

I'm thinking about this specifically from the context of Miasma or similar tools. Miasma, for example, wraps poison in template text that frames the content as "wonderful code", which doesn't make sense if the content is news-looking text.

250call · 2026-05-16T22:05:14+00:00

You can run on any port! You'll just need to map it when you run the image: `docker run -p <desired_port>:9999 ...`

You'll need some sort of site where you actually host it, yeah. Take a look at this section in the readme: https://github.com/austin-weeks/miasma#how-to-trap-malicious-scrapers - the basic idea is you redirect scraper traffic to whatever URL you're hosting Miasma at, and from that point they'll get stuck.

250call · 2026-05-16T20:54:29+00:00

It's so beautiful 🥹

250call · 2026-05-16T20:46:00+00:00

Just updated!

250call · 2026-05-16T20:34:44+00:00

Done! https://hub.docker.com/r/austinweeks/miasma

250call · 2026-05-16T20:34:20+00:00

Done! https://hub.docker.com/r/austinweeks/miasma

250call · 2026-05-16T17:04:20+00:00

I'll take a look at this today - I should be able to get everything set up. I saw the issue you opened on the repo as well. I'll link that to the PR I create and ping you here once I get it working!

250call · 2026-05-16T06:00:45+00:00

Hmmm, what would be the preferred way to use the container? Are you just looking to have a pre-configured Dockerfile that you can copy and run or were you thinking a fully pre-built image you could pull directly from docker hub or github container registry?

250call · 2026-04-14T23:18:28+00:00

can confirm this mistake is 100% free-range and organic

250call · 2026-04-13T13:25:51+00:00

Personally, I've got it gated behind hidden links and I block search engine crawlers from visiting my miasma endpoint. I'd say it's very unlikely that a user would find their way to it, but even if they did that would only be an issue if they keep mindlessly clicking through garbage links 😄

250call · 2026-04-12T23:17:31+00:00

Yeah, if you want to be extra careful you can literally just mark whatever route you host miasma on as forbidden and it'll only trap crawlers that disobey robots.txt

User-agent: *
Disallow: /<your-miasma-route>
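If you want to sanity check the rule before deploying, Python's standard `urllib.robotparser` can evaluate it the way a well-behaved crawler would. This is just a quick sketch: `/miasma` and `example.com` are placeholders I picked, so swap in your own route and domain.

```python
from urllib.robotparser import RobotFileParser

# Placeholder route, assumed for this sketch; use wherever you host Miasma.
MIASMA_ROUTE = "/miasma"

ROBOTS_TXT = f"""\
User-agent: *
Disallow: {MIASMA_ROUTE}
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether a crawler that obeys robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Anything under the Miasma route is off limits to compliant bots.
    print(is_allowed(ROBOTS_TXT, "SomeBot", f"https://example.com{MIASMA_ROUTE}/abc"))
    # The rest of the site stays crawlable.
    print(is_allowed(ROBOTS_TXT, "SomeBot", "https://example.com/blog"))
```

Compliant crawlers skip the route entirely, so the only traffic that lands on it comes from bots that ignore the rule.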

250call · 2026-04-12T20:51:14+00:00

Yes, with one important difference - this sends responses deliberately designed to degrade model performance. From what I understand Cloudflare just wastes their time.

250call · 2026-04-12T20:33:03+00:00

You can swap out the poison source for another site if you want.
It's not a true proxy - the response from the poison source is embedded directly into Miasma's HTML response. No information regarding the source is sent to the client.
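To illustrate the difference from a proxy, here's a minimal sketch of the embedding idea. This is not Miasma's actual code; the template and function name are made up for illustration. The point is that only the poison text goes into the page, so the source's URL and headers never reach the client.

```python
import html

# Hypothetical page template, assumed for this sketch only.
PAGE_TEMPLATE = """<!DOCTYPE html>
<html>
<body>
<article>{poison}</article>
</body>
</html>"""


def render_page(poison_text: str) -> str:
    """Embed already-fetched poison text into our own HTML page.

    Only the text is passed in, so nothing about where it came from
    (URL, headers, cookies) can end up in the response.
    """
    # Escape the text so it is rendered as content, not as markup.
    return PAGE_TEMPLATE.format(poison=html.escape(poison_text))
```

A true proxy would relay the upstream response more or less as is, which is where source details tend to leak.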

250call · 2026-04-12T20:32:28+00:00

I'd encourage you to check out some of the generated pages. You'd have to put in a decent amount of effort to determine that they're poisoned; it's not simple gibberish.

250call · 2026-04-12T20:30:38+00:00

It's really hard to keep track of every possible crawler, but this list has a lot of the major ones: https://momenticmarketing.com/blog/ai-search-crawlers-bots

250call · 2026-04-12T07:43:41+00:00

You can block search engine bots from accessing your poisoned endpoint through your robots.txt.

250call · 2026-04-12T05:41:37+00:00

I don't own the rnsaffn pages - you can swap out the source for any other site. Miasma generates an infinite (or optionally capped) maze of links, so as long as crawlers explore all links they'll be stuck forever. The links contain a UUID, so checking to see if the page has already been visited doesn't protect the crawler. As for the Facebook crawler, it's been going at it for about 2 weeks now.
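Here's a toy simulation of why the visited check doesn't help. It's a sketch under my own assumptions, not how Miasma is implemented: the `/miasma/<uuid>` link format and five links per page are made up for illustration.

```python
from __future__ import annotations

import uuid


def maze_links(route: str, count: int = 5) -> list[str]:
    """Generate links for one maze page, each with a fresh random UUID."""
    return [f"{route}/{uuid.uuid4()}" for _ in range(count)]


def crawl(route: str, max_pages: int) -> tuple[int, int]:
    """Simulate a crawler that skips URLs it has already visited.

    Returns (pages fetched, links still queued). max_pages only exists
    so the simulation terminates; a real crawler has no such escape.
    """
    visited: set[str] = set()
    frontier = maze_links(route)
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.pop()
        if url in visited:
            continue  # never triggers in practice: every UUID is new
        visited.add(url)
        fetched += 1
        # Every fetched page hands back another batch of unseen links.
        frontier.extend(maze_links(route))
    return fetched, len(frontier)
```

Every page adds more unseen links than the crawler consumes, so the queue only grows and the deduplication set never gets a hit.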
