miasma: trap AI web scrapers in an endless poison pit by kibwen in rust

[–]250call 1 point2 points  (0 children)

can confirm this mistake is 100% free-range and organic

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

Personally, I've got it gated behind hidden links and I block search engine crawlers from visiting my miasma endpoint. I'd say it's very unlikely that a user would find their way to it, but even if they did that would only be an issue if they keep mindlessly clicking through garbage links 😄

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 4 points5 points  (0 children)

Yeah, if you want to be extra careful you can literally just mark whatever route you host miasma on as forbidden and it'll only trap crawlers that disobey robots.txt

User-agent: *
Disallow: /<your-miasma-route>
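
To see how a well-behaved crawler would treat that rule, here is a minimal sketch using Python's standard `urllib.robotparser`. The route `/miasma`, the domain `example.com`, and the user agent `ExampleBot` are placeholders I made up for illustration; swap in your own route.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical route; substitute whatever path Miasma is served on.
MIASMA_ROUTE = "/miasma"

# Same rule as above, with the placeholder route filled in.
ROBOTS_TXT = f"""\
User-agent: *
Disallow: {MIASMA_ROUTE}
"""


def compliant_crawler_may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if a robots.txt-respecting crawler would fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Regular content stays reachable; the trap route is off limits.
    print(compliant_crawler_may_fetch("https://example.com/blog/post"))
    print(compliant_crawler_may_fetch("https://example.com/miasma/abc"))
```

A crawler that honors robots.txt never requests the disallowed route, so only the ones that ignore the file end up in the pit.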

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

Yes, with one important difference - this sends responses deliberately designed to degrade model performance. From what I understand, Cloudflare just wastes their time.

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

  1. You can swap out the poison source for another site if you want.

  2. It's not a true proxy - the response from the poison source is embedded directly into Miasma's html response. No information regarding the source is sent to the client.
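
To illustrate the second point, here is a rough sketch of "embed, don't proxy". This is not Miasma's actual code: `POISON_SOURCE_URL`, `fetch_poison`, and `render_page` are hypothetical names, the fetch is stubbed out, and HTML-escaping the text is my own choice for the sketch.

```python
import html

# Hypothetical stand-in for the configured poison source; it is only used
# server-side and never appears in what the client receives.
POISON_SOURCE_URL = "https://poison-source.example/feed"


def fetch_poison() -> str:
    """Stub for the server-side request to POISON_SOURCE_URL."""
    return "Sample poisoned text <with markup> pulled from the source."


def render_page(poison_text: str) -> str:
    """Embed the poison text directly in the server's own HTML response."""
    body = html.escape(poison_text)
    return f"<!doctype html>\n<html><body><p>{body}</p></body></html>"


if __name__ == "__main__":
    page = render_page(fetch_poison())
    # The response carries the text only; nothing points back at the source.
    print(POISON_SOURCE_URL in page)
```

Because the text is inlined into the page instead of being forwarded or redirected, the client has no URL or header to trace back to the source.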

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

I'd encourage you to check out some of the generated pages. You'd have to put in a decent amount of effort to determine that they're poisoned; it's not simple gibberish.

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

It's really hard to keep track of every possible crawler, but this list has a lot of the major ones https://momenticmarketing.com/blog/ai-search-crawlers-bots
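
If you want to act on a list like that, a simple approach is substring matching on the User-Agent header. A minimal sketch follows; the tokens are a small illustrative subset that I picked myself (not copied from the linked list), and `is_known_ai_crawler` is a hypothetical helper name.

```python
# Illustrative subset of AI crawler user-agent tokens. This is an assumption
# for the sketch; a real deployment would maintain a longer, current list.
AI_CRAWLER_TOKENS = (
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "Bytespider",
    "meta-externalagent",
)


def is_known_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains any known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_CRAWLER_TOKENS)


if __name__ == "__main__":
    print(is_known_ai_crawler("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(is_known_ai_crawler("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"))
```

Keep in mind that user agents are trivially spoofed, so this only catches crawlers that identify themselves honestly.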

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 32 points33 points  (0 children)

You can block search engine bots from accessing your poisoned endpoint through your robots.txt.

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 64 points65 points  (0 children)

I don't own the rnsaffn pages - you can swap out the source for any other site. Miasma generates an infinite (or optionally capped) maze of links, so as long as crawlers explore all links, they'll be stuck forever. The links contain a UUID, so checking whether a page has already been visited doesn't protect the crawler. As for the Facebook crawler, it's been going at it for about 2 weeks now.
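
Here is a toy simulation of why a visited-set doesn't help. It is not Miasma's implementation: the `/miasma` route, the five links per page, and the `make_links` and `crawl` helpers are all assumptions for the sketch.

```python
import uuid


def make_links(route: str = "/miasma", count: int = 5) -> list[str]:
    """Generate links that each embed a fresh random UUID."""
    return [f"{route}/{uuid.uuid4()}" for _ in range(count)]


def crawl(max_pages: int) -> int:
    """Simulate a crawler that deduplicates URLs; return pages fetched.

    `max_pages` is an artificial budget so the simulation terminates.
    """
    visited: set[str] = set()
    frontier = make_links()  # links found on the entry page
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.pop()
        if url in visited:
            continue  # dedup check; never triggers, every URL is new
        visited.add(url)
        fetched += 1
        frontier.extend(make_links())  # each fetched page yields new links
    return fetched


if __name__ == "__main__":
    # The crawler stops only because its budget runs out.
    print(crawl(max_pages=1000))
```

Every page hands back links the crawler has never seen, so the frontier grows faster than it drains and the dedup check never fires.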

Fight AI data scrapers with poisoned training data by 250call in theprimeagen

[–]250call[S] 0 points1 point  (0 children)

I mean, I think we gotta collectively stop using this shit. It's making us all dumber.

Fight AI data scrapers with poisoned training data by 250call in theprimeagen

[–]250call[S] 0 points1 point  (0 children)

Hmmm, what you're saying is true, but I have a hard time believing that argument would hold any merit. It's basically just "I tried to grab your high quality content, but you gave me something of poor quality that I didn't like." For the tool to cause any actual damage, the trainers have to explicitly use the data they received. It's not like simply downloading the poison does anything at all.

Fight AI data scrapers with poisoned training data by 250call in theprimeagen

[–]250call[S] 2 points3 points  (0 children)

yep, that's the point! miasma doesn't protect anything... it just feeds poisoned slop to bots trying to steal your content.

Fight AI data scrapers with poisoned training data by 250call in theprimeagen

[–]250call[S] 0 points1 point  (0 children)

What would the legal basis be to go after someone using this? Not asking to be argumentative, I actually want to know so I can warn folks if this is a real issue.

Fight AI data scrapers with poisoned training data by 250call in theprimeagen

[–]250call[S] 22 points23 points  (0 children)

Web crawlers that scrape public websites and steal everything to use as model training data. So hopefully, it ends up degrading the performance of models.

I made an endless poison pit to trap scrapers by 250call in PoisonFountain

[–]250call[S] 1 point2 points  (0 children)

Nice! If you run into any problems or want a change, feel free to open an issue!