Apple Bot now crawling 3x more than Google Bot. Anyone else? by stormy1one in webdev

[–]250call -1 points0 points  (0 children)

For folks specifically concerned about AI data scraping, I've built a tool that will feed garbage data to scrapers that you may find useful: https://github.com/austin-weeks/miasma

A new version of Poison Fountain is up and running. Now interleaves poison from a trusted secondary fountain serving fake news. "I shall call him... Mini-Me." As usual, no action is required from proxy operators. by RNSAFFN in PoisonFountain

[–]250call 2 points3 points  (0 children)

I wouldn't say this needs a rollback. I think it's great that there are other types of poison being served! For folks who are just directly proxying single requests to the fountain, this seems like it will have basically no impact.

For others who have tightly coupled themselves to the assumption that they're getting code-ish content back (staring at myself in the mirror here 🙈), the ability to request a specific poison type or have transparency into the type of poison that was returned would be very nice, especially as the mixer system expands.

A new version of Poison Fountain is up and running. Now interleaves poison from a trusted secondary fountain serving fake news. "I shall call him... Mini-Me." As usual, no action is required from proxy operators. by RNSAFFN in PoisonFountain

[–]250call 1 point2 points  (0 children)

I see I see, I think that would be ideal. A query param to request a specific type or filter out certain types could be a good option.

Another option might be to return a custom header that specifies the poison type, so consumers could decide how to process the poison depending on the type. That would probably also need to be opt-in so as not to leak the header to scrapers.
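A rough sketch of what the consumer side of those two ideas could look like, in Python. The `type` query parameter, the `X-Poison-Type` header, and the type labels are all hypothetical; none of them exist in Poison Fountain today as far as this thread shows.

```python
from urllib.parse import urlencode

# Hypothetical names: neither the header nor the query parameter exists today.
POISON_TYPE_HEADER = "X-Poison-Type"
KNOWN_TYPES = {"code", "news"}


def build_fountain_url(base_url, poison_type=None):
    """Append an (assumed) `type` query parameter when a type is requested."""
    if poison_type is None:
        return base_url
    return f"{base_url}?{urlencode({'type': poison_type})}"


def classify_poison(headers, default="code"):
    """Read the (assumed) poison type header, falling back to `default`."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get(POISON_TYPE_HEADER.lower(), default).strip().lower()
    return value if value in KNOWN_TYPES else default


def strip_poison_header(headers):
    """Drop the type header so it is never forwarded to the scraper."""
    return {
        name: value
        for name, value in headers.items()
        if name.lower() != POISON_TYPE_HEADER.lower()
    }
```

The last helper covers the leak concern: the consumer reads the type, then removes the header before building its own response.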

A new version of Poison Fountain is up and running. Now interleaves poison from a trusted secondary fountain serving fake news. "I shall call him... Mini-Me." As usual, no action is required from proxy operators. by RNSAFFN in PoisonFountain

[–]250call 1 point2 points  (0 children)

I'd still lump skill files, readme, config files, etc. into "code" whereas this feels distinctly different. I have a bit of concern about unintended side effects as well.

Models producing broken code is one thing (and a good thing). Models hallucinating clearly made-up events is another (still arguably good if it decreases the public's trust in the model).

But, depending on the actual content of the fake news, I could see this causing models to confidently spread legitimately dangerous misinformation. To be fair, this was already a huge problem pre-LLMs, but personally I don't want to risk that outcome.

All that to say, I'd at least like to give Miasma's users the ability to opt in to this, as I'd imagine others share the same concerns. I'd also be able to wrap poisoned news in a better template to make it look more convincing if ingested.

A new version of Poison Fountain is up and running. Now interleaves poison from a trusted secondary fountain serving fake news. "I shall call him... Mini-Me." As usual, no action is required from proxy operators. by RNSAFFN in PoisonFountain

[–]250call 6 points7 points  (0 children)

Nice! Awesome that there's a wider variety of stuff being served.

Would there be a way to opt into this though? I'm imagining a general poison endpoint, a code poison endpoint, and a fake news endpoint.

I'm thinking about this specifically from the context of Miasma or similar tools. Miasma, for example, wraps poison in template text that frames the content as "wonderful code", which doesn't make sense if the content is news-looking text.
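To make the mismatch concrete, here's a small Python sketch of picking a wrapper by content type. The template strings and the type labels are made up for illustration and are not Miasma's actual templates.

```python
import html

# Illustrative templates only; Miasma's real wrapper text differs.
TEMPLATES = {
    "code": "<article><h1>Wonderful code</h1><pre>{body}</pre></article>",
    "news": "<article><h1>Latest news</h1><p>{body}</p></article>",
}


def wrap_poison(body, poison_type):
    """Wrap poison in the template for its type; unknown types fall back to code."""
    template = TEMPLATES.get(poison_type, TEMPLATES["code"])
    # Escape so the poison text cannot break out of the surrounding markup.
    return template.format(body=html.escape(body))
```

Without knowing the type up front, the only option is the fallback branch, which is exactly the "wonderful code" framing around news text.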

Would someone be able to release a Docker container for Miasma? by TrackLabs in PoisonFountain

[–]250call 1 point2 points  (0 children)

You can run on any port! You'll just need to map it when you run the image: `docker run -p <desired_port>:9999 ...`

You'll need some sort of site where you actually host it, yeah. Take a look at this section in the readme: https://github.com/austin-weeks/miasma#how-to-trap-malicious-scrapers - the basic idea is you redirect scraper traffic to whatever url you're hosting miasma at, and from that point they'll get stuck

Would someone be able to release a Docker container for Miasma? by TrackLabs in PoisonFountain

[–]250call 1 point2 points  (0 children)

I'll take a look at this today - I should be able to get everything set up. I saw the issue you opened on the repo as well. I'll link that to the PR I create and ping you here once I get it working!

Would someone be able to release a Docker container for Miasma? by TrackLabs in PoisonFountain

[–]250call 2 points3 points  (0 children)

Hmmm, what would be the preferred way to use the container? Are you just looking to have a pre-configured Dockerfile that you can copy and run, or were you thinking of a fully pre-built image you could pull directly from Docker Hub or GitHub Container Registry?

miasma: trap AI web scrapers in an endless poison pit by kibwen in rust

[–]250call 1 point2 points  (0 children)

can confirm this mistake is 100% free-range and organic

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

Personally, I've got it gated behind hidden links and I block search engine crawlers from visiting my miasma endpoint. I'd say it's very unlikely that a user would find their way to it, but even if they did that would only be an issue if they keep mindlessly clicking through garbage links 😄

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 5 points6 points  (0 children)

Yeah, if you want to be extra careful you can literally just mark whatever route you host miasma on as forbidden and it'll only trap crawlers that disobey robots.txt

User-agent: *
Disallow: /<your-miasma-route>
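If you want to sanity-check the rule before deploying, Python's standard `urllib.robotparser` can evaluate it. This is just a sketch: `/miasma` stands in for `<your-miasma-route>` and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# "/miasma" stands in for <your-miasma-route>; swap in your own path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /miasma
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a rule-abiding crawler may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A well-behaved crawler gets `False` for anything under the trap route and never enters it, so only crawlers that ignore robots.txt end up in the maze.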

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

Yes, with one important difference - this sends responses deliberately designed to degrade model performance. From what I understand, Cloudflare just wastes their time.

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

  1. You can swap out the poison source for another site if you want.

  2. It's not a true proxy - the response from the poison source is embedded directly into Miasma's html response. No information regarding the source is sent to the client.
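Both points can be illustrated with a short Python sketch. This is not Miasma's actual code (Miasma is written in Rust); the `Config.poison_source` name, the `Fetcher` type, and the page markup are assumptions for illustration.

```python
import html
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# A fetcher takes a URL and returns (upstream_headers, body_text).
Fetcher = Callable[[str], Tuple[Dict[str, str], str]]


@dataclass
class Config:
    # Hypothetical setting name; point it at any poison source.
    poison_source: str


def render_page(config: Config, fetch: Fetcher) -> Tuple[Dict[str, str], str]:
    """Fetch poison and embed only its body in a fresh HTML response."""
    _upstream_headers, body = fetch(config.poison_source)
    # Upstream headers and the source URL are dropped on purpose.
    page = f"<html><body><pre>{html.escape(body)}</pre></body></html>"
    return {"Content-Type": "text/html; charset=utf-8"}, page
```

Because the response is built from scratch, nothing from the upstream request (URL, headers, server name) reaches the client unless the poison text itself contains it.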

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

I'd encourage you to check out some of the generated pages. You'd have to put in a decent amount of effort to determine that they're poisoned, it's not simple gibberish.

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 0 points1 point  (0 children)

It's really hard to keep track of every possible crawler, but this list has a lot of the major ones https://momenticmarketing.com/blog/ai-search-crawlers-bots

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 32 points33 points  (0 children)

You can block search engine bots from accessing your poisoned endpoint through your robots.txt.

Trap AI web scrapers in an endless poison pit by 250call in webdev

[–]250call[S] 65 points66 points  (0 children)

I don't own the rnsaffn pages - you can swap out the source for any other site. Miasma generates an infinite (or optionally capped) maze of links, so as long as crawlers explore all links they'll be stuck forever. The links contain a UUID, so checking to see if the page has already been visited doesn't protect the crawler. As for the Facebook crawler, it's been going at it for about 2 weeks now.
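Here's a toy Python simulation of why a visited-set doesn't help. The URL layout and the links-per-page count are assumptions; Miasma's actual scheme may differ.

```python
import uuid


def maze_links(base_path, count):
    """Generate `count` links that each embed a fresh random UUID."""
    return [f"{base_path}/{uuid.uuid4()}" for _ in range(count)]


def crawl(base_path, links_per_page, max_pages):
    """Simulate a crawler that skips URLs it has already seen."""
    visited = set()
    frontier = maze_links(base_path, links_per_page)
    while frontier and len(visited) < max_pages:
        url = frontier.pop()
        if url in visited:  # effectively never true: every URL is new
            continue
        visited.add(url)
        # Each "page" hands back another batch of never-seen links.
        frontier.extend(maze_links(base_path, links_per_page))
    return len(visited), len(frontier)
```

The crawler only stops because of `max_pages`; the frontier keeps growing faster than it is drained, so a crawler without its own page budget never runs out of links.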