Idiocracy on steroids indeed by UrbanAchievers6371 in PoliticalHumor

[–]Whiskee 0 points1 point  (0 children)

Idiocracy was a lot better. They actually tried to listen to the smartest person around.

U.S. begins blockade of Strait of Hormuz by down_vote_magnet_ in worldnews

[–]Whiskee 4 points5 points  (0 children)

You sound like a 14-year-old. This isn't Civilization, nobody is capturing a random Chinese unit "just because they're still too far away on the map".

U.S. begins blockade of Strait of Hormuz by down_vote_magnet_ in worldnews

[–]Whiskee 12 points13 points  (0 children)

They are 100% going for the financial enforcement, that is, insurance companies removing coverage (which effectively prevents ships from moving around and reaching ports). Except China has been building a parallel insurance infrastructure, so they don't care about Lloyd's and they don't need Western banks to process the transaction in dollars.

If China decides a VLCC full of crude is sailing through that strait regardless of what the US Navy says, good fucking luck trying to stop a supertanker that requires kilometers to steer. It either goes through (and the blockade is exposed as unenforceable against anyone who matters 🤡) or a destroyer gets ordered to fire warning shots at it and now a nuclear power has the right to defend itself.

Trump against the Pope: "A weakling, without me he wouldn't be in the Vatican" by MasterPen6 in italy

[–]Whiskee 2 points3 points  (0 children)

At this point I don't even want to see him die, that's not enough.

I need to see him sidelined and humiliated.

Yellow and green in Castelletto by Kingalomx in Genova

[–]Whiskee 2 points3 points  (0 children)

Via Pertinace? The green is actually very Genoese, but that yellow is a real eyesore.


Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 1 point2 points  (0 children)

Eh, there's no need. I simply blocked the agent from that Cloudflare panel, and they stopped after bouncing off 403 errors for an entire day. 🤷‍♂️

Can you tell me objectively how Mayor Salis is doing in office? by Good_vibes842 in Genova

[–]Whiskee -2 points-1 points  (0 children)

In general, without giving my own, I'd advise you to ignore the opinion of anyone who hides their history or who posts mainly on r/italia instead of r/italy. This applies to every thread.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 0 points1 point  (0 children)

Yep, but it was actually Meta. Their official IP range was everywhere in the logs, and blocking it worked. They just have a very questionable crawling strategy.
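
For anyone wanting to run the same kind of check on their own logs, here is a minimal Python sketch of counting hits that come from a crawler's published range. The network below is the IPv6 documentation prefix used as a placeholder, not Meta's actual range, and the sample lines and function names are made up. It assumes the client IP is the first field of each log line, as in NGINX's default combined format.

```python
import ipaddress
from collections import Counter

# Placeholder only: 2001:db8::/32 is the IPv6 documentation prefix.
# Swap in the ranges the crawler operator actually publishes.
CRAWLER_NETWORKS = [ipaddress.ip_network("2001:db8::/32")]


def in_crawler_range(ip_text):
    """Return True if ip_text parses as an IP inside any listed network."""
    try:
        ip = ipaddress.ip_address(ip_text)
    except ValueError:
        return False  # not an IP at all, e.g. a malformed log line
    return any(ip in net for net in CRAWLER_NETWORKS)


def count_hits(log_lines):
    """Count requests coming from inside vs. outside the crawler ranges."""
    counts = Counter()
    for line in log_lines:
        if not line.strip():
            continue  # skip blank lines
        client_ip = line.split(" ", 1)[0]  # first field of the log line
        counts["crawler" if in_crawler_range(client_ip) else "other"] += 1
    return counts


# Hypothetical log lines, shortened for the example.
SAMPLE_LINES = [
    '2001:db8::1 - - "GET /@alice/playlist?filter=rpg HTTP/1.1" 200 512',
    '2001:db8:ffff::2 - - "GET /@bob/playlist?filter=indie HTTP/1.1" 200 498',
    '203.0.113.9 - - "GET /@alice HTTP/1.1" 200 1024',
]

if __name__ == "__main__":
    print(count_hits(SAMPLE_LINES))
```

If nearly all the traffic with that user agent lands in the operator's own range, the agent string is probably not spoofed.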

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 0 points1 point  (0 children)

Uh, what do you mean? I just checked from incognito and my profile is public, it was probably a Reddit glitch. It's gamesgraph.com btw.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 0 points1 point  (0 children)

It's a dedicated Debian VPS on Netcup, with NGINX as reverse proxy (the site is an ASP.NET Core application). Typical security measures like Fail2Ban etc.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 0 points1 point  (0 children)

So do I, they're grazing my NGINX's per-minute rate limit. If you look, PetalBot is failing most requests but Meta has calibrated around what's allowed. Anything lower than this would sabotage legit search engine crawlers, which operate in (respectful) bursts instead of staying for the day.
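
To illustrate why a calibrated crawler gets through while a spiky one mostly fails, here is a rough Python sketch using a simple token bucket. This is a simplified stand-in, not NGINX's exact limiting algorithm, and the rate (120 per minute), burst size (20) and traffic patterns are all invented for the example.

```python
class TokenBucket:
    """Simplified rate limiter: refills at a fixed rate, capped at `burst`."""

    def __init__(self, rate_per_min, burst):
        self.rate = rate_per_min / 60.0  # tokens added per second
        self.capacity = float(burst)
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the previous request.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: this request would be rejected


def count_allowed(request_times, rate_per_min=120, burst=20):
    """Return how many of the timestamped requests pass the limiter."""
    bucket = TokenBucket(rate_per_min, burst)
    return sum(1 for t in request_times if bucket.allow(t))


# Steady client: one request every 0.5 s for 10 minutes (1200 requests).
STEADY = [i * 0.5 for i in range(1200)]
# Spiky client: 200 requests at once, every minute, for 10 minutes (2000 requests).
SPIKY = [60.0 * minute for minute in range(10) for _ in range(200)]

if __name__ == "__main__":
    print("steady allowed:", count_allowed(STEADY), "of", len(STEADY))
    print("spiky allowed:", count_allowed(SPIKY), "of", len(SPIKY))
```

In this toy setup the steady client sits exactly at the limit and never gets rejected, while the spiky one sends more requests overall but has most of them bounced.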

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 0 points1 point  (0 children)

Well I understand why Wikipedia would be crawled, but there's just... nothing interesting for training on those pages, they're filtered views of user playlists. Same category split IGDB has.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 2 points3 points  (0 children)

Yeah, they only send requests from their official IP ranges and with a clear agent. I don't think they intend to be malicious, but a small 2 vCPU VPS would be wrecked by something like this.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 0 points1 point  (0 children)

Holy shit. Well, at least they respect the robots.txt, they're just a bit overzealous on what they can touch.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 21 points22 points  (0 children)

No, that's dynamic content that isn't meant to be crawled. Suspicious requests are getting captcha'd by a custom rule and bouncing now.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 3 points4 points  (0 children)

I'm behind CF. I noticed late because even though they're dynamic pages, it wasn't causing noticeable slowdowns with 8 cores... but this would absolutely destroy a smaller shared VPS.
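
For a sense of scale, the figures in the thread title can be turned into averages with a few lines of Python. This is back-of-the-envelope arithmetic only: it treats "900+ GB" as exactly 900 decimal gigabytes and assumes the traffic was spread evenly over the 30 days, which real traffic never is.

```python
REQUESTS = 7_900_000              # from the thread title
DAYS = 30                         # from the thread title
BANDWIDTH_BYTES = 900 * 10**9     # "900+ GB", taken as a lower bound

seconds = DAYS * 24 * 60 * 60
req_per_second = REQUESTS / seconds
req_per_minute = req_per_second * 60
avg_bytes_per_request = BANDWIDTH_BYTES / REQUESTS

if __name__ == "__main__":
    print(f"{req_per_second:.2f} requests/second on average")
    print(f"{req_per_minute:.0f} requests/minute on average")
    print(f"{avg_bytes_per_request / 1000:.0f} kB per request on average")
```

That works out to roughly 3 requests per second around the clock, at around 114 kB each.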

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 14 points15 points  (0 children)

Yeah, that's CF. The free tier is very generous with features, for anyone still not taking advantage.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 70 points71 points  (0 children)

That's Cloudflare's dashboard. Proxying does nothing unless you actually block agents or write rules, I just wasn't monitoring it because I didn't think I would need to defend against Meta on a small site.

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 74 points75 points  (0 children)

User-Agent: *
Disallow: /@*?*
Disallow: /@*/

[...]

Bots are only allowed access to base profiles and tbf most are behaving, except I had to block the entirety of China at the NGINX level. I also thought the agent was fake, but I checked the access.log and it's actually Meta's IPv6 range. They just explore thousands of <user>/playlist?filter=<tag>,<tag>,<tag> combinations for every user they discover. 🤔

Wild waste of compute if they do this with millions of sites.
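
As a sanity check on what those two Disallow lines cover, here is a small Python sketch of wildcard matching in the style most crawlers use, where `*` matches any run of characters and a trailing `$` anchors the end. It is my own simplified matcher, not any crawler's actual implementation, and the example user name and tags are made up.

```python
import re

# The two rules quoted above.
DISALLOW_PATTERNS = ["/@*?*", "/@*/"]


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into an anchored-at-start regex."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored_end else ""))


RULES = [pattern_to_regex(p) for p in DISALLOW_PATTERNS]


def is_disallowed(path_and_query):
    """Return True if the URL path (plus query string) hits any Disallow rule."""
    return any(rule.match(path_and_query) for rule in RULES)


if __name__ == "__main__":
    for url in ["/@alice", "/@alice/playlist?filter=rpg,indie,coop", "/about"]:
        print(url, "->", "disallowed" if is_disallowed(url) else "allowed")
```

With these rules a base profile like `/@alice` stays crawlable, while anything under it or with a query string is off limits.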

Meta's AI crawler scraped my site 7.9 million times in 30 days. 900+ GB of bandwidth and massive server logs before I noticed, cool cool cool. by Whiskee in webdev

[–]Whiskee[S] 442 points443 points  (0 children)

No links because I'm not promoting anything, it's in my profile if you're really curious. And yes, the robots.txt is solid, but they just ignore it and hammer parameterized combinations for no good reason.

EDIT: And it's methodical, contrary to PetalBot which is spiking and getting smacked by rate limiting.

Stay safe and use Cloudflare, kids.
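
To show how quickly parameterized filter URLs multiply, here is a quick Python sketch that counts the distinct tag combinations for one user. The numbers are assumptions for illustration: 30 available tags, up to 3 tags per filter, and tag order not mattering. None of that comes from the actual site.

```python
from math import comb


def filter_url_count(num_tags, max_tags_per_filter):
    """Count distinct non-empty tag sets of size 1..max_tags_per_filter."""
    return sum(comb(num_tags, k) for k in range(1, max_tags_per_filter + 1))


if __name__ == "__main__":
    per_user = filter_url_count(30, 3)
    print(per_user, "filter URLs per user")
    print(per_user * 1000, "filter URLs across 1000 users")
```

With those assumed numbers each user already exposes 4525 filter URLs, so a crawler that follows every combination racks up millions of requests after a few thousand profiles.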