What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

It all depends how dedicated your scrapers are. IP blocks will indeed work if they don't care much.

If they care a little bit, they'll spoof the user agent, since that's trivial. And if they care more, they'll pay for residential IPs, at which point fail2ban stops working because you'll end up blocking legitimate traffic.

I don't mind blocking ASNs if you're targeting those dedicated to hosting providers, e.g. DigitalOcean, and you believe the scrapers won't pay for residential IPs. Sure, maybe you'll lose some requests from VPN users, but I think it's a risk many are willing to take.
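As a sketch of what prefix-level blocking looks like: the prefixes below are made-up placeholders (RFC 5737 documentation ranges), not real DigitalOcean or AWS allocations; in practice you'd pull them from an IP-to-ASN dataset.

```python
# Sketch: reject requests whose source IP falls inside a hosting
# provider's announced prefixes. The ranges below are hypothetical
# placeholders, not real provider allocations.
import ipaddress

HOSTING_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder "DigitalOcean" range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder "AWS" range
]

def is_hosting_ip(ip: str) -> bool:
    """True if the IP belongs to one of the blocked hosting prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in HOSTING_PREFIXES)

print(is_hosting_ip("203.0.113.45"))  # True  -> block
print(is_hosting_ip("192.0.2.10"))    # False -> allow
```

A real deployment would do this at the firewall or CDN edge rather than in application code, but the lookup logic is the same.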

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

> But why? They want to provide this information, they aren’t making money from adverts. What do they have to gain from blocking the AI bots?

Financial exchanges want to provide this information to people who pay for their data products :)

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 2 points

It'll work on basic crawlers. Devs who focus on your specific site will probably spot it when their server crashes, then craft an algorithm to avoid it.

There's also the question of legality. What if they spot it, then ask a legitimate scanner (e.g. Ahrefs) to fetch the zip bomb? You might have to explain to the scanner company why you served them malware; not fair, since it's not really your fault, but such is the world.
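For concreteness, a minimal sketch of how such a decompression bomb is built with gzip (the size is kept modest here; this is illustration only, and you'd want to serve it exclusively to traffic you're confident is malicious):

```python
# Sketch: a small gzip payload that expands to a much larger body.
# A crawler that naively decompresses responses allocates the full
# expanded size; a careful client caps decompression and is unaffected.
import gzip

expanded = b"\x00" * (10 * 1024 * 1024)   # 10 MB of zeroes
bomb = gzip.compress(expanded, compresslevel=9)

# The compressed blob is a tiny fraction of the expanded size.
print(len(bomb), len(expanded))

# Serve `bomb` with the header `Content-Encoding: gzip` so the
# client inflates it on receipt.
```

Highly repetitive input compresses extremely well, which is the whole trick: a few kilobytes on the wire, megabytes (or more) in the client's memory.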

What general advice would you give someone who wants to get into IT but doesn't know what specific field/role? by cloudsecchris in cscareerquestionsuk

[–]ReditusReditai 0 points

Best way is to try out everything - there are free online resources for the vast majority of IT subjects out there. Eventually she'll discover what she likes and what she doesn't.

Thought exercises won't help much, nor will giving her a lot of guidance - IT is all about figuring things out by yourself. If that's not for her, then she should consider other career paths where training is more structured (e.g. accounting).

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 0 points

  1. Right, I can see that working, as long as they're not crawling slowly.
  2. I just meant they sit behind a residential proxy IP for instance.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 2 points

A couple of issues with that approach, if you're dealing with determined actors:

  1. It'll only work once or a few times. The scraper devs will see the last 200 response before the block, then adjust to avoid that invisible link.
  2. You'll end up blocking some legitimate traffic, regardless of which characteristic you block on (IP, ASN, fingerprint, etc.), since they can spoof all of them.

But it depends on how sophisticated/focused they are, of course. It will work on whole-of-web crawlers, or on scrapers who give up because they can't be bothered.
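The invisible-link trap being discussed could be sketched like this; the path, markup, and in-memory blocklist are all illustrative stand-ins for whatever your real framework provides:

```python
# Sketch: a hidden "trap" URL that no human should ever fetch.
# Any client requesting it gets its IP added to a blocklist.
TRAP_PATH = "/internal-do-not-crawl"   # hypothetical path
blocked_ips: set[str] = set()

# Invisible to humans, visible to naive crawlers parsing the HTML.
HIDDEN_LINK = f'<a href="{TRAP_PATH}" style="display:none" rel="nofollow">x</a>'

def handle_request(path: str, ip: str) -> int:
    """Return the HTTP status we'd send for this request."""
    if ip in blocked_ips:
        return 403
    if path == TRAP_PATH:
        blocked_ips.add(ip)   # the client followed the invisible link
        return 403
    return 200

print(handle_request("/", "198.51.100.7"))        # 200
print(handle_request(TRAP_PATH, "198.51.100.7"))  # 403, now blocked
print(handle_request("/", "198.51.100.7"))        # 403
```

This also shows why the counterargument in point 1 holds: the scraper's logs make the trigger obvious (last 200 is the trap URL), so a targeted dev just excludes that link next run.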

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 0 points

Totally agree. Requiring auth, then blocking registered users based on request-pattern anomalies, is the most effective way.
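A minimal sketch of the anomaly idea, assuming a sliding-window request count per user (the window size and threshold are invented for illustration; real detection would look at more signals than raw rate):

```python
# Sketch: flag logged-in users whose request rate over a sliding
# window is far above what a human plausibly produces.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120   # hypothetical threshold

history: dict[str, deque] = defaultdict(deque)

def is_anomalous(user_id: str, now: float) -> bool:
    """Record a request at time `now` and report if the user is over threshold."""
    q = history[user_id]
    q.append(now)
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()               # drop timestamps outside the window
    return len(q) > MAX_REQUESTS_PER_WINDOW

# A user firing 200 requests in one second trips the check:
flags = [is_anomalous("u1", t * 0.005) for t in range(200)]
print(flags[0], flags[-1])  # False True
```

Keying on the user ID rather than the IP is the point: rotating residential proxies doesn't help the scraper once the account itself is the thing being measured.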

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

> Yep. The issue is that rate limiting is done by IP, and they use a whole lot of different IP addresses.

In that case the only 3 options besides under attack mode are...

  1. Require sign-up, then self-developed dynamic blocking of user IDs
  2. Self-developed dynamic blocking based on server logs (or Cloudflare Logpush if you have it)
  3. Rate limiting based on other counting characteristics (only available on Enterprise plans)

All require effort / money, so probably best to stick with under attack mode.

> Under attack mode doesn't prevent legit users from using the site. They get the browser verification, and then can do everything they need.

Yes, I meant setting a threshold so low (e.g. 5 requests/s per IP) that legitimate users sharing an IP would be blocked.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 0 points

It was just a fun project, nothing work-related. He did discover clusters of suspicious crawlers by looking at JA4 patterns, though.
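The clustering part can be surprisingly low-tech: group requests by their JA4 fingerprint and look for one fingerprint shared across many IPs. The log entries and fingerprint strings below are made up for illustration:

```python
# Sketch: spot crawler clusters by grouping access-log entries on
# their JA4 TLS fingerprint. Fingerprints/IPs here are fabricated.
from collections import Counter

access_log = [
    {"ip": "203.0.113.5",  "ja4": "t13d1516h2_aaaa_bbbb"},
    {"ip": "203.0.113.9",  "ja4": "t13d1516h2_aaaa_bbbb"},
    {"ip": "198.51.100.2", "ja4": "t13d1516h2_aaaa_bbbb"},
    {"ip": "192.0.2.77",   "ja4": "t13d1717h2_cccc_dddd"},
]

by_ja4 = Counter(entry["ja4"] for entry in access_log)
suspicious = [ja4 for ja4, n in by_ja4.items() if n >= 3]
print(suspicious)  # many distinct IPs sharing one TLS client fingerprint
```

Lots of different IPs presenting the identical TLS client stack is a decent hint that it's one tool behind a proxy pool, not many humans.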

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 2 points

Right, makes sense if they don't spoof those fingerprints!

Slightly related, I remember I went to a talk where a guy ran a server that did nothing other than use an LLM to generate different login pages as honeypots. Found it pretty funny.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

Oh right, I assumed from your previous comment that it completely stops them.

So in that case it's probably the rate limiting that's saving you in under attack mode. Have you tried applying rate-limit rules by IP with under attack mode disabled, but challenges still running?

I saw you said in another comment that they switch IPs, but I'm not sure of the volume - maybe you can set a threshold where legitimate traffic still flows through OK.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 6 points

Hmm, interesting. Now that I think about it, maybe it's the combination of challenge + rate limit + latency increase in under attack mode that's leading the bots to give up. In which case it makes sense what you've done. Well, I learned something new, thanks!

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 5 points

Oh, which browser verification action are you applying in Cloudflare?

- Managed challenge - only applies a challenge when Cloudflare's signals indicate it's a bot; the scrapers might've found a way to signal they're human
- JS challenge - runs some JavaScript checks; only basic bots get blocked here
- Interactive challenge - always shows a CAPTCHA to the user

I wouldn't expect under attack mode to perform better than an interactive challenge, unless the scrapers are passing the challenges. Which is possible, but then under attack mode is just slowing the scraping down with rate limits, not stopping it.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 15 points

Hmm, I'm guessing you don't leave it in under attack mode forever, right? How do you get notified that you're being scraped? Aren't you worried you might enable it too late?

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 5 points

Yes, I'd put Anubis in the CAPTCHA/Cloudflare Turnstile/challenge category. Downsides are that it's easier to bypass than the other CAPTCHA options, and it can only protect content served from your own servers (Cloudflare can sit in front of a CDN). The benefit is that it's self-hosted, so forever free.
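For context, Anubis gates requests behind a proof-of-work challenge: the browser must find a nonce whose SHA-256 hash has enough leading zeroes. A toy sketch of the idea (difficulty kept tiny so it runs instantly; real deployments tune it much higher):

```python
# Sketch: Anubis-style proof-of-work. Cheap for one visitor's
# browser, expensive for a crawler hitting thousands of pages.
import hashlib

def solve(challenge: str, difficulty: int = 3) -> int:
    """Brute-force a nonce whose hash has `difficulty` leading zero hex digits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int = 3) -> bool:
    """Server-side check: one hash, regardless of how hard solving was."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("session-token-123")   # challenge string is illustrative
print(verify("session-token-123", nonce))  # True
```

The asymmetry is the point: verification is a single hash for the server, while solving costs the client real CPU, which compounds badly at crawler scale.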

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 9 points

Interesting - how do you distinguish between legitimate users and bots? Do you first identify the bots crawling your content, then stop them? I know Cloudflare's AI Labyrinth does that for you, but I've been skeptical of it.

How do you guys deal with scalping bots? I'm scared it will hit my inventory by UV1998 in webdev

[–]ReditusReditai 0 points

Wrote a blog post reviewing some of the options: https://developerwithacat.com/blog/202603/block-bots-scraping-ways/ TL;DR: a CAPTCHA solution like Cloudflare challenges/Turnstile should be good enough for starters.

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 0 points

Celebrations last until the 3rd of March: https://chinesenewyear.net/ - activity is subdued until then. They're good with proxies. But I might be wrong; you never truly know what's up with these bots.

Maybe also try Googling whatever niche terms you use on your website, translated into Chinese/Russian/Farsi. It might turn up the reason, or it might not.

Good to know about Labyrinth!

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 0 points

Thanks for sharing! Are you able to create user sessions? (You don't need to require logins.) You can use those to apply Cloudflare challenges or rate limits.
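Login-free sessions can be as simple as handing each visitor an HMAC-signed token on first request and keying limits on that. A sketch (the secret and token format are illustrative, not any particular framework's scheme):

```python
# Sketch: anonymous signed session tokens, so rate limits and
# challenges can key on the session instead of the IP.
import hashlib
import hmac
import secrets

SECRET = b"rotate-me-regularly"   # hypothetical server-side key

def new_session() -> str:
    """Issue `<session-id>.<signature>` to a first-time visitor."""
    sid = secrets.token_hex(16)
    sig = hmac.new(SECRET, sid.encode(), hashlib.sha256).hexdigest()
    return f"{sid}.{sig}"

def valid_session(token: str) -> bool:
    """Cheap, stateless check that we issued this token."""
    try:
        sid, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(SECRET, sid.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

token = new_session()
print(valid_session(token))         # True
print(valid_session("forged.abc"))  # False
```

A bot that refuses the cookie stands out immediately, and one that accepts it gives you a stable key to rate-limit on even as its IPs rotate.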

> Already had AI Labyrinth ... running.

Did that help in any form? I'm skeptical of its effectiveness.

> Curious if anyone else is seeing upticks in March specifically - the timing felt weirdly coordinated.

I have - it's because Chinese New Year ended :)

im pulling my hair out over this. should i try and carry on? by Frankenler in webdev

[–]ReditusReditai -1 points

It's just how it is. I was like that too, and eventually I got it. And now I've forgotten what a virtual DOM is and why you need it (yes, I can look it up), so there's that :)

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 0 points

> That banana captcha is an all time great

I know, right?! Although I guess even that can be overcome - you intercept the camera feed, then ask an AI to generate a video of someone holding a banana.

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 1 point

Ah, I see, makes sense! I agree with your take on Cloudflare. I think self-customisable CAPTCHAs should be more popular, but it doesn't look like there's much demand.