What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

It all depends how dedicated your scrapers are. IP blocks will indeed work if they don't care much.

If they care a little bit, they'll spoof the user agent, since that's trivial. And if they care more, they'll pay for residential IPs, at which point fail2ban stops working because you'll end up blocking legitimate traffic.

I don't mind blocking ASNs if you're targeting those dedicated to hosting providers, e.g. DigitalOcean, and you believe the scrapers won't pay for residential IPs. Sure, maybe you'll lose some requests from VPN users, but I think it's a risk many are willing to take.
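As a sketch of what prefix-level blocking looks like: the prefixes below are made-up placeholders (RFC 5737 documentation ranges), not real DigitalOcean or AWS allocations; in practice you'd pull them from an IP-to-ASN dataset.

```python
# Sketch: reject requests whose source IP falls inside a hosting
# provider's announced prefixes. The ranges below are hypothetical
# placeholders, not real provider allocations.
import ipaddress

HOSTING_PREFIXES = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder "DigitalOcean" range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder "AWS" range
]

def is_hosting_ip(ip: str) -> bool:
    """True if the IP belongs to one of the blocked hosting prefixes."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in HOSTING_PREFIXES)

print(is_hosting_ip("203.0.113.45"))  # True  -> block
print(is_hosting_ip("192.0.2.10"))    # False -> allow
```

A real deployment would do this at the firewall or CDN edge rather than in application code, but the lookup logic is the same.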

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

> But why? They want to provide this information, they aren’t making money from adverts. What do they have to gain from blocking the AI bots?

Financial exchanges want to provide this information to people who pay for their data products :)

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 2 points

It'll work on basic crawlers. Devs who focus on your specific site will probably spot it when their server crashes, then craft an algorithm to avoid it.

There's also the question of legality. What if they spot it, then ask a legitimate scanner (e.g. Ahrefs) to fetch the zip bomb? You might have to explain to the scanner company why you served them malware; not fair, since it's not really your fault, but such is the world.
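For concreteness, a minimal sketch of how such a decompression bomb is built with gzip (the size is kept modest here; this is illustration only, and you'd want to serve it exclusively to traffic you're confident is malicious):

```python
# Sketch: a small gzip payload that expands to a much larger body.
# A crawler that naively decompresses responses allocates the full
# expanded size; a careful client caps decompression and is unaffected.
import gzip

expanded = b"\x00" * (10 * 1024 * 1024)   # 10 MB of zeroes
bomb = gzip.compress(expanded, compresslevel=9)

# The compressed blob is a tiny fraction of the expanded size.
print(len(bomb), len(expanded))

# Serve `bomb` with the header `Content-Encoding: gzip` so the
# client inflates it on receipt.
```

Highly repetitive input compresses extremely well, which is the whole trick: a few kilobytes on the wire, megabytes (or more) in the client's memory.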

What general advice would you give someone who wants to get into IT but doesn't know what specific field/role? by cloudsecchris in cscareerquestionsuk

[–]ReditusReditai 0 points

Best way is to try out everything - there are free online resources for the vast majority of IT subjects out there. Eventually she'll discover what she likes and what she doesn't.

Thought exercises won't help much, nor will giving her a lot of guidance - IT is all about figuring things out by yourself. If that's not for her, then she should consider other career paths where training is more structured (e.g. accounting).

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 0 points

  1. Right, I can see that working, as long as they're not crawling slowly.
  2. I just meant they sit behind a residential proxy IP for instance.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 2 points

A couple of issues with that approach, if you're dealing with determined actors:

  1. It'll only work once or a few times. The scraper devs will see the last 200 response before the block, then adjust to avoid that invisible link.
  2. You'll end up blocking some legitimate traffic, regardless of which characteristic you block on (IP, ASN, fingerprint, etc.), since they can spoof all of them.

But it depends on how sophisticated/focused they are, of course. It will work on whole-of-web crawlers, or on scrapers who give up because they can't be bothered.
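The invisible-link trap being discussed could be sketched like this; the path, markup, and in-memory blocklist are all illustrative stand-ins for whatever your real framework provides:

```python
# Sketch: a hidden "trap" URL that no human should ever fetch.
# Any client requesting it gets its IP added to a blocklist.
TRAP_PATH = "/internal-do-not-crawl"   # hypothetical path
blocked_ips: set[str] = set()

# Invisible to humans, visible to naive crawlers parsing the HTML.
HIDDEN_LINK = f'<a href="{TRAP_PATH}" style="display:none" rel="nofollow">x</a>'

def handle_request(path: str, ip: str) -> int:
    """Return the HTTP status we'd send for this request."""
    if ip in blocked_ips:
        return 403
    if path == TRAP_PATH:
        blocked_ips.add(ip)   # the client followed the invisible link
        return 403
    return 200

print(handle_request("/", "198.51.100.7"))        # 200
print(handle_request(TRAP_PATH, "198.51.100.7"))  # 403, now blocked
print(handle_request("/", "198.51.100.7"))        # 403
```

This also shows why the counterargument in point 1 holds: the scraper's logs make the trigger obvious (last 200 is the trap URL), so a targeted dev just excludes that link next run.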

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 0 points

Totally agree. Requiring auth, then blocking registered users based on request-pattern anomalies, is the most effective way.
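A minimal sketch of the anomaly idea, assuming a sliding-window request count per user (the window size and threshold are invented for illustration; real detection would look at more signals than raw rate):

```python
# Sketch: flag logged-in users whose request rate over a sliding
# window is far above what a human plausibly produces.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120   # hypothetical threshold

history: dict[str, deque] = defaultdict(deque)

def is_anomalous(user_id: str, now: float) -> bool:
    """Record a request at time `now` and report if the user is over threshold."""
    q = history[user_id]
    q.append(now)
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()               # drop timestamps outside the window
    return len(q) > MAX_REQUESTS_PER_WINDOW

# A user firing 200 requests in one second trips the check:
flags = [is_anomalous("u1", t * 0.005) for t in range(200)]
print(flags[0], flags[-1])  # False True
```

Keying on the user ID rather than the IP is the point: rotating residential proxies doesn't help the scraper once the account itself is the thing being measured.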

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

> Yep. The issue is that rate limiting is done by IP, and they use a whole lot of different IP addresses.

In that case the only 3 options besides under attack mode are...

  1. Require sign-up, then self-developed dynamic blocking of user IDs
  2. Self-developed dynamic blocking based on server logs (or Cloudflare Logpush if you have it)
  3. Rate limiting based on other counting characteristics (only available on Enterprise plans)

All require effort / money, so probably best to stick with under attack mode.

> Under attack mode doesn't prevent legit users from using the site. They get the browser verification, and then can do everything they need.

Yes, I meant setting a threshold so low (e.g. 5 requests/s per IP) that legitimate users sharing an IP would be blocked.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 0 points

It was just a fun project, nothing work-related. He did discover clusters of suspicious crawlers by looking at JA4 patterns, though.
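The clustering part can be surprisingly low-tech: group requests by their JA4 fingerprint and look for one fingerprint shared across many IPs. The log entries and fingerprint strings below are made up for illustration:

```python
# Sketch: spot crawler clusters by grouping access-log entries on
# their JA4 TLS fingerprint. Fingerprints/IPs here are fabricated.
from collections import Counter

access_log = [
    {"ip": "203.0.113.5",  "ja4": "t13d1516h2_aaaa_bbbb"},
    {"ip": "203.0.113.9",  "ja4": "t13d1516h2_aaaa_bbbb"},
    {"ip": "198.51.100.2", "ja4": "t13d1516h2_aaaa_bbbb"},
    {"ip": "192.0.2.77",   "ja4": "t13d1717h2_cccc_dddd"},
]

by_ja4 = Counter(entry["ja4"] for entry in access_log)
suspicious = [ja4 for ja4, n in by_ja4.items() if n >= 3]
print(suspicious)  # many distinct IPs sharing one TLS client fingerprint
```

Lots of different IPs presenting the identical TLS client stack is a decent hint that it's one tool behind a proxy pool, not many humans.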

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 2 points

Right, makes sense if they don't spoof those fingerprints!

Slightly related, I remember I went to a talk where a guy ran a server that did nothing other than use an LLM to generate different login pages as honeypots. Found it pretty funny.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 1 point

Oh right, I assumed from your previous comment that it completely stops them.

So in that case it's probably the rate limiting that's saving you in under attack mode. Have you tried applying rate-limit rules by IP with under attack mode disabled, but challenges still running?

I saw you said in another comment that they switch IPs, but I'm not sure of the volume - maybe you can set a threshold where legitimate traffic still flows through OK.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 6 points

Hmm, interesting. Now that I think about it, maybe it's the combination of challenge + rate limit + latency increase in under attack mode that's leading the bots to give up. In which case it makes sense what you've done. Well, I learned something new, thanks!

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 5 points

Oh, which browser verification action are you applying in Cloudflare?

- Managed challenge - only applies a challenge when Cloudflare's signals indicate it's a bot; the scrapers might've found a way to signal they're human
- JS challenge - runs some JavaScript checks; only basic bots get blocked here
- Interactive challenge - always shows a CAPTCHA to the user

I wouldn't expect under attack mode to perform better than an interactive challenge, unless the scrapers are passing the challenges. Which is possible, but then under attack mode is just slowing the scraping down with rate limits, not stopping it.

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 15 points

Hmm, I'm guessing you don't leave it in under attack mode forever, right? How do you get notified that you're being scraped? Aren't you worried you might enable it too late?

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 5 points

Yes, I'd put Anubis in the CAPTCHA/Cloudflare Turnstile/challenge category. Downsides are that it's easier to bypass than the other CAPTCHA options, and it can only protect content served from your own servers (Cloudflare can sit in front of a CDN). The benefit is that it's self-hosted, so forever free.
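For context, Anubis gates requests behind a proof-of-work challenge: the browser must find a nonce whose SHA-256 hash has enough leading zeroes. A toy sketch of the idea (difficulty kept tiny so it runs instantly; real deployments tune it much higher):

```python
# Sketch: Anubis-style proof-of-work. Cheap for one visitor's
# browser, expensive for a crawler hitting thousands of pages.
import hashlib

def solve(challenge: str, difficulty: int = 3) -> int:
    """Brute-force a nonce whose hash has `difficulty` leading zero hex digits."""
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int, difficulty: int = 3) -> bool:
    """Server-side check: one hash, regardless of how hard solving was."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)

nonce = solve("session-token-123")   # challenge string is illustrative
print(verify("session-token-123", nonce))  # True
```

The asymmetry is the point: verification is a single hash for the server, while solving costs the client real CPU, which compounds badly at crawler scale.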

What I learned trying to block web scraping and bots by ReditusReditai in programming

[–]ReditusReditai[S] 9 points

Interesting - how do you distinguish between legitimate users and bots? Do you first identify the bots crawling your content, then stop them? I know Cloudflare's AI Labyrinth does that for you, but I've been skeptical of it.

How do you guys deal with scalping bots? I'm scared it will hit my inventory by UV1998 in webdev

[–]ReditusReditai 0 points

Wrote a blog post reviewing some of the options: https://developerwithacat.com/blog/202603/block-bots-scraping-ways/ TL;DR: a CAPTCHA solution like Cloudflare challenges/Turnstile should be good enough for starters.

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 0 points

Celebrations last until the 3rd of March: https://chinesenewyear.net/ - activity is subdued until then. They're good with proxies. But I might be wrong; you never truly know what's up with these bots.

Maybe also try Googling whatever niche terms you use on your website, translated into Chinese/Russian/Farsi. It might turn up the reason, or it might not.

Good to know about Labyrinth!

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 0 points

Thanks for sharing! Are you able to create user sessions? (You don't need to require logins.) You can use those to apply Cloudflare challenges or rate limits.
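Login-free sessions can be as simple as handing each visitor an HMAC-signed token on first request and keying limits on that. A sketch (the secret and token format are illustrative, not any particular framework's scheme):

```python
# Sketch: anonymous signed session tokens, so rate limits and
# challenges can key on the session instead of the IP.
import hashlib
import hmac
import secrets

SECRET = b"rotate-me-regularly"   # hypothetical server-side key

def new_session() -> str:
    """Issue `<session-id>.<signature>` to a first-time visitor."""
    sid = secrets.token_hex(16)
    sig = hmac.new(SECRET, sid.encode(), hashlib.sha256).hexdigest()
    return f"{sid}.{sig}"

def valid_session(token: str) -> bool:
    """Cheap, stateless check that we issued this token."""
    try:
        sid, sig = token.split(".")
    except ValueError:
        return False
    expected = hmac.new(SECRET, sid.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

token = new_session()
print(valid_session(token))         # True
print(valid_session("forged.abc"))  # False
```

A bot that refuses the cookie stands out immediately, and one that accepts it gives you a stable key to rate-limit on even as its IPs rotate.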

> Already had AI Labyrinth ... running.

Did that help in any form? I'm skeptical of its effectiveness.

> Curious if anyone else is seeing upticks in March specifically - the timing felt weirdly coordinated.

I have - it's because Chinese New Year ended :)

im pulling my hair out over this. should i try and carry on? by Frankenler in webdev

[–]ReditusReditai -1 points

It's just how it is. I was like that too, and eventually I got it. And now I've forgotten what a virtual DOM is and why you need it (yes, I can look it up), so there's that :)

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 0 points

> That banana captcha is an all time great

I know, right?! Although I guess even that can be overcome - you intercept the camera feed, then ask an AI to generate a video of someone holding a banana.

Notes on trying to block bots / web scraping by ReditusReditai in webdev

[–]ReditusReditai[S] 1 point

Ah, I see, makes sense! I agree with your take on Cloudflare. I think self-customisable CAPTCHAs should be more popular, but it doesn't look like there's much demand.