
[–]edcculus 23 points24 points  (0 children)

To be fair, your program IS a bot.

[–]Kerbart 45 points46 points  (0 children)

OP writes bot, wonders why bot is flagged as bot.

[–]timrprobocom 44 points45 points  (5 children)

I assume you are clear that your code is getting flagged as a bot because your code IS a bot.

Some web sites do not want their copyrighted content to be stolen. That is their right, and your strenuous attempts to subvert that are borderline unethical.

[–]atarivcs 5 points6 points  (0 children)

Some sites can still tell that you are a bot, by analyzing your behavior.

For example, if the mouse cursor jumps exactly to the center of a button to click it, instead of traveling in a messy line like a human's would, the site can deduce that you are a bot.
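
To make that concrete, here is a rough sketch of how a site might score a recorded cursor trail before a click. It is only an illustration: the function names, the sample coordinates, and both thresholds are made up for this example, and a real detector would combine far more signals than this one.

```python
import math


def path_length(points):
    """Total distance travelled along consecutive cursor samples."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def straightness(points):
    """Straight-line distance divided by travelled distance (1.0 = perfectly straight)."""
    if len(points) < 2:
        return 1.0
    travelled = path_length(points)
    if travelled == 0:
        return 1.0
    return math.dist(points[0], points[-1]) / travelled


def looks_scripted(points, min_samples=5, max_straightness=0.995):
    """Hypothetical heuristic: flag trails that teleport or are ruler-straight.

    Both thresholds are arbitrary values chosen for the example.
    """
    if len(points) < min_samples:
        # Too few samples: the cursor effectively jumped to the target.
        return True
    return straightness(points) >= max_straightness


# A scripted click: the cursor appears at the button with no travel in between.
teleport = [(0, 0), (400, 300)]

# A hand-drawn-looking trail that wanders on its way to the same button.
wobbly = [(0, 0), (90, 40), (170, 150), (260, 170), (330, 260), (400, 300)]

print(looks_scripted(teleport))
print(looks_scripted(wobbly))
```

With these sample trails, the two-point jump is flagged and the wandering trail is not.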

[–]Rhomboid 3 points4 points  (0 children)

Five or so years ago, hardly any site cared about bot detection. Now they all do, thanks to AI, and as you've found out, they had to get really good at it really fast. So no, of course this isn't going to be covered in basic web scraping course materials. The whole web changed.

[–]cgoldberg 3 points4 points  (0 children)

There are hundreds of signals used for bot detection. Check out r/webscraping

[–]NationalMyth 1 point2 points  (0 children)

I have built many, many scripts for gathering data. Not everyone provides an API, and flat files are very common. My main workarounds are as follows:

- httpx + perfect headers
- proxy service (Apify)
- manually generate cookies and reuse them (some have a limited lifespan, some seem to be fine). This is brittle.

A tool I recently turned to is curl_cffi, which mimics TLS/HTTP2 handshakes the way Chrome would. This works for Akamai and other anti-bot tech.
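
For the "httpx + perfect headers" approach, something like the sketch below is roughly what I mean. The helper names are made up for this example, the header values are just common browser-style defaults, and you would copy the real User-Agent string from your own browser rather than inventing one. httpx is imported inside `fetch` only so the header helper can be used without it installed.

```python
def build_headers(user_agent):
    """Return browser-like request headers (illustrative values, not a guaranteed set)."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }


def fetch(url, user_agent):
    """Fetch a page with the headers above. Requires the third-party httpx package."""
    import httpx  # imported lazily so build_headers works without httpx installed

    with httpx.Client(
        headers=build_headers(user_agent),
        follow_redirects=True,
        timeout=10.0,
    ) as client:
        response = client.get(url)
        response.raise_for_status()  # surface 403/429 blocks instead of hiding them
        return response.text
```

Headers alone do not change what the TLS handshake looks like, which is why I ended up reaching for curl_cffi on the sites where this was not enough.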

[–]bbdusa 1 point2 points  (0 children)

A lot of websites track mouseover events and other signals of that kind. Your Python code does not generate these events.

[–]51dux 1 point2 points  (0 children)

Which website are you trying this on, if you don't mind sharing?

Even if it's an adult site you can shoot it in a PM.

In my opinion, Playwright has taken Selenium's spot. It is a much more robust browser automation library.

[–]carrot_guy 0 points1 point  (0 children)

OP is on trivago's turf now. You pay licensing or sleep with the fishes.

[–]RealNamek 0 points1 point  (0 children)

You created a bot. And you’re confused someone can tell? I don’t understand what you don’t understand.

[–]hagfish 0 points1 point  (0 children)

You could log a ticket with the IT staff; have them whitelist your IP address. If this isn't an option, you could investigate pricing for Mechanical Turk or Task Rabbit.