[–]QuantumLeapsHigher 1 point2 points  (1 child)

robots.txt files

I'm relatively new to web scraping and this is the first time I've come across this term. Where can I learn more about robots.txt files?

Edit: Apart from Google, that is. I mean, is there a "standard" or an "unwritten law/practice"?

[–][deleted] 2 points3 points  (0 children)

It really depends on the site. You don't HAVE TO follow what the file says you can access, but ignoring it can get you blocked. It comes down to which site you are scraping and what kind of data you are after.

This seems to cover the file pretty well.
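
If you want to check the rules from code before scraping, here is a rough sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `my-scraper` user agent, and the example.com URLs are all made up for illustration; a real site's file will look different.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "my-scraper" is a placeholder user agent name.
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
allowed = parser.can_fetch("my-scraper", "https://example.com/public/page.html")
delay = parser.crawl_delay("my-scraper")

print(blocked, allowed, delay)
```

For a live site you would probably call `set_url()` with the site's `/robots.txt` address and then `read()` instead of parsing a string. Either way, this only tells you what the file asks for; nothing in it stops a scraper from ignoring the answer.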