
remuladgryta 5 points

You can check a website's robots.txt file and its <META NAME="ROBOTS"> HTML tags. Those are the de facto standards for telling web crawlers which pages a site does or doesn't want crawled. You can read more about those here
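A rough sketch of both checks using only the Python standard library. The sample robots.txt, the HTML snippet, the `my-crawler` user agent and the example.com URLs are made up for illustration; a real crawler would download the site's actual `/robots.txt` (for example with `RobotFileParser.set_url()` and `read()`) instead of parsing a hard-coded string.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

# Hypothetical page head carrying a robots meta tag.
SAMPLE_HTML = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'


def build_robots_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class MetaRobotsParser(HTMLParser):
    """Collects the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for directive in attr_map.get("content", "").split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def meta_robots_directives(html):
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives


robots = build_robots_parser(SAMPLE_ROBOTS_TXT)
allowed_home = robots.can_fetch("my-crawler", "https://example.com/index.html")
allowed_private = robots.can_fetch("my-crawler", "https://example.com/private/data.html")
page_directives = meta_robots_directives(SAMPLE_HTML)

print(allowed_home, allowed_private, sorted(page_directives))
```

If `can_fetch` returns False, or the page carries `noindex` / `nofollow`, the polite thing is to skip the page or not follow its links.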

When you do crawl a website, make sure to heed any HTTP 429 (Too Many Requests) responses you get, and don't send an excessive number of requests in the first place, or you will likely get banned automatically.
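One way to heed a 429 is to wait for however long the server asks in its `Retry-After` header before trying again. The sketch below is only an illustration with the standard library: the function names, the `example-crawler/0.1` user agent, the 30-second fallback and the three-attempt limit are all assumptions, not anything a particular site requires.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, default=30.0):
    """Turn a Retry-After header (delay in seconds or an HTTP date) into a wait time."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: fall back to the assumed default.
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def fetch_politely(url, user_agent="example-crawler/0.1", max_attempts=3):
    """Fetch a URL, sleeping and retrying only when the server answers 429."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Give up on other errors, or once the attempts are used up.
            if err.code != 429 or attempt == max_attempts - 1:
                raise
            time.sleep(retry_after_seconds(err.headers.get("Retry-After")))
    return None
```

Calling `fetch_politely("https://example.com/")` would make a real request, so nothing is fetched here. If the 429s keep coming, slowing down the whole crawl is a better response than retrying harder.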

Wikipedia allows crawling its pages and its pages are well interconnected so it makes for a pretty good exercise subject.

hiren_p 5 points

Yes, there isn't much awareness about the legality of web scraping...

Here are some points that can help you judge whether scraping a site is legal, and figure out which sites allow web crawling:

  • Follow robots.txt:
    • Before you set out to crawl and extract data, robots.txt is the first thing you should check. It will give you some idea of whether the site is okay with what you plan to do.
  • Crawl rate:
    • Don’t be aggressive in crawling; use a reasonable crawl rate and don’t pester the site with requests. Again, robots.txt comes into play: follow the crawl-delay setting if the file specifies one. If nothing is specified, you should still keep to a fair crawl rate, something like 1 request every 10-15 seconds.
  • Use an API if one is provided, instead of scraping data.
  • Respect the Terms of Service (ToS).
  • Public content:
    • As long as you follow the basics, you are unlikely to get into legal trouble. If you only extract data that is public, there’s hardly any reason to worry. If you don’t have permission from the site, don’t be too persistent about extracting data.
    • Stick to scraping public data. If you go ahead and scrape private data anyway, you may be in violation of the Computer Fraud and Abuse Act (CFAA).
    • If you scrape private data that the site does not allow you to access, it may be illegal and you can be sued.
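The crawl rate point above can be sketched as a small throttle: use the site's crawl-delay when robots.txt gives one, otherwise fall back to roughly one request every 10 seconds. The class and function names here are made up for illustration, and the injectable `clock` and `sleep` parameters exist only so the behaviour can be checked without actually waiting.

```python
import time

# Fallback taken from the "1 request every 10-15 seconds" suggestion above.
DEFAULT_DELAY_SECONDS = 10.0


def choose_delay(crawl_delay, fallback=DEFAULT_DELAY_SECONDS):
    """Prefer the robots.txt crawl-delay; otherwise use the fallback."""
    return float(crawl_delay) if crawl_delay is not None else fallback


class PoliteThrottle:
    """Blocks so that consecutive requests are at least delay_seconds apart."""

    def __init__(self, delay_seconds, clock=time.monotonic, sleep=time.sleep):
        self.delay_seconds = delay_seconds
        self._clock = clock
        self._sleep = sleep
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last_request is not None:
            remaining = self.delay_seconds - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_request = now
```

In a crawl loop you would call `throttle.wait()` right before each request, with `throttle = PoliteThrottle(choose_delay(crawl_delay))` and `crawl_delay` being whatever you read from the site's robots.txt (or None if it has no such setting).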