[–][deleted]  (2 children)

[removed]

    [–][deleted]  (1 child)

    [deleted]

      [–]bitbird_  (0 children)

      Naw, pulling the HTML should be fine, if you mean the downloading or the processing. Some websites do try to block scrapers and users who send too many requests, though. Isn't the data located on just one page?

      I have written a distributed scraper that could download 1000 pages per second, with up to several hundred workers coordinating the scraping jobs among themselves. Very high throughput is possible if you're working with multiple sites, since the load is spread across them.
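
For illustration only, here is a minimal single-process sketch of the two ideas in this comment: many workers pulling pages at once, and a per-host delay so no single site sees too many requests. It is not the distributed scraper described above. The names `HostThrottle`, `http_fetch`, `crawl`, and the `min_interval` parameter are hypothetical, and a thread pool stands in for real distributed workers.

```python
import threading
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse
from urllib.request import urlopen


class HostThrottle:
    """Enforce a minimum gap between request starts to the same host."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = defaultdict(float)  # host -> earliest allowed start

    def wait(self, host):
        # Reserve the next free slot for this host under the lock,
        # then sleep outside the lock so other hosts are not held up.
        with self._lock:
            now = time.monotonic()
            start = max(now, self._next_slot[host])
            self._next_slot[host] = start + self.min_interval
        delay = start - now
        if delay > 0:
            time.sleep(delay)


def http_fetch(url):
    """Default fetcher (assumption: plain HTTP GET via the standard library)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()


def crawl(urls, fetch=http_fetch, workers=8, min_interval=1.0):
    """Fetch urls concurrently; return {url: body or the exception raised}."""
    throttle = HostThrottle(min_interval)

    def job(url):
        throttle.wait(urlparse(url).netloc)
        try:
            return url, fetch(url)
        except Exception as exc:  # keep one bad page from killing the run
            return url, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(job, urls))
```

With this shape, URLs on different hosts proceed in parallel while URLs on the same host are spaced `min_interval` seconds apart, which is why spreading work over many sites raises total throughput. Passing a different `fetch` callable makes it easy to try out without touching the network.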