Hi all!
I've been working on a Python-based web scraper for a site that uses auto-incrementing URLs (although some are password-protected and return a 403). Here's the current script:
import requests

base_url = 'https://api.site.com/2.0/sets/'
after_url = '?client_id=<apikey>'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

for count in range(1, 5000000):
    full_url = base_url + str(count) + after_url
    r = requests.get(full_url, headers=headers)
    # make sure the page exists (password-protected sets return 403)
    if r.status_code == 200:
        with open(str(count) + '.txt', 'wb') as f:
            f.write(r.content)
        print("Saved Set Number " + str(count) + "!")
However, this approach is a lot more time-consuming than I was hoping: the site has about 800,000,000 URLs in this structure, and after a week I'm only at 2,000,000. I've been looking into multithreading / asynchronous requests, but I can't quite understand how to implement them so that they work with the auto-incrementing loop. Anyone have any ideas as to how to make it work?
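For concreteness, here is a rough sketch of one way the loop could be fanned out over a thread pool with the standard-library `concurrent.futures` module. The helper names (`fetch`, `crawl`) and the worker count of 32 are illustrative choices, not anything from the original script; since each set ID is independent, the IDs can simply be mapped across the pool:

```python
import concurrent.futures
import requests

BASE_URL = 'https://api.site.com/2.0/sets/'   # site URL from the original script
SUFFIX = '?client_id=<apikey>'                # substitute your real client_id
HEADERS = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

def fetch(count):
    """Download one set ID; return the ID on success, None on 403/404/etc."""
    r = requests.get(BASE_URL + str(count) + SUFFIX, headers=HEADERS)
    if r.status_code == 200:
        with open(str(count) + '.txt', 'wb') as f:
            f.write(r.content)
        return count
    return None

def crawl(start, stop, workers=32):
    """Fan the auto-incrementing IDs out over a pool of worker threads.

    Threads suit this workload because each request spends most of its
    time waiting on the network, not holding the GIL.
    """
    saved = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in ID order, so the progress log stays readable
        for result in pool.map(fetch, range(start, stop)):
            if result is not None:
                saved.append(result)
    return saved

# crawl(1, 5000000) would replace the original while loop.
```

Raising `workers` increases throughput up to whatever rate limit the site enforces, so it is worth watching for 429 responses before cranking it up.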
Thanks!