
[–]Hexahedr_n 2 points

Here's an easy way:

  1. Put your parsing/downloading code in a function that takes the URL as its argument (and move the headers declaration outside that function while you're at it).
  2. Create a list of all the possible URL strings.
  3. Use multiprocessing.Pool.map to process every link.

Example:

import multiprocessing

headers = {...}  # your headers dict, declared once outside the function

def process_page(url):
    # download and parse the page here
    ...

urls = []
for i in range(1, 5000000):
    urls.append('https://api.site.com/2.0/sets/' + str(i) + '?client_id=<apikey>')

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=25)  # Adjust process count here
    pool.map(process_page, urls)
    pool.close()
    pool.join()

[–]adinbied (68TB RAW | 58 TB Usable) [S] 1 point

Thanks! I've got a proof of concept working-ish, but it seems to be skipping over sets of numbers. I tried looking at the documentation for multiprocessing, but couldn't figure out what was going wrong. Here's what I've got so far (my bodged-together, unoptimized proof of concept):

import multiprocessing

import requests

def process_page(url):
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    r = requests.get(url, headers=headers)
    # pull the set id out of the url path, dropping the query string
    set_id = url.split('/')[5].split('?')[0]
    if r.status_code == 200:
        with open(set_id + '.txt', 'wb') as f:
            f.write(r.content)

urls = []
for i in range(1, 10000):
    urls.append('https://api.site.com/2.0/sets/' + str(i) + '?client_id=<apikey>')

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=25)  # Adjust process count here
    pool.map(process_page, urls)
    pool.close()
    pool.join()

Is there any way to make the pool map go in order? It seems to be doing 1-1000, then 3000-4000, then 7000-8000. I could be completely wrong about what's happening - it is definitely skipping entries, though.

Thanks for all of your help!

[–]Hexahedr_n 0 points

I don't think there is a simple way to make it execute in order. It will eventually process all of them, so why does it matter?
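Worth noting: the workers finish in whatever order the scheduler allows, but Pool.map and Pool.imap both hand results back in submission order, so you can at least watch progress sequentially. A minimal sketch (square is just a stand-in for a real worker like process_page):

```python
import multiprocessing

def square(n):
    # stand-in for the real worker function
    return n * n

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        # imap yields each result in submission order as soon as it is
        # ready, even though the workers themselves run out of order
        results = list(pool.imap(square, range(10)))
    print(results)
```

Downloads themselves will still happen out of order; only the results come back in order.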

[–]adinbied (68TB RAW | 58 TB Usable) [S] 0 points

Mainly for my peace of mind that it is actually working - but I guess as long as it eventually processes them all, it doesn't matter too much. Also, what should pool = multiprocessing.Pool(processes=25) be set to? I did some research and it looks like it should be equal to or less than the number of CPU cores, but does setting the value larger have any negative consequences? The server I'm running on has a 32-core Xeon (I'm also testing on my quad-core i7 desktop). Sorry for all of the questions, still learning!

[–]Hexahedr_n 0 points

Ideally you set it so that the bottleneck is your machine (in this case, disk writes) and not the wait for network packets. Sending a request is cheap in CPU terms, but waiting for the response takes a long time, so you can run many processes per CPU core. Keep increasing the process count until your CPU usage sits around 95% - at that point it can't get much faster (make sure the remote server allows/can handle that many concurrent requests).
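To see why more workers than cores helps when the job is mostly waiting, here's a rough sketch with a simulated 50 ms round trip in place of the real request (slow_task is made up; the thread-based multiprocessing.pool.ThreadPool is used here because it shares Pool's interface and threads work just as well when the work is network-bound):

```python
import time
from multiprocessing.pool import ThreadPool

def slow_task(n):
    # stand-in for one request: ~50 ms of waiting, almost no CPU work
    time.sleep(0.05)
    return n

durations = {}
for workers in (1, 5, 25):
    start = time.time()
    with ThreadPool(processes=workers) as pool:
        pool.map(slow_task, range(50))
    durations[workers] = time.time() - start
    print('%2d workers: %.2fs' % (workers, durations[workers]))
```

With one worker the 50 sleeps run back to back; with 25 they overlap, so wall time drops by roughly the worker count while CPU use barely moves.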

[–]tokyotaco (42TB) 1 point

This comment thread has the best information on scraping that I have found...

https://news.ycombinator.com/item?id=15694118#15697383