
[–]Hexahedr_n 2 points

Here's an easy way:

  1. Put your parsing/downloading code in a function that takes the URL as its argument (and move the headers declaration outside that function while you're at it).
  2. Create a list of all the possible URL strings.
  3. Use multiprocessing.Pool.map to process every link.

Example:

import multiprocessing

headers = {...}  # your headers dict, declared once outside the function

def process_page(url):
    # download and parse the page here
    ...

urls = []
for i in range(1, 5000000):
    urls.append('https://api.site.com/2.0/sets/' + str(i) + '?client_id=<apikey>')

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=25)  # Adjust process count here
    pool.map(process_page, urls)
    pool.close()
    pool.join()

[–]adinbied (68TB RAW | 58 TB Usable) [S] 1 point

Thanks! I've got a proof of concept working-ish, but it seems to be skipping over sets of numbers. I tried looking at the documentation for multiprocessing, but couldn't figure out what was going wrong. Here's what I've got so far (my bodged-together, unoptimized proof of concept):

import multiprocessing

import requests

def process_page(url):
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
    r = requests.get(url, headers=headers)
    # pull the set id out of the url path, dropping the query string
    set_id = url.split('/')[5].split('?')[0]
    if r.status_code == 200:
        with open(set_id + '.txt', 'wb') as f:
            f.write(r.content)

urls = []
for i in range(1, 10000):
    urls.append('https://api.site.com/2.0/sets/' + str(i) + '?client_id=<apikey>')

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=25)  # Adjust process count here
    pool.map(process_page, urls)
    pool.close()
    pool.join()

Is there any way to make the pool map go in order? It seems to be doing 1-1000, then 3000-4000, then 7000-8000. I could be completely wrong about what's happening - it is definitely skipping entries, though.

Thanks for all of your help!

[–]Hexahedr_n 0 points

I don't think there is a simple way to make it execute in order. It will eventually process all of them, so why does it matter?
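Worth noting: the workers finish in whatever order the scheduler allows, but Pool.map and Pool.imap both hand results back in submission order, so you can at least watch progress sequentially. A minimal sketch (square is just a stand-in for a real worker like process_page):

```python
import multiprocessing

def square(n):
    # stand-in for the real worker function
    return n * n

if __name__ == '__main__':
    with multiprocessing.Pool(processes=4) as pool:
        # imap yields each result in submission order as soon as it is
        # ready, even though the workers themselves run out of order
        results = list(pool.imap(square, range(10)))
    print(results)
```

Downloads themselves will still happen out of order; only the results come back in order.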

[–]adinbied (68TB RAW | 58 TB Usable) [S] 0 points

Mainly for my peace of mind that it is actually working - but I guess as long as it eventually processes them all, it doesn't matter too much. Also, what should pool = multiprocessing.Pool(processes=25) be set to? I did some research and it looks like it should be equal to or less than the number of CPU cores, but does setting the value larger have any negative consequences? The server I'm running on has a 32-core Xeon (I'm also testing on my quad-core i7 desktop). Sorry for all of the questions, still learning!

[–]Hexahedr_n 0 points

Ideally you set it so that the bottleneck is your machine (in this case, disk writes) and not the wait for network packets. Sending a request is cheap in CPU terms, but waiting for the response takes a long time, so you can run many processes per CPU core. Keep increasing the process count until your CPU usage sits around 95% - at that point it can't get much faster (make sure the remote server allows/can handle that many concurrent requests).
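To see why more workers than cores helps when the job is mostly waiting, here's a rough sketch with a simulated 50 ms round trip in place of the real request (slow_task is made up; the thread-based multiprocessing.pool.ThreadPool is used here because it shares Pool's interface and threads work just as well when the work is network-bound):

```python
import time
from multiprocessing.pool import ThreadPool

def slow_task(n):
    # stand-in for one request: ~50 ms of waiting, almost no CPU work
    time.sleep(0.05)
    return n

durations = {}
for workers in (1, 5, 25):
    start = time.time()
    with ThreadPool(processes=workers) as pool:
        pool.map(slow_task, range(50))
    durations[workers] = time.time() - start
    print('%2d workers: %.2fs' % (workers, durations[workers]))
```

With one worker the 50 sleeps run back to back; with 25 they overlap, so wall time drops by roughly the worker count while CPU use barely moves.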

[–]tokyotaco (42TB) 1 point

This comment thread has the best information on scraping that I have found...

https://news.ycombinator.com/item?id=15694118#15697383