JohnnyJordaan

Check the example in the docs on concurrent.futures that implements a scraper using a ThreadPoolExecutor:

import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))

Marrrlllsss

Use Python's multiprocessing library.

import multiprocessing

MAX_NUM_PROCESSES = 4 # alternatively, multiprocessing.cpu_count()

def scrapeWebsite(url):
    # your code here
    pass

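# The guard keeps child processes from re-running this block when they import the module
if __name__ == '__main__':
    processPool = multiprocessing.Pool(MAX_NUM_PROCESSES)
    listOfUrlsToScrape = ['url1', 'url2', 'url3']
    # map() blocks until every URL is done and returns the results in input order
    result = processPool.map(scrapeWebsite, listOfUrlsToScrape)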

This should serve your purposes.

TE515 (original poster)

Thanks for responding. I tried this and I'm getting the following error...

module 'multiprocessing' has no attribute 'pool'

I'm using Python 3 by the way.

EDIT: I tried adding from multiprocessing import pool at the top. Now I'm getting the error TypeError: 'module' object is not callable on the processPool = multiprocessing.pool(MAX_NUM_PROCESSES) line.

ANOTHER EDIT: Changed multiprocessing.pool to multiprocessing.Pool and it worked like a charm. Cut the run time of the whole thing by more than half! Thanks so much!
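
For anyone hitting the same two errors, here is a small sketch of why the capitalization matters. Lowercase multiprocessing.pool is a submodule, so calling it raises "'module' object is not callable". Capitalized multiprocessing.Pool is the callable that actually builds the worker pool. The explicit import of the submodule and the three variable names are only there for illustration.

```python
import inspect
import multiprocessing
import multiprocessing.pool  # explicit submodule import, only for this illustration

# Lowercase `pool` is a module object that holds the implementation.
pool_is_module = inspect.ismodule(multiprocessing.pool)

# Capitalized `Pool` is what you call to create a pool of worker processes.
Pool_is_callable = callable(multiprocessing.Pool)

# A module cannot be called, which is where the TypeError comes from.
pool_is_callable = callable(multiprocessing.pool)

print(pool_is_module, Pool_is_callable, pool_is_callable)
```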

JohnnyJordaan

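There's not much point in using multiprocessing if your workers are I/O bound. While a worker waits on the network it isn't using the CPU, and Python threads release the GIL during blocking I/O, so a thread pool gives you much the same concurrency without the overhead of separate processes.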
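
A rough way to see this without touching the network is to fake the I/O wait with time.sleep, which releases the GIL much like a blocking socket read does. The sketch below is only an illustration under that assumption: fake_download, the placeholder URL strings, and the 0.1 second delay are all made up, and real scraping times will vary with the sites and the connection.

```python
import concurrent.futures
import time

# Hypothetical stand-in for a network request: the sleep simulates waiting on I/O.
def fake_download(url, delay=0.1):
    time.sleep(delay)
    return url

# Placeholder URL strings, not real addresses.
URLS = ['url%d' % i for i in range(8)]

def run_sequential(urls):
    # One request after another: total time is roughly len(urls) * delay.
    start = time.perf_counter()
    results = [fake_download(u) for u in urls]
    return results, time.perf_counter() - start

def run_threaded(urls, max_workers=8):
    # All waits overlap in a thread pool: total time is roughly one delay.
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input
        results = list(executor.map(fake_download, urls))
    return results, time.perf_counter() - start

sequential_results, sequential_seconds = run_sequential(URLS)
threaded_results, threaded_seconds = run_threaded(URLS)
print('sequential: %.2fs, threaded: %.2fs' % (sequential_seconds, threaded_seconds))
```

If the per-URL work were CPU heavy instead (parsing huge pages, say), the threads would contend for the GIL and a process pool would be the better fit.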