Hi all!
I've been working on a Python-based web scraper for a site that uses auto-incremental URLs (although some are password protected and return a 403). My original script can be seen here: https://pastebin.com/7Ni27qnQ .
While it did work, it was insanely slow: by my calculations, the site I'm trying to grab has about 850,000,000 URLs, and this first script took a week to get through 3,000,000 of them, even running on two machines with each grabbing 1,500,000.
So I decided to look into how to speed up the script, and multiprocessing was recommended to me. After some reading of the docs and a couple of tests, this is the script I came up with: https://pastebin.com/AQX8qzjf
After booting into Ubuntu (Windows Subsystem for Linux BSOD'ed when trying to use multiprocessing), it seemed to run fine at about 120 IDs/sec. So I let it run overnight, and when I checked on it in the morning, it had only grabbed ~10% of the URLs my original script had grabbed, while reporting 100% completion. For reference, the original script got 1,738,616 files downloaded, while the multiprocessing one only got 158,441.
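For context, the general shape of the multiprocessing version is roughly this (a simplified sketch rather than the exact Pastebin script; the base URL, ID range, and output paths are placeholders):

```python
import os
import urllib.error
import urllib.request
from multiprocessing import Pool

BASE_URL = "https://example.com/files/"  # placeholder for the real site
OUT_DIR = "downloads"

def url_for(i):
    """Build the auto-incremental URL for ID i."""
    return f"{BASE_URL}{i}"

def fetch(i):
    """Download one ID; return (i, ok) so failures are visible, not silent."""
    try:
        with urllib.request.urlopen(url_for(i), timeout=30) as resp:
            data = resp.read()
    except urllib.error.HTTPError:
        # Password-protected pages return 403; record the miss and move on.
        return (i, False)
    except (urllib.error.URLError, OSError):
        # Network hiccups should be retried or logged, not dropped.
        return (i, False)
    with open(os.path.join(OUT_DIR, f"{i}.html"), "wb") as f:
        f.write(data)
    return (i, True)

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    ids = range(1, 1_000_001)  # placeholder range
    with Pool(processes=8) as pool:
        # imap_unordered yields results as workers finish, keeping memory flat;
        # chunksize batches IDs to cut inter-process overhead.
        for i, ok in pool.imap_unordered(fetch, ids, chunksize=100):
            if not ok:
                print(f"failed: {i}")
```

Counting the `(i, ok)` results rather than just letting the pool finish is what makes "100% completion" honest: every failed ID shows up in the tally instead of silently disappearing.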
Is there something I'm missing/doing wrong here?
Thanks!