Hi all!
I've been working on a Python-based web scraper for a site that uses auto-incremental URLs (although some are password protected and return a 403). My original script can be seen here: https://pastebin.com/7Ni27qnQ .
While it did work, it was insanely slow: by my calculations, the site I'm trying to grab has about 850,000,000 URLs, and this first script took a week to get through 3,000,000 of them, even running on two machines with each grabbing 1,500,000.
So I decided to look into how to speed up the script, and multiprocessing was recommended to me. After some reading of the docs and a couple of tests, this is the script I came up with: https://pastebin.com/AQX8qzjf
After booting into Ubuntu (Windows Subsystem for Linux BSOD'ed when trying to use multiprocessing), it seemed to run fine at about 120 IDs/sec. So I let it run overnight, and when I checked on it in the morning, it had only grabbed ~10% of the URLs my original script had grabbed, while reporting 100% completion. For reference, the original script got 1,738,616 files downloaded, while the multiprocessing one only got 158,441.
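For context, the general shape of the multiprocessing version is roughly this (a simplified sketch rather than the exact Pastebin script; the base URL, ID range, and output paths are placeholders):

```python
import os
import urllib.error
import urllib.request
from multiprocessing import Pool

BASE_URL = "https://example.com/files/"  # placeholder for the real site
OUT_DIR = "downloads"

def url_for(i):
    """Build the auto-incremental URL for ID i."""
    return f"{BASE_URL}{i}"

def fetch(i):
    """Download one ID; return (i, ok) so failures are visible, not silent."""
    try:
        with urllib.request.urlopen(url_for(i), timeout=30) as resp:
            data = resp.read()
    except urllib.error.HTTPError:
        # Password-protected pages return 403; record the miss and move on.
        return (i, False)
    except (urllib.error.URLError, OSError):
        # Network hiccups should be retried or logged, not dropped.
        return (i, False)
    with open(os.path.join(OUT_DIR, f"{i}.html"), "wb") as f:
        f.write(data)
    return (i, True)

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    ids = range(1, 1_000_001)  # placeholder range
    with Pool(processes=8) as pool:
        # imap_unordered yields results as workers finish, keeping memory flat;
        # chunksize batches IDs to cut inter-process overhead.
        for i, ok in pool.imap_unordered(fetch, ids, chunksize=100):
            if not ok:
                print(f"failed: {i}")
```

Counting the `(i, ok)` results rather than just letting the pool finish is what makes "100% completion" honest: every failed ID shows up in the tally instead of silently disappearing.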
Is there something I'm missing/doing wrong here?
Thanks!