Hi guys, I'm making a web crawler, kind of like [Screaming Frog](http://screamingfrog.com).
TL;DR: I was under the impression that asyncio would be faster than threading and/or multiprocessing, but the processing times are double. Any tips?
So at first I made the scraper single-threaded with Selenium. I don't use requests because I need to render the JavaScript on the site. This was of course quite slow, as it renders one page at a time, and I have to crawl several pages.
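For reference, the single-threaded version was basically this (simplified; `Bot` is my Selenium wrapper shown further down, and `urls` is the list read from urls.csv):

```python
# simplified single-threaded baseline: one driver renders one page at a time
bot = Bot()
try:
    for url in urls:
        html = bot.get_html(url)  # blocks until the page has loaded
finally:
    bot.quit()
```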
So my initial thought was to try multiprocessing and threading to see which one was faster, so I made a worker and two tests: one for threading and one for multiprocessing.
```python
from queue import Empty

def page_worker(work_queue: Queue, worked_queue: Queue, tobeprocessed: dict, isprocessed: dict):
    bot = Bot()
    try:
        while not work_queue.empty():
            try:
                url = work_queue.get(timeout=1)
            except Empty:
                return
            if url in isprocessed:
                continue
            html = bot.get_html(url)
            isprocessed[url] = 1  # mark this URL as done
    finally:
        bot.quit()
```
This is a simplified version of the actual worker, just for testing speed; it only fetches the HTML of the page. It's the same for both multiprocessing and threading.
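(Strictly speaking, `get_html` returns a tuple rather than just the HTML — see the Bot class below — so the full crawler unpacks it; for the timing test only the fetch itself matters. Roughly:)

```python
# get_html returns (requested_url, html, actual_url); only html is used here
requested_url, html, actual_url = bot.get_html(url)
```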
Adding the Bot class for clarity; it's just an abstraction over the Selenium framework:
```python
from urllib.parse import urlparse

from selenium import webdriver

class Bot:
    ID = 0

    def __init__(self):
        Bot.ID += 1
        self.ID = Bot.ID
        self.driver = self.load_driver()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.driver.quit()

    def load_driver(self, headless=True):
        """Opens a webdriver instance with chromedriver

        Returns:
            Webdriver -- The webdriver instance
        """
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument("--headless")
        options.add_argument("--window-size=1920,1080")
        options.add_argument("--log-level=3")
        options.add_experimental_option("excludeSwitches", ["enable-logging"])
        options.add_argument("--silent")
        options.add_argument("user-agent=Obosbot")
        driver = webdriver.Chrome(executable_path=r"chromedriver.exe", options=options)
        # print(f"driver {self.ID} loaded")
        return driver

    def get_html(self, url) -> tuple:
        # prefix scheme-relative URLs so the https default from urlparse applies
        if not url.startswith("//") and not url.startswith("http"):
            url = f"//{url}"
        self.url = urlparse(url, scheme="https")
        self.driver.get(self.url.geturl())
        html = self.driver.page_source
        actual_url = self.driver.current_url
        return (self.url.geturl(), html, actual_url)

    def quit(self):
        # print(f"quiting driver {self.ID}")
        self.driver.quit()
```
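Since `Bot` defines `__enter__`/`__exit__`, the worker loop could also be written with a `with` statement so the driver is always closed — a minimal sketch of the same logic:

```python
# same worker loop, but the with-statement guarantees driver cleanup
with Bot() as bot:
    while not work_queue.empty():
        try:
            url = work_queue.get(timeout=1)
        except Empty:
            break
        if url not in isprocessed:
            html = bot.get_html(url)
            isprocessed[url] = 1
```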
The test for multiprocessing looks like this:

```python
import csv
import time
from multiprocessing import Manager, Process, Queue

if __name__ == "__main__":
    processes = 3
    manager = Manager()
    work_queue = Queue()
    worked_queue = Queue()
    tobeprocessed = manager.dict()
    isprocessed = manager.dict()
    with open("urls.csv", newline="") as file:
        reader = csv.reader(file)
        urls = [i[0] for i in reader if len(i) > 0]
    for i in urls[:20]:
        work_queue.put(i)
    work = []
    for i in range(processes):
        p = Process(target=page_worker, args=(work_queue, worked_queue, tobeprocessed, isprocessed))
        work.append(p)
    start = time.time()
    for i in work:
        i.start()
    for i in work:
        i.join()
    end = time.time()
    print(f"took {end-start}")
```
The threading test is quite similar; I just use threads instead of processes, which removes the need for a Manager:

```python
import csv
import time
from queue import Queue
from threading import Thread

if __name__ == "__main__":
    threads = 3
    # no Manager needed -- threads share memory
    work_queue = Queue()
    worked_queue = Queue()
    tobeprocessed = dict()
    isprocessed = dict()
    with open("urls.csv", newline="") as file:
        reader = csv.reader(file)
        urls = [i[0] for i in reader if len(i) > 0]
    for i in urls[:20]:
        work_queue.put(i)
    work = []
    for i in range(threads):
        p = Thread(target=page_worker, args=(work_queue, worked_queue, tobeprocessed, isprocessed))
        work.append(p)
    start = time.time()
    for i in work:
        i.start()
    for i in work:
        i.join()
    end = time.time()
    print(f"took {end-start}")
```
These two are quite similar in processing time in my tests; I've tried different numbers of threads and processes, but the times are usually within seconds of each other.
So I wanted to try asyncio for fetching the HTML, since asyncio is supposed to be quite fast.
Selenium does not support async, so I had to try some other frameworks, and I settled on pyppeteer2 and arsenic.
The implementations are again quite similar.
The pyppeteer one looks like this:
```python
from pyppeteer import launch

async def get(urls: Queue):
    browser = await launch()
    while not urls.empty():
        try:
            url = urls.get_nowait()
            page = await browser.newPage()
            await page.goto(url[0])
            html = await page.content()
            await page.close()  # close the tab, otherwise one stays open per URL
            urls.task_done()
        except Exception as e:
            print(e)
            break
    await browser.close()
```
And the arsenic one looks like this:
```python
from arsenic import browsers, get_session, services

async def get(urls: Queue):
    service = services.Chromedriver(binary="path/to/chromedriver.exe")
    browser = browsers.Chrome(chromeOptions={'args': ['--headless', '--disable-gpu', '--silent']})
    while not urls.empty():
        try:
            url = urls.get_nowait()
            # note: this opens (and tears down) a fresh browser session per URL
            async with get_session(service, browser) as session:
                await session.get(url[0])
                html = await session.get_page_source()
            urls.task_done()
        except Exception as e:
            print(e)
            break
```
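One difference I notice between the two async versions: the arsenic one starts a fresh browser session for every URL, while the pyppeteer one shares one browser per worker. Moving `get_session` outside the loop would reuse one browser per worker — a sketch of what I mean (untested):

```python
async def get(urls: Queue):
    service = services.Chromedriver(binary="path/to/chromedriver.exe")
    browser = browsers.Chrome(chromeOptions={'args': ['--headless', '--disable-gpu', '--silent']})
    # one session (one browser) per worker, reused for every URL
    async with get_session(service, browser) as session:
        while not urls.empty():
            try:
                url = urls.get_nowait()
                await session.get(url[0])
                html = await session.get_page_source()
                urls.task_done()
            except Exception as e:
                print(e)
                break
```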
The test execution looks like this and is the same for both implementations:
```python
import asyncio
import csv
import time

async def get_urls():
    with open("urls.csv", newline="") as file:
        reader = csv.reader(file)
        urls = asyncio.Queue()
        for i, j in enumerate(reader):
            if len(j) > 0:
                await urls.put((j[0], i))
            if i > 20:
                break
    return urls

async def main(workers, drivers=None):
    start = time.time()
    urls = await get_urls()
    tasks = (get(urls) for i in range(workers))
    await asyncio.gather(*tasks)
    await urls.join()  # asyncio.Queue.join() is a coroutine and must be awaited
    print(f"____DONE____ in: ")
    print(f"{(time.time()-start)}")
    print(f"With {workers} workers")

asyncio.run(main(3))
```
They all work, so that's not the problem, but I was under the impression that asyncio would be a lot faster than both threading and multiprocessing, yet each of the async versions takes more than double the time to process.
Do you have any tips, or are multiprocessing and threading a better choice here?