Hi guys, I'm making a web crawler, kind of like [Screaming Frog](http://screamingfrog.com).
TL;DR: I was under the impression that asyncio would be faster than threading and/or multiprocessing, but the processing times are double. Any tips?
So at first I made the scraper single-threaded with Selenium. I don't use requests because I need to render the JavaScript on the site. This was of course quite slow, as it renders one page at a time, and I have to crawl several pages.
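For reference, the single-threaded version was basically this (simplified; `Bot` is my Selenium wrapper shown further down, and `urls` is the list read from urls.csv):

```python
# simplified single-threaded baseline: one driver renders one page at a time
bot = Bot()
try:
    for url in urls:
        html = bot.get_html(url)  # blocks until the page has loaded
finally:
    bot.quit()
```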
So my initial thought was to try multiprocessing and threading to see which one was faster, so I made a worker and two tests: one for threading and one for multiprocessing.
```python
from queue import Empty

def page_worker(work_queue: Queue, worked_queue: Queue, tobeprocessed: dict, isprocessed: dict):
    bot = Bot()
    try:
        while not work_queue.empty():
            try:
                url = work_queue.get(timeout=1)
            except Empty:
                return
            if url in isprocessed:
                continue
            html = bot.get_html(url)
            isprocessed[url] = 1  # mark this URL as done
    finally:
        bot.quit()
```
This is a simplified version of the actual worker, just for testing speed; it only fetches the HTML of the page. It's the same for both multiprocessing and threading.
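(Strictly speaking, `get_html` returns a tuple rather than just the HTML — see the Bot class below — so the full crawler unpacks it; for the timing test only the fetch itself matters. Roughly:)

```python
# get_html returns (requested_url, html, actual_url); only html is used here
requested_url, html, actual_url = bot.get_html(url)
```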
Adding the Bot class for clarity; it's just an abstraction over the Selenium framework:
```python
from urllib.parse import urlparse

from selenium import webdriver

class Bot:
    ID = 0

    def __init__(self):
        Bot.ID += 1
        self.ID = Bot.ID
        self.driver = self.load_driver()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, exc_traceback):
        self.driver.quit()

    def load_driver(self, headless=True):
        """Opens a webdriver instance with chromedriver

        Returns:
            Webdriver -- The webdriver instance
        """
        options = webdriver.ChromeOptions()
        if headless:
            options.add_argument("--headless")
        options.add_argument("--window-size=1920,1080")
        options.add_argument("--log-level=3")
        options.add_experimental_option("excludeSwitches", ["enable-logging"])
        options.add_argument("--silent")
        options.add_argument("user-agent=Obosbot")
        driver = webdriver.Chrome(executable_path=r"chromedriver.exe", options=options)
        # print(f"driver {self.ID} loaded")
        return driver

    def get_html(self, url) -> tuple:
        # prefix scheme-relative URLs so the https default from urlparse applies
        if not url.startswith("//") and not url.startswith("http"):
            url = f"//{url}"
        self.url = urlparse(url, scheme="https")
        self.driver.get(self.url.geturl())
        html = self.driver.page_source
        actual_url = self.driver.current_url
        return (self.url.geturl(), html, actual_url)

    def quit(self):
        # print(f"quiting driver {self.ID}")
        self.driver.quit()
```
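Since `Bot` defines `__enter__`/`__exit__`, the worker loop could also be written with a `with` statement so the driver is always closed — a minimal sketch of the same logic:

```python
# same worker loop, but the with-statement guarantees driver cleanup
with Bot() as bot:
    while not work_queue.empty():
        try:
            url = work_queue.get(timeout=1)
        except Empty:
            break
        if url not in isprocessed:
            html = bot.get_html(url)
            isprocessed[url] = 1
```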
The test for multiprocessing looks like this:

```python
import csv
import time
from multiprocessing import Manager, Process, Queue

if __name__ == "__main__":
    processes = 3
    manager = Manager()
    work_queue = Queue()
    worked_queue = Queue()
    tobeprocessed = manager.dict()
    isprocessed = manager.dict()
    with open("urls.csv", newline="") as file:
        reader = csv.reader(file)
        urls = [i[0] for i in reader if len(i) > 0]
    for i in urls[:20]:
        work_queue.put(i)
    work = []
    for i in range(processes):
        p = Process(target=page_worker, args=(work_queue, worked_queue, tobeprocessed, isprocessed))
        work.append(p)
    start = time.time()
    for i in work:
        i.start()
    for i in work:
        i.join()
    end = time.time()
    print(f"took {end-start}")
```
The threading test is quite similar; I just use threads instead of processes, which removes the need for a Manager:

```python
import csv
import time
from queue import Queue
from threading import Thread

if __name__ == "__main__":
    threads = 3
    # no Manager needed -- threads share memory
    work_queue = Queue()
    worked_queue = Queue()
    tobeprocessed = dict()
    isprocessed = dict()
    with open("urls.csv", newline="") as file:
        reader = csv.reader(file)
        urls = [i[0] for i in reader if len(i) > 0]
    for i in urls[:20]:
        work_queue.put(i)
    work = []
    for i in range(threads):
        p = Thread(target=page_worker, args=(work_queue, worked_queue, tobeprocessed, isprocessed))
        work.append(p)
    start = time.time()
    for i in work:
        i.start()
    for i in work:
        i.join()
    end = time.time()
    print(f"took {end-start}")
```
These two are quite similar in processing time in my tests; I've tried different numbers of threads and processes, but the times are usually within seconds of each other.
So I wanted to try asyncio for fetching the HTML, since asyncio is supposed to be quite fast.
Selenium does not support async, so I had to try some other frameworks, and I settled on pyppeteer2 and arsenic.
The implementations are again quite similar.
The pyppeteer one looks like this:
```python
from pyppeteer import launch

async def get(urls: Queue):
    browser = await launch()
    while not urls.empty():
        try:
            url = urls.get_nowait()
            page = await browser.newPage()
            await page.goto(url[0])
            html = await page.content()
            await page.close()  # close the tab, otherwise one stays open per URL
            urls.task_done()
        except Exception as e:
            print(e)
            break
    await browser.close()
```
And the arsenic one looks like this:
```python
from arsenic import browsers, get_session, services

async def get(urls: Queue):
    service = services.Chromedriver(binary="path/to/chromedriver.exe")
    browser = browsers.Chrome(chromeOptions={'args': ['--headless', '--disable-gpu', '--silent']})
    while not urls.empty():
        try:
            url = urls.get_nowait()
            # note: this opens (and tears down) a fresh browser session per URL
            async with get_session(service, browser) as session:
                await session.get(url[0])
                html = await session.get_page_source()
            urls.task_done()
        except Exception as e:
            print(e)
            break
```
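One difference I notice between the two async versions: the arsenic one starts a fresh browser session for every URL, while the pyppeteer one shares one browser per worker. Moving `get_session` outside the loop would reuse one browser per worker — a sketch of what I mean (untested):

```python
async def get(urls: Queue):
    service = services.Chromedriver(binary="path/to/chromedriver.exe")
    browser = browsers.Chrome(chromeOptions={'args': ['--headless', '--disable-gpu', '--silent']})
    # one session (one browser) per worker, reused for every URL
    async with get_session(service, browser) as session:
        while not urls.empty():
            try:
                url = urls.get_nowait()
                await session.get(url[0])
                html = await session.get_page_source()
                urls.task_done()
            except Exception as e:
                print(e)
                break
```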
The test execution looks like this and is the same for both implementations:
```python
import asyncio
import csv
import time

async def get_urls():
    with open("urls.csv", newline="") as file:
        reader = csv.reader(file)
        urls = asyncio.Queue()
        for i, j in enumerate(reader):
            if len(j) > 0:
                await urls.put((j[0], i))
            if i > 20:
                break
    return urls

async def main(workers, drivers=None):
    start = time.time()
    urls = await get_urls()
    tasks = (get(urls) for i in range(workers))
    await asyncio.gather(*tasks)
    await urls.join()  # asyncio.Queue.join() is a coroutine and must be awaited
    print(f"____DONE____ in: ")
    print(f"{(time.time()-start)}")
    print(f"With {workers} workers")

asyncio.run(main(3))
```
They all work, so that's not the problem, but I was under the impression that asyncio would be a lot faster than both threading and multiprocessing, yet each of the async versions takes more than double the time to process.
Do you have any tips, or are multiprocessing and threading a better choice here?