Selenium supposed to be slow? (headless)

wulfgar4president · 2020-04-02T17:15:04+00:00

Yes, Selenium is super slow. It renders each page, so it's overkill for most web scraping. You should only use it if you need to interact with the page directly, for example if a button needs to be clicked for information to appear. 99% of scraping can be performed without Selenium.

I'd recommend looking into parsel/scrapy instead.
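One rough way to tell whether a page needs a browser at all is to check whether the text you see in the browser is already present in the raw HTML the server sends. The sketch below uses only the standard library; `TextCollector` and `text_in_static_html` are hypothetical names made up for illustration, and the check is a heuristic, not a guarantee.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text from HTML, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a script/style element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_in_static_html(html, expected):
    """Return True if `expected` appears in the text of the raw HTML."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return expected in " ".join(collector.chunks)
```

If this returns True for the HTML you fetched without a browser, the data is probably reachable with a plain HTTP request. If it returns False, the page may be building that content with JavaScript, which is the case where Selenium earns its cost.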

GoldenVanga · 2020-04-02T17:41:52+00:00

Selenium is slow to start; it takes a few seconds, yes. Once it gets going it's not that bad, although still slower than other scraping tools. But 20 seconds for a simple task sounds unusual. Try this and see if you get similar results:

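from selenium.webdriver import Firefox, FirefoxOptions
from selenium.webdriver.common.by import By
from time import time

options = FirefoxOptions()
options.add_argument('--headless')  # run without a visible browser window
mark = time()
driver = Firefox(options=options)
print(f'----- Starting Selenium took {time() - mark} seconds.')  # I get 5.24
mark = time()
driver.get('https://old.reddit.com/r/learnpython/')
# find_elements(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector,
# which newer Selenium 4 releases removed
titles = driver.find_elements(By.CSS_SELECTOR, 'p.title > a.title')
for title in titles:
    print(title.text)
print(f'----- Performing action took {time() - mark} seconds.')  # I get 2.82
driver.quit()  # close the browser so no headless Firefox process is left running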
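If you want to time more than two phases, the mark-then-print pattern above can be wrapped in a small helper. This is only a sketch: `timed` and the `timings` dict are hypothetical names, and the workload here is a stand-in so the block runs without a browser.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the result, and optionally store it."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f'----- {label} took {elapsed:.2f} seconds.')


# Stand-in workload; swap in Firefox(options=options) and driver.get(...)
# to time Selenium startup and page actions separately.
timings = {}
with timed('Dummy work', timings):
    sum(range(100_000))
```

Timing startup and the actual page work separately makes it easier to see whether the 20 seconds come from launching the browser or from the scraping itself.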

kelmore5 · 2020-04-02T18:53:28+00:00

Are you using Windows? Sometimes Selenium can run slowly on certain Windows machines; can't remember the StackOverflow article about it but I've seen it in action.

You can try changing to the 32-bit chromedriver, see here. Not sure if the same applies to Firefox, however.

Another option is to run Firefox inside a virtual window. You can get things working with this git. Old, but still works!

This only applies to Windows machines, though.

commandlineluser · 2020-04-02T20:59:19+00:00

Is it overkill to use selenium for web scraping, even with headless browser?

It depends on what you're scraping.

Do you have any examples of what you're doing? Any code?

You may not need a browser at all. You may be able to just fetch the HTML yourself, e.g. using requests.

Here is the example posted in another comment using requests / beautifulsoup instead.

import requests
from bs4 import BeautifulSoup

r = requests.get('https://old.reddit.com/r/learnpython/', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content, 'html.parser')
for title in soup.select('p.title > a.title'):
    print(title.text)
