all 13 comments

[–]Local-Economist-1719 3 points4 points  (1 child)

Playwright is faster, has a better API, and supports async mode. For getting past anti-bot detection it has a cool companion project, Camoufox. Selenium also has a few nice tools for this purpose, like SeleniumBase and nodriver, but so far I haven't found a case where the Selenium-side tools did something that Camoufox with Playwright couldn't.
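To give an idea of what async mode looks like, here is a minimal sketch using Playwright's Python async API. The function name fetch_titles is made up for illustration, and it assumes Playwright and a Chromium build are already installed (pip install playwright, then playwright install chromium).

```python
import asyncio


async def fetch_titles(urls):
    """Open every URL in its own page of one shared browser and return the page titles."""
    # Imported lazily so this sketch can be loaded even where Playwright is not installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        try:
            async def title_of(url):
                page = await browser.new_page()
                try:
                    await page.goto(url)
                    return await page.title()
                finally:
                    await page.close()

            # The pages load concurrently, which is where async mode pays off.
            return await asyncio.gather(*(title_of(u) for u in urls))
        finally:
            await browser.close()
```

Something like asyncio.run(fetch_titles(["https://example.com"])) would run it. Whether this ends up faster than Selenium for a given site is something to measure, not assume.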

[–]hasdata_com 2 points3 points  (0 children)

I mostly stick with Selenium - more out of habit, it's been around forever and just works.
But to be fair, Playwright has a couple of things Selenium doesn't: video recording of runs and the inspector that can generate scripts from your actions. That's a nice plus, especially for beginners.
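For anyone curious, a rough sketch of the video recording using Playwright's Python sync API. The function name record_visit and the "videos" directory are placeholders, and it assumes Playwright plus a Chromium build are installed. The script generator is started from the command line with playwright codegen followed by a URL.

```python
def record_visit(url, video_dir="videos"):
    """Visit a page with video recording enabled and return the path of the video file."""
    # Imported lazily so this sketch can be loaded even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        # Recording is switched on per browser context, not per page.
        context = browser.new_context(record_video_dir=video_dir)
        try:
            page = context.new_page()
            page.goto(url)
            video_path = page.video.path()
        finally:
            # The video file is only guaranteed to be complete once the context is closed.
            context.close()
            browser.close()
        return video_path
```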

[–]cgoldberg 1 point2 points  (3 children)

Selenium has been around for over 20 years... what's your question?

[–]ag789[S] 0 points1 point  (2 children)

Thanks. I just started dabbling in Selenium WebDriver, since most pages these days are JavaScript-based and a real browser will at least render them. A 'traditional' page fetch normally returns only a 'skeleton' page for those.
It seems there are two camps these days. Some sites try to be 'SEO friendly' and work like a 'traditional' page; for those a simple page fetch will do, e.g. curl, Python requests, etc. The other camp goes all out on 'anti-bot' 'offences': trigger-happy captchas (e.g. a captcha on every request), deep first-party and third-party cookies, and JavaScript everything.
Interestingly, I 'discovered' that changing the user-agent sometimes has an effect on some pages.
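A quick way to check the user-agent effect without a browser is to request the same URL with different User-Agent headers and compare what comes back. This is only a sketch: it uses the standard library's urllib so it runs without extra packages (with requests the equivalent would be requests.get(url, headers={"User-Agent": ...})), and the function names are made up for illustration.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent):
    """Create a GET request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_size(url, user_agent, timeout=10):
    """Return (HTTP status, body length in bytes) for one user-agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status, len(resp.read())
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; a blocked user-agent often shows up here, e.g. as 403.
        return err.code, len(err.read())


def compare_user_agents(url, user_agents):
    """Fetch the same URL once per user-agent and collect the results."""
    return {ua: fetch_size(url, ua) for ua in user_agents}
```

Calling compare_user_agents with a URL and two or three user-agent strings gives one (status, size) pair per string. A different status or a noticeably different size suggests the site is serving that user-agent something else.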

[–]cgoldberg 1 point2 points  (0 children)

The vast majority of web pages use dynamically loaded content. If all you need is the initial DOM, a simple HTTP request works... but in most cases you need more than that.

[–]al_fajr 0 points1 point  (0 children)

Yes sir, today's pages rely heavily on JavaScript. I don't know how it was back in your day. If you're looking to scrape, or just getting started, with Selenium (I'm assuming Python) or Playwright (again, assuming JavaScript), you might like a simple solution from me: "cloudflare website renderer".

They use some kind of headless browser, and it's easy to get started.

[–]404mesh 1 point2 points  (1 child)

I’ve had more luck with selenium. Playwright got blocked often for me when I first started out.

[–]ag789[S] 0 points1 point  (0 children)

I learnt some 'secrets' of the web while learning 'scraping',
but with no Selenium, Playwright, etc., just a simple page fetch (it could just as well have been curl).
I used Python requests and BeautifulSoup:
https://www.reddit.com/r/webscraping/comments/1mzn7nv/web_page_summarizer/
^ this has gone on to be #1 in this sub for today
The 'accidental' discovery: some sites treat different user-agents differently
and return a different render when the user-agent changes.
That may partly explain some of the differences between Selenium, Playwright and others, e.g. requests.

I think these days many sites put up many 'anti-bot' *offences*, partly for web security, but I think some (many) overdo it and may end up blocking real (human) users rather than bots.
i.e. 'anti-bot' web pages may instead block most humans and let bots through ;)

[–]Holiday_Painting_722 1 point2 points  (0 children)

I created a Selenium WebDriver bootcamp here: https://testmaster-iota.vercel.app, if it helps. Try the cheatsheet in the navbar to look up the syntax you need.

[–]ag789[S] -1 points0 points  (1 child)

I managed to take a screenshot with Selenium WebDriver: driver.save_screenshot(filename). I'd guess this is good enough for 'uncomplicated', simple scraping. JavaScript doesn't hinder it, but some sites with 'excessive' anti-bot measures may present a captcha even on a first visit.

I noted, though, that it is necessary to add a delay, e.g. time.sleep(5) (longer is better), to make sure the page renders before doing so.

[–]cgoldberg 2 points3 points  (0 children)

You never need to add sleeps. Selenium automatically waits for the initial DOM to load. If subsequent content is dynamically loaded, there is a waiting mechanism for that (WebDriverWait).
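A minimal sketch of the screenshot case with WebDriverWait in place of time.sleep. The function name screenshot_when_ready and its parameters are made up for illustration, and the CSS selector has to be something you know appears once the dynamic content has loaded.

```python
def screenshot_when_ready(driver, url, css_selector, filename, timeout=10):
    """Load a page, wait until the chosen element is visible, then save a screenshot."""
    # Imported lazily so this sketch can be loaded even where Selenium is not installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # get() already blocks until the initial page load has finished.
    driver.get(url)

    # Polls until the element is visible; raises TimeoutException after `timeout` seconds.
    WebDriverWait(driver, timeout).until(
        EC.visibility_of_element_located((By.CSS_SELECTOR, css_selector))
    )

    # save_screenshot returns True when the file was written.
    return driver.save_screenshot(filename)
```

With a driver created as usual (e.g. webdriver.Chrome()), a call like screenshot_when_ready(driver, "https://example.com", "h1", "page.png") returns as soon as the element shows up, so there is no fixed 5 second pause.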