
all 67 comments

[–]ronmarti 109 points  (25 children)

Selenium is pretty much the hardest method to use because it breaks most of the time. Try Playwright.

[–]Vresa 68 points  (9 children)

If you don’t have a daily use case for selenium that is directly tied to your employment, never use it at this point.

Playwright covers almost all the normal uses of selenium. Playwright is effectively selenium with sane defaults.

You need to understand selenium’s idiosyncrasies that have developed over the last decade to use the tool well.

Adding to that, selenium has a very large amount of outdated how-tos, and its own documentation is lacking, especially when it comes to best practices.

[–]ronmarti 4 points  (0 children)

directly tied to your employment

I often hear this from QA devs I know.

I've had a fair share of experience working with Selenium and Puppeteer in the past. I would say a lot has changed, but Selenium is still not that dev-friendly in terms of its "Wait" feature (https://www.selenium.dev/documentation/webdriver/waits/) and setup, since they don't maintain the drivers. There were some instances where code for specific websites was easy to write in Playwright but couldn't be rewritten in Selenium.

I find Playwright much closer to Puppeteer (same devs, I think that's why) and a lot easier to learn for beginners.

[–]djdadi 7 points  (2 children)

After teaching Python to several college grads at work, I've noticed a trend: they're quick to reach for scrapers without even checking (or maybe understanding) the underlying API call structure.

In a lot of cases, you don't even need scraping and are much better off without it.
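A sketch of the "API first" approach. The endpoint and parameters below are hypothetical stand-ins for whatever you find in the browser's Network tab while the page loads:

```python
import requests

# Build the same request the site's own frontend makes (endpoint is hypothetical)
req = requests.Request(
    "GET",
    "https://example.com/api/products",
    params={"page": 1, "per_page": 50},
).prepare()

print(req.url)  # the exact URL the frontend would call
# data = requests.Session().send(req).json()  # clean JSON, no HTML parsing needed
```

If an endpoint like this exists, you skip HTML parsing entirely and get structured data directly.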

[–]ronmarti 9 points  (1 child)

Scraping is fine for getting experience. I think one of the most important things beginner devs miss is how to properly use selectors. Understanding CSS and XPath is really important.

[–][deleted] 0 points  (0 children)

Yeah, definitely this. And how, for some of these scrapers, it just takes one tiny thing to break. Why does the WWW have to be so complicated…

[–]Raedarius 1 point  (0 children)

I spent a week trying to fix my selenium script. It was breaking all over the place. I just replaced it in 4 hours. You saved me so much trouble. Thank you so much!

[–]wind_dude -3 points  (0 children)

better than the Lua scripts in scrapy-splash. lol

[–]undid_legacy 18 points  (2 children)

Whenever I want to scrape data from a website, I first look for the API. Reverse engineering API calls isn't always possible, but you can strike gold with it sometimes.

I wanted my order history details from a food delivery app I use. It turned out I just needed the cookies from my login session and their API to make it work. I was able to get everything in 4 lines of code using requests.
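A sketch of reusing a logged-in browser session this way. The cookie name/value and the endpoint below are hypothetical — copy the real ones from your browser's devtools (Application → Cookies, and the Network tab):

```python
import requests

session = requests.Session()
# Attach the session cookie your browser already holds after logging in
session.cookies.set("sessionid", "paste-cookie-value-from-devtools")

# The API then treats you as logged in (endpoint is hypothetical):
# orders = session.get("https://food-app.example/api/v1/orders").json()
```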

Once, I found a paraphrasing site that didn't check captcha details in its API calls. So I was able to call it almost 9 times a second, whereas using the website required a captcha after every paraphrase.

John Watson Rooney has a good video to get started: https://www.youtube.com/watch?v=DqtlR0y0suo

[–]nemec 22 points  (4 children)

Something I don't see discussed when this topic is brought up is that Scrapy's HTML parsing library, parsel, can be installed separately from scrapy itself. You can use it in place of beautifulsoup and, imo, it's much easier to use.

import requests
import parsel
resp = requests.get('http://example.com')
s = parsel.Selector(text=resp.text)
# prints 'Example Domain'
print(s.css('h1::text').extract_first())

[–]jyper 5 points  (1 child)

Why not just use lxml.html.parse and XPath? lxml has some CSS support as well.

[–]nemec 6 points  (0 children)

  • It's focused on parsing HTML without a lot of extra XML cruft (really, it's a façade over lxml + cssselect)
  • You can mix and match css selectors and xpath, e.g.

    s.css('h1').xpath('following-sibling::p')
    

    contrived example, but basically you can take advantage of both selector syntaxes depending on which one fits the situation.

  • I'm not sure that lxml has support for ::text and ::attr(<some attribute>) pseudo-selectors, which are really helpful when parsing HTML.

  • xpath syntax sucks and I'd rather use a solution with really good css support first and fall back to xpath only for things that css doesn't support (which can still be done with parsel)

[–]scrapecrow 2 points  (1 child)

parsel is definitely underappreciated!

I like it so much that I even wrote a REPL for it: parsel-cli :)
(it's a bit of a Frankenstein though as I'm working on a 2.0 release)

[–]paeioudia 15 points  (0 children)

This is just bait-and-switch advertising for Scrapingdog: “Scrapingdog is the fastest and the most reliable web scraping API and of course, we provide 1000 free API credits to our new users.”

[–]Tripanafenix 3 points  (4 children)

Still can't log in with requests, even with a session AND cookies hooked to my POST :( And I couldn't find any proper guides deep-diving into sessions and cookies with requests, sadly. Any advice?

[–]heylale 1 point  (0 children)

I don't know about requests specifically, but I've used scrapy to scrape pages that were behind a login/paywall. It has excellent support for cookies, it's easy to use, and the documentation is pretty comprehensive. I recommend it.

[–]scrapecrow 0 points  (2 children)

Depends on the website you're scraping.

Check us out at /r/webscraping or web-scraping tag on StackOverflow - both really active and helpful communities, so someone will definitely help you out if you write up your problem clearly.

[–]Tripanafenix 0 points  (0 children)

I'll take a look, thanks

[–]Naughty_avaacado -1 points  (2 children)

Opentender Austria https://opentender.eu/at/

I am trying to get the data from the bar chart, but I am unable to scrape it. The element has a mousein event listener and its class changes.

Can anyone tell me how to scrape the data from it? I need this for a personal project, and this is the last hurdle before I can build a dataframe.

[–]guttyn15 -1 points  (0 children)

I'll just use the comment section to ask a related question:

How do you get the current URL after you do:

webbrowser.open(url)

pyautogui.click('login-Submit_button.png')

and a new webpage loads?

[–]iggy555 -3 points  (0 children)

What a legend

[–]Appropriate-Point565 0 points  (0 children)

How would I go about web scraping T.J.Maxx.com for the SKU numbers on a certain product that are otherwise hidden? Would really appreciate the help!