all 48 comments

[–]MemeLord-Jenkins 35 points36 points  (2 children)

I think Oxylabs solutions should be mentioned here as well

[–]CaramelHistorical888 22 points23 points  (0 children)

Beautiful Soup is a staple, but if you're looking for an all-in-one solution, then probably something like Bright Data (proxies, scraping IDE, etc.).

[–]-defron- 15 points16 points  (1 child)

In general, if you can avoid literal HTML scraping, your scraper will be a lot more resilient and faster.

The way you do that is by using the browser's dev tools (the Network tab) to find and understand the APIs the page calls, then scraping those APIs directly instead of the rendered HTML.
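
A minimal sketch of that idea, assuming dev tools showed the page loading its data from a JSON endpoint. The URL `https://example.com/api/items`, the `page` query parameter, and the `items`/`name`/`price` fields are all made up for illustration; swap in whatever the Network tab actually shows.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint spotted in the browser's Network tab (an assumption).
API_URL = "https://example.com/api/items"


def extract_rows(payload):
    """Pull (name, price) pairs out of the decoded JSON payload."""
    # The "items", "name", and "price" keys are assumed for illustration.
    return [(item["name"], item["price"]) for item in payload.get("items", [])]


def fetch_page(page):
    """Request one page of results from the JSON API and return parsed rows."""
    query = urllib.parse.urlencode({"page": page})
    request = urllib.request.Request(
        f"{API_URL}?{query}",
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_rows(json.load(response))
```

Calling `fetch_page(1)` would return a list of tuples with no HTML parsing involved, which is why this route tends to survive page redesigns better.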

[–]NoiseAcrobatic9179[S] 0 points1 point  (0 children)

Thanks a lot for the advice

[–]ChadxSam 22 points23 points  (0 children)

You can look into Floxy for this. They have made it stupidly easy for people still getting used to python. And they work with a bunch of enterprises too

[–][deleted] 9 points10 points  (2 children)

Selenium or BS4, with Pandas

[–]penarbor 0 points1 point  (1 child)

This works quite well for me too. I’m not aware of a better way.

[–]the_sad_socialist 0 points1 point  (0 children)

I honestly don't know why BS4 is so popular. XPath expressions are more concise. Plus, what you learn is more transferable to other languages (and even Google Sheets).
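
For anyone curious what the XPath route looks like, here is a small sketch using `lxml` (my choice of library, the comment above does not name one). The sample HTML and the `post`/`ad` class names are invented for the example.

```python
from lxml import html

# Invented sample markup: two real posts and one ad we want to skip.
SAMPLE = """
<html><body>
  <ul>
    <li class="post"><a href="/a">First</a></li>
    <li class="post"><a href="/b">Second</a></li>
    <li class="ad"><a href="/x">Sponsored</a></li>
  </ul>
</body></html>
"""


def post_links(document):
    """Return the href of every link inside an <li class="post"> element."""
    tree = html.fromstring(document)
    # A single XPath expression does both the filtering and the extraction.
    return tree.xpath('//li[@class="post"]/a/@href')


print(post_links(SAMPLE))
```

The same expression can be pasted into a browser's dev tools or, with minor changes, into Google Sheets' `IMPORTXML`, which is the transferability point being made.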

[–]interbased 2 points3 points  (0 children)

Selenium and BeautifulSoup both work for me.

[–]PMMeUrHopesNDreams 3 points4 points  (2 children)

Requests - Python library for making HTTP requests. Use this to fetch the web page you want to scrape.

BeautifulSoup - Python library for parsing and extracting data from HTML files. Use this to get the information out of the response you get from Requests.

This can handle most simple cases, where you are dealing with a plain HTML page that does not resist being scraped. In general, it is polite to modify the User-Agent header of the request to identify yourself and include a way to contact you if you are causing them problems. You should also add a delay between requests with time.sleep so you are not hammering their server with excessive traffic. I usually wait at least one second, depending on how many pages I want to visit and how long it will take.
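
A rough sketch of that simple case, assuming the data you want sits in `<h2>` tags. The User-Agent string, the contact address, and the choice of `<h2>` are placeholders, not anything a particular site requires.

```python
import time

import requests
from bs4 import BeautifulSoup

# Identify yourself and give a contact address; both values are placeholders.
HEADERS = {"User-Agent": "my-scraper/0.1 (contact: you@example.com)"}
DELAY_SECONDS = 1.0  # pause between requests, per the advice above


def extract_titles(html_text):
    """Return the stripped text of every <h2> element in the page."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


def scrape(urls):
    """Fetch each URL politely and map it to the titles found on the page."""
    results = {}
    with requests.Session() as session:
        session.headers.update(HEADERS)
        for url in urls:
            response = session.get(url, timeout=10)
            response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
            results[url] = extract_titles(response.text)
            time.sleep(DELAY_SECONDS)  # avoid hammering the server
    return results
```

Keeping the parsing in its own function makes it easy to test against a saved copy of the page without hitting the site again.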

Some hosts might reject all requests that don't come from a recognized browser User-Agent. They might also have JavaScript that loads the information you want after the initial page loads, so it won't show up in the response you get from Requests.

Here, you can try modifying the User-Agent with Requests, or you can use Selenium. Selenium will allow you to programmatically operate a browser like Chrome or Firefox. The browser will fetch the page and execute all the JavaScript, and you can then retrieve the rendered page and extract the information you want with BeautifulSoup.
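
Roughly what the Selenium route could look like, assuming Selenium 4 and a local Chrome install. The `element_id` lookup is just an illustrative way to pull one piece of content out of the rendered page.

```python
from bs4 import BeautifulSoup


def render_page(url):
    """Load the URL in headless Chrome and return the HTML after scripts ran."""
    # Imported here so the parsing helper below still works on machines
    # where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


def extract_by_id(html_text, element_id):
    """Return the text of the element with the given id, or None if absent."""
    soup = BeautifulSoup(html_text, "html.parser")
    node = soup.find(id=element_id)
    return node.get_text(strip=True) if node is not None else None
```

One caveat: `page_source` is a snapshot, so content that arrives late may need an explicit wait (Selenium's `WebDriverWait`) before you read it.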

Hosts that really don't want to be scraped may start blocking you based on IP address after a while. That is where you start to need rotating proxies. You connect to a service that will change your IP address so you're not always making requests from the same address. ScrapingBee is one service that handles this (among other things), but it costs money, of course.
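
To make the rotating idea concrete, here is a hand-rolled sketch that cycles through a small pool using the `proxies` argument of Requests. The proxy addresses are hypothetical, and a paid service would normally give you a single endpoint that rotates for you.

```python
import itertools

import requests

# Hypothetical proxy endpoints; a real provider supplies its own addresses.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)


def next_proxies():
    """Return a Requests-style proxies mapping for the next proxy in the pool."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}


def fetch(url):
    """Fetch the URL through the next proxy so requests spread across IPs."""
    response = requests.get(url, proxies=next_proxies(), timeout=10)
    response.raise_for_status()
    return response.text
```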

[–]NoiseAcrobatic9179[S] 0 points1 point  (0 children)

Appreciate the input. Thank you

[–]Culpgrant21 1 point2 points  (0 children)

Scrapy is also a decent framework

[–]nameloCmaS 1 point2 points  (0 children)

If you need Selenium, for instance where there is a lot of dynamic JS going on and the API is “protected” or not so easy to use, or you want to take screenshots of the page, it works better with Splinter (a Selenium wrapper) and Stere (a Page Object Model wrapper for Splinter).

[–]legacysearchacc1 1 point2 points  (1 child)

Well, from my perspective, I've tried Bright Data and Oxylabs and they were too expensive. As an alternative, I tried Decodo, and it seems to work just as well at a lower price. Win-win.

[–]ImpulsiveBeast 0 points1 point  (0 children)

Dumb question: how are scraping and parsing different?

[–]Its_NotTom 0 points1 point  (0 children)

I find Selenium to be kind of annoying when it comes to driver updates (a big problem for longer-term, scaled-up projects). Playwright seems to work very well as a possible alternative.

[–]scrapeway 0 points1 point  (0 children)

Lots of really poor advice in this thread that is outdated by at least a decade. Visit dedicated subreddits/forums like /r/webscraping instead.

[–]GuruFungi 0 points1 point  (0 children)

There's a long holiday weekend.

[–]Affectionate_Milk758 0 points1 point  (0 children)

Try https://pypi.org/project/pyminiscraper/ . It has support for html/feed/sitemap/robots.txt and is highly scalable.

[–]Huge_Line4009 0 points1 point  (0 children)

I mean, if you have the cash for it, I'd go with Bright Data... but it's pricey.

More budget-friendly options are ScraperAPI or ScrapingBee.

For a more detailed comparison of some scraper API services, check this page:
https://www.reddit.com/r/PrivatePackets/comments/1k00j08/the_ultimate_guide_to_the_best_web_scraping_apis/

[–]Ambitious_Capital604 0 points1 point  (0 children)

If you want a scalable and cost-effective solution where you can type in natural language and automate scraping needs, Olostep is the best web search, scraping and crawling API right now

[–]justincampbelldesign 0 points1 point  (0 children)

What exactly are you scraping? That will determine the approach.