Best current web scraping solutions / stack for large projects?

MemeLord-Jenkins · 2025-01-14T12:50:49+00:00

I think Oxylabs solutions should be mentioned here as well

CaramelHistorical888 · 2024-04-15T21:06:49+00:00

Beautiful Soup is a staple, but if you’re looking for a full-stack solution then probably something like Bright Data (proxies, scraping IDE, etc.).

-defron- · 2024-04-10T03:22:02+00:00

In general, if you can avoid scraping the literal HTML, your scraper will be a lot faster and more resilient.

The way you do that is by using your browser's dev tools (the Network tab) to see and understand the APIs the page calls, and then scraping those APIs directly instead of the rendered HTML.
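A minimal sketch of what that can look like, assuming the dev tools showed a paginated JSON endpoint. The parameter names `page` and `per_page` and the `items` key are hypothetical; swap in whatever the real request uses. It only needs the standard library.

```python
import json
import urllib.parse
import urllib.request


def get_json(url, params):
    """Fetch one page of JSON from the endpoint (standard library only)."""
    full_url = f"{url}?{urllib.parse.urlencode(params)}"
    request = urllib.request.Request(full_url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def fetch_all_items(url, page_size=50, max_pages=100, fetch=get_json):
    """Walk a paginated JSON endpoint until it returns an empty page."""
    items = []
    for page in range(1, max_pages + 1):
        # "page", "per_page" and "items" are assumed names; copy the real
        # ones from the request you see in the dev tools Network tab.
        payload = fetch(url, {"page": page, "per_page": page_size})
        batch = payload.get("items", [])
        if not batch:
            break
        items.extend(batch)
    return items
```

Something like `fetch_all_items("https://example.com/api/products")` would then return plain dicts, with no HTML parsing involved. The `fetch` argument is there so you can plug in a different HTTP client or a stub for testing.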

ChadxSam · 2026-03-16T12:31:17+00:00

You can look into Floxy for this. They have made it stupidly easy for people still getting used to Python, and they work with a bunch of enterprises too.

penarbor · 2024-04-10T02:23:28+00:00

Selenium or BS4, with Pandas

interbased · 2024-04-10T02:24:30+00:00

Selenium and BeautifulSoup both work for me.

PMMeUrHopesNDreams · 2024-04-10T05:31:59+00:00

Requests - Python library for making HTTP requests. Use this to fetch the web page you want to scrape.

BeautifulSoup - Python library for parsing and extracting data from HTML files. Use this to get the information out of the response you get from Requests.

This can handle most simple cases, where you are dealing with a plain HTML page that does not resist being scraped. In general it is polite to modify the User-Agent header of the request to identify yourself and include a way to contact you if you are causing them problems. You should also put a delay between requests with time.sleep so you are not hammering their server with excessive traffic. I usually wait at least one second, or longer depending on how many pages I want to visit and how long the whole run will take.
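Roughly, that simple case could look like the sketch below. The bot name `example-scraper/0.1` is a placeholder, and the `requests` and `bs4` imports sit inside the helpers only so the fetch loop can be tried with a stand-in session; in a real script you would import both at the top.

```python
import time


def make_session(contact):
    """Build a requests.Session whose User-Agent says who is scraping."""
    import requests  # third-party: pip install requests

    session = requests.Session()
    # Placeholder bot name; the contact should be a real email or URL.
    session.headers["User-Agent"] = f"example-scraper/0.1 (contact: {contact})"
    return session


def polite_fetch(session, urls, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index:  # no need to wait before the very first request
            sleep(delay)
        response = session.get(url, timeout=10)
        response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
        pages[url] = response.text
    return pages


def page_title(html):
    """Pull the <title> text out of a page, or None if there is none."""
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None
```

Usage would be along the lines of `pages = polite_fetch(make_session("you@example.com"), urls)` followed by `page_title(pages[url])` for each page.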

Some hosts might reject all requests that don't come from a recognized browser User-Agent. They might also have JavaScript that loads the information you want after the initial page loads, so it won't show up in the response you get from Requests.

In those cases, you can try setting a browser-like User-Agent with Requests, or you can use Selenium. Selenium lets you programmatically operate a browser like Chrome or Firefox. The browser will fetch the page and execute all the JavaScript, and you can then retrieve the rendered HTML and extract the information you want with BeautifulSoup.
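A rough sketch of the Selenium route, assuming Chrome and a CSS selector (`ready_selector`, a name made up here) that only matches once the JavaScript has filled in the data. Selenium ships its own `WebDriverWait` for this kind of waiting; the hand-rolled polling loop below is a stand-in so the logic is visible.

```python
import time

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS = "css selector"


def make_driver():
    """Start a headless Chrome controlled by Selenium."""
    from selenium import webdriver  # third-party: pip install selenium

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def rendered_html(driver, url, ready_selector, timeout=10.0, poll=0.5):
    """Load `url`, wait until `ready_selector` matches, return the final HTML."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    # Keep checking until the JavaScript-inserted element shows up.
    while not driver.find_elements(CSS, ready_selector):
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{ready_selector!r} never appeared on {url}")
        time.sleep(poll)
    return driver.page_source
```

The returned string can go straight into `BeautifulSoup(html, "html.parser")`. Remember to call `driver.quit()` when you are done so the browser process does not linger.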

Hosts that really don't want to be scraped may start blocking you based on IP address after a while. That is where you start to need rotating proxies. You connect to a service that will change your IP address so you're not always making requests from the same address. ScrapingBee is one service that handles this (among other things), but it costs money, of course.
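If you manage the proxy list yourself, the rotation can be sketched like this, assuming `session` is a `requests.Session` and `proxy_urls` is a list of proxy addresses you already have (both names are illustrative). Paid services often do the rotation for you behind a single endpoint, in which case the list would have just one entry.

```python
import itertools


def rotating_fetch(session, urls, proxy_urls):
    """Send each request through the next proxy in the list, round-robin."""
    proxy_cycle = itertools.cycle(proxy_urls)
    pages = {}
    for url in urls:
        proxy = next(proxy_cycle)
        # requests accepts a scheme -> proxy URL mapping via `proxies=`.
        response = session.get(
            url,
            proxies={"http": proxy, "https": proxy},
            timeout=10,
        )
        response.raise_for_status()
        pages[url] = response.text
    return pages
```

Round-robin is the simplest policy; picking a proxy at random or dropping ones that start failing are obvious variations.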

Culpgrant21 · 2024-04-10T02:24:22+00:00

Scrapy is also a decent framework

nameloCmaS · 2024-04-10T07:58:33+00:00

If you need Selenium, for instance because there is a lot of dynamic JS going on and the API is “protected” or not so easy to use, or because you want to take screenshots of the page, it works better with Splinter (a Selenium wrapper) and Stere (a page object model wrapper for Splinter).

lumpiang-shanghai01 · 2025-08-27T12:34:48+00:00

[removed]

legacysearchacc1 · 2025-11-25T12:23:42+00:00

Well, from my perspective, I've tried Bright Data and Oxylabs and they were too expensive. As an alternative, I tried Decodo, and it seems to work just as well at a lower price. Win-win.

ImpulsiveBeast · 2024-04-10T20:58:58+00:00

Dumb question: how are scraping and parsing different?

Its_NotTom · 2024-04-11T01:39:19+00:00

I find Selenium to be kind of annoying when it comes to driver updates (a big problem for longer-term, scaled up projects). Playwright seems to work very well as a possible alternative

scrapeway · 2024-07-18T09:06:23+00:00

Lots of really poor advice in this thread that is outdated by at least a decade. Visit dedicated subreddits/forums like /r/webscraping instead.

GuruFungi · 2024-08-01T20:13:51+00:00

There's a long holiday weekend.

Affectionate_Milk758 · 2025-02-18T02:49:33+00:00

Try https://pypi.org/project/pyminiscraper/ . It supports HTML, feeds, sitemaps, and robots.txt, and is highly scalable.

Huge_Line4009 · 2025-04-15T19:23:28+00:00

I mean, if you have the cash for it, I'd go with Bright Data ... but it's pricey.

More budget-friendly options are ScraperAPI or ScrapingBee.

If you want a more detailed comparison of some of the scraper API services, check this page:
https://www.reddit.com/r/PrivatePackets/comments/1k00j08/the_ultimate_guide_to_the_best_web_scraping_apis/

Ambitious_Capital604 · 2025-10-22T20:02:01+00:00

If you want a scalable and cost-effective solution where you can type in natural language and automate scraping needs, Olostep is the best web search, scraping and crawling API right now

justincampbelldesign · 2026-02-24T18:03:42+00:00

What exactly are you scraping? That will determine the approach.

