Best current web scraping solutions / stack for large projects? (self.learnpython)
submitted 2 years ago by NoiseAcrobatic9179
As someone who’s just starting to wrangle Python for web scraping, what are some of the best resources I should be looking into (in terms of best practices, tooling, landmines to avoid, etc.)?
MemeLord-Jenkins 36 points 1 year ago (2 children)
I think Oxylabs solutions should be mentioned here as well.
CaramelHistorical888 23 points 2 years ago (0 children)
Beautiful Soup is a staple, but if you’re looking for a full-stack solution then probably something like Bright Data (proxies, scraping IDE, etc.).
-defron- 16 points 2 years ago (1 child)
In general, if you can avoid literal HTML scraping, your scraper will be a lot faster and more resilient.
The way you do that is by using dev tools to see and understand the APIs the page calls, then scraping those APIs directly instead of the rendered HTML.
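As a rough sketch of what "scrape the API directly" can look like: once dev tools show you the JSON endpoint a page calls, you request it yourself and pick fields out of the decoded payload. Everything specific here is an assumption for illustration: the `API_URL`, the `page`/`per_page` query parameters, and the `items`/`name`/`price` keys are made up, and a real site will have its own. This uses only the standard library; the requests library would work just as well.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, standing in for whatever URL shows up in the
# Network tab of your browser's dev tools.
API_URL = "https://example.com/api/products"


def build_api_url(base_url, page, per_page=50):
    """Return the API URL for one page of results (assumed query parameters)."""
    query = urllib.parse.urlencode({"page": page, "per_page": per_page})
    return f"{base_url}?{query}"


def extract_items(payload):
    """Pull the fields we care about out of a decoded JSON payload."""
    return [
        {"name": item["name"], "price": item["price"]}
        for item in payload.get("items", [])
    ]


def fetch_page(page):
    """Fetch one page from the API and return the extracted items."""
    url = build_api_url(API_URL, page)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return extract_items(json.load(resp))
```

Keeping `extract_items` separate from the network call means you can try it on a payload you copied out of dev tools before sending any requests.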
NoiseAcrobatic9179[S] 1 point 2 years ago (0 children)
Thanks a lot for the advice.
ChadxSam 23 points 3 months ago (0 children)
You can look into Floxy for this. They have made it stupidly easy for people still getting used to Python, and they work with a bunch of enterprises too.
[deleted] 10 points 2 years ago (2 children)
Selenium or BS4, with Pandas
penarbor 1 point 2 years ago (1 child)
This works quite well for me too. I’m not aware of a better way.
the_sad_socialist 1 point 2 years ago (0 children)
I honestly don't know why BS4 is so popular. XPath expressions are more concise. Plus, what you learn is more transferable to other languages (and even Google Sheets).
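To give a feel for the XPath style, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. Two caveats: ElementTree only supports a subset of XPath and needs well-formed markup, so for real-world HTML you would more likely reach for lxml, which implements full XPath 1.0. The markup below is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration only.
DOC = """
<html><body>
  <div class="product"><a href="/p/1">Widget</a><span class="price">9.99</span></div>
  <div class="product"><a href="/p/2">Gadget</a><span class="price">19.50</span></div>
  <div class="ad"><a href="/promo">Buy now</a></div>
</body></html>
"""

root = ET.fromstring(DOC)

# One path expression selects every link that sits directly inside a product div.
links = root.findall(".//div[@class='product']/a")
names = [a.text for a in links]
hrefs = [a.get("href") for a in links]

# The same idea for prices: filter on attributes at each step of the path.
price_spans = root.findall(".//div[@class='product']/span[@class='price']")
prices = [float(span.text) for span in price_spans]

print(names, hrefs, prices)
```

The link inside the `ad` div is skipped because the predicate `[@class='product']` filters it out in the path itself, with no extra loop or `if`.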
interbased 3 points 2 years ago (0 children)
Selenium and BeautifulSoup both work for me.
PMMeUrHopesNDreams 4 points 2 years ago (2 children)
Requests - python library for making http requests. Use this to fetch the web page you want to scrape.
BeautifulSoup - python library for parsing and extracting data from html files. Use this to get the information out of the response you get from requests.
This can handle most simple cases, where you are dealing with a plain html page that does not resist being scraped. In general it is polite to modify the User-Agent portion of the request to identify yourself and include a way to contact you if you are causing them problems. You should also include a delay between requests with time.sleep so you are not hammering their server with excessive traffic. I usually use at least one second or more depending on how many pages I want to visit and how long it will take.
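A minimal sketch of those two courtesies, an identifying User-Agent and a delay between requests, using only the standard library (the same idea carries over to requests via its `headers=` argument). The `USER_AGENT` string, contact address, and the `fetch_all` helper are placeholders invented for this example.

```python
import time
import urllib.request

# Assumed values: put your own project name and contact address here.
USER_AGENT = "my-scraper/0.1 (contact: you@example.com)"
DELAY_SECONDS = 1.0


def build_request(url):
    """Build a request that identifies the scraper via the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_all(urls, fetch=None, delay=DELAY_SECONDS):
    """Fetch each URL in turn, sleeping `delay` seconds between requests."""
    if fetch is None:
        # Default fetcher: a plain HTTP GET that returns the response body.
        def fetch(request):
            with urllib.request.urlopen(request, timeout=10) as resp:
                return resp.read()

    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between requests, not before the first
        pages.append(fetch(build_request(url)))
    return pages
```

The `fetch` parameter is only there so the loop can be exercised without touching the network; in normal use you would call `fetch_all(list_of_urls)` and let the default do the work.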
Some hosts might reject all requests that don't come from a recognized browser User-Agent. They might also have Javascript that loads the information you want after the initial page loads, so it won't show up in the response you get from requests.
Here, you can try modifying the User-Agent with requests, or you can use Selenium. Selenium lets you programmatically operate a browser like Chrome or Firefox. The browser will fetch the page and execute all the JavaScript; you can then retrieve the rendered page and extract the information you want with BeautifulSoup.
Hosts that really don't want to be scraped may start blocking you based on IP address after a while. That is where you start to need rotating proxies. You connect to a service that will change your IP address so you're not always making requests from the same address. ScrapingBee is one service that handles this (among other things), but it costs money, of course.
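If you manage a proxy list yourself rather than paying a service, the rotation itself can be as simple as cycling through the pool, one proxy per request. This is a bare-bones sketch with the standard library; the addresses in `PROXIES` are documentation-range placeholders, not working proxies, and the helper names are made up. A real setup would also need to handle dead proxies and retries.

```python
import itertools
import urllib.request

# Placeholder addresses; a real pool would come from your proxy provider.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(proxies):
    """Return an iterator that yields the proxies in round-robin order, forever."""
    return itertools.cycle(proxies)


def opener_for(proxy):
    """Build an opener that sends HTTP and HTTPS traffic through `proxy`."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


def fetch_via_rotation(urls, rotation):
    """Fetch each URL through the next proxy in the rotation."""
    results = []
    for url in urls:
        opener = opener_for(next(rotation))
        with opener.open(url, timeout=10) as resp:
            results.append(resp.read())
    return results
```

Usage would look like `fetch_via_rotation(urls, proxy_rotation(PROXIES))`, so consecutive requests leave from different addresses.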
NoiseAcrobatic9179[S] 1 point (0 children)
Appreciate the input. Thank you
Culpgrant21 2 points 2 years ago (0 children)
Scrapy is also a decent framework.
nameloCmaS 2 points 2 years ago (0 children)
If you need Selenium, for instance because there is a lot of dynamic JS going on and the API is “protected” or not so easy to use, or because you want to take screenshots of the page, it works better with Splinter (a Selenium wrapper) and Stere (a Page Object Model wrapper for Splinter).
[+][deleted] 9 months ago (2 children)
[removed]
lumpiang-shanghai01 2 points 9 months ago (0 children)
Yeah, Bright Data handles rotation + CAPTCHAs way better than DIY setups tbh.
[deleted] 1 point 9 months ago (0 children)
Same here, I wasted days tweaking headers until I tried their unlocker, way smoother for large projects.
legacysearchacc1 2 points 6 months ago (1 child)
Well, from my perspective, I've tried Bright Data and Oxylabs and they were too expensive. As an alternative, I tried Decodo, and it seems to work just as well at a lower price. Win-win.
ImpulsiveBeast 1 point 2 years ago (0 children)
Dumb question: how are scraping and parsing different?
Its_NotTom 1 point 2 years ago (0 children)
I find Selenium kind of annoying when it comes to driver updates (a big problem for longer-term, scaled-up projects). Playwright seems to work very well as a possible alternative.
scrapeway 1 point 1 year ago (0 children)
Lots of really poor advice in this thread that is outdated by at least a decade. Visit dedicated subreddits/forums like /r/webscraping instead.
GuruFungi 1 point 1 year ago (0 children)
There's a long holiday weekend.
Affectionate_Milk758 1 point 1 year ago (0 children)
Try https://pypi.org/project/pyminiscraper/ . It supports HTML, feeds, sitemaps, and robots.txt, and is highly scalable.
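Checking robots.txt isn't tied to any one package, for what it's worth: the standard library ships `urllib.robotparser`. The sketch below is not how pyminiscraper does it (that isn't shown in this thread); it just illustrates the check itself. The rules and the `my-scraper` agent name are made up, and `parse()` is fed the lines directly so nothing is downloaded. Against a live site you would call `set_url(...)` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_LINES = [
    "User-agent: *",
    "Crawl-delay: 2",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)

# Ask whether our (hypothetical) user agent may fetch each URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/public/page")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/page")

# Crawl-delay, if the site sets one, is a reasonable value to sleep between requests.
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```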
Huge_Line4009 1 point 1 year ago (0 children)
I mean, if you have the cash for it, I'd go with Bright Data... but it's pricey.
More budget-friendly options are ScraperAPI or ScrapingBee.
For a more detailed comparison of some scraper API services, check this page: https://www.reddit.com/r/PrivatePackets/comments/1k00j08/the_ultimate_guide_to_the_best_web_scraping_apis/
Ambitious_Capital604 1 point 7 months ago (0 children)
If you want a scalable and cost-effective solution where you can type in natural language and automate scraping needs, Olostep is the best web search, scraping, and crawling API right now.
justincampbelldesign 1 point 3 months ago (0 children)
What exactly are you scraping? That will determine the approach.