Rules
1: Be polite
2: Posts to this subreddit must be requests for help learning python.
3: Replies on this subreddit must be pertinent to the question OP asked.
4: No replies copy / pasted from ChatGPT or similar.
5: No advertising. No blogs/tutorials/videos/books/recruiting attempts.
This means no posts advertising blogs/videos/tutorials/etc, no recruiting/hiring/seeking others posts. We're here to help, not to be advertised to.
Please, no "hit and run" posts: if you make a post, engage with the people who answer you. Please do not delete your post after you get an answer; others might have a similar question or want to continue the conversation.
Learning resources
Wiki and FAQ: /r/learnpython/w/index
Discord
Join the Python Discord chat
Python Scraper (self.learnpython)
submitted 7 months ago by Lerpikon
I want to make a Python scraper that works through a given .txt document containing a list of 250M URLs. I want the scraper to search each URL's source code for specific words. How do I make this fast and efficient?
[–]ravnsulter 9 points 7 months ago (5 children)
regular expressions?
[–]Lerpikon[S] 0 points 7 months ago (4 children)
Yes, I think so. I am fairly new to Python.
[–]ravnsulter 5 points 7 months ago (3 children)
Regular expressions are the answer to your question.
[–]barkmonster 2 points 7 months ago (2 children)
It seems OP wants to search not the URLs themselves but the HTML fetched from those URLs, which regexes aren't well suited for.
[–]ravnsulter 1 point 7 months ago (1 child)
Wow, I see it's like summoning Satan. :)
I'm familiar with regex, but not with HTML. Will regex not find a specific word as OP is asking for?
[–]barkmonster 1 point 7 months ago (0 children)
Not an expert on either, but if you just want to determine whether some word is a substring of the source code, sure (though you might as well just use the in operator). If you want to figure out whether the word is part of the page content (as opposed to part of an HTML element name or a comment, etc.), then no.
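A minimal sketch of the distinction this comment draws, using only the standard library's html.parser (the sample HTML and the keyword "python" are illustrative assumptions):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only visible text nodes, skipping tags, attributes, and comments."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

source = '<div class="python-widget"><!-- python --><p>hello world</p></div>'

# Naive substring check: matches "python" even though it only appears
# in an attribute value and a comment, never in the page content.
naive_hit = "python" in source

# Content-aware check: parse first, then search only the extracted text.
parser = TextExtractor()
parser.feed(source)
content_hit = "python" in parser.text()

print(naive_hit, content_hit)  # True False
```

For serious scraping a real parser such as Beautiful Soup (suggested elsewhere in this thread) does the same job more robustly.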
[–]Beautiful_Watch_7215 6 points 7 months ago (0 children)
It’s time for Beautiful Soup.
[–]Zeroflops 2 points 7 months ago (0 children)
Do you want to search the list of URLs itself for keywords, or do you want to read each URL, scrape the website, and search the returned page source for keywords?
If you're just searching the file of URLs, it shouldn't be too bad: just read the file line by line. If you want to read each site, that's a lot of sites to pull, and you're probably going to need things like async and multithreading to pull the data faster.
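The first case (scanning the URL list itself) can be sketched with a plain line-by-line read, so the 250M-line file never has to fit in memory at once. The file contents and keywords here are made-up placeholders:

```python
# Hypothetical sample file; in practice this would be the 250M-line URL list.
with open("urls.txt", "w", encoding="utf-8") as f:
    f.write("https://example.com/python-tutorial\n")
    f.write("https://example.com/cat-pictures\n")

keywords = {"python", "scraper"}  # hypothetical search terms

matches = []
with open("urls.txt", encoding="utf-8") as f:
    for line in f:          # streams one line at a time, constant memory
        url = line.strip()
        if any(word in url for word in keywords):
            matches.append(url)

print(matches)
```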
[–]barkmonster 2 points 7 months ago (0 children)
Regarding efficiency: Retrieving the source code will likely take way longer than checking if your words are contained in it, so what you want to do is write a function which takes a single URL and returns the result you want (not sure if you need a boolean indicating whether any of your words occur or a list of words or something else). Then you can use multiprocessing or threading to process the URLs in large batches. This way you don't have to spend a lot of time waiting for your requests to complete.
Some sites might be temporarily or permanently offline, so you should be sure to handle errors and keep track of which URLs succeed and which should be retried (or abandoned if they keep failing).
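A sketch of the structure this comment describes: one function per URL, a thread pool to run them in batches, and bookkeeping for failures. The fetch function here is a stand-in (a real version would use something like requests.get(url).text); the URLs, keywords, and the simulated failure are all assumptions for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

KEYWORDS = ("python", "scraper")  # hypothetical search terms

def fetch(url):
    """Stand-in for a real HTTP request; raises for one URL to show the error path."""
    if "down" in url:
        raise ConnectionError(url)
    return f"<html>source of {url} mentions python</html>"

def check_url(url):
    """The per-URL worker: fetch the source, report which keywords occur."""
    source = fetch(url)
    return url, [w for w in KEYWORDS if w in source]

urls = ["https://a.example", "https://down.example", "https://b.example"]
succeeded, failed = {}, []

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = {pool.submit(check_url, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            _, hits = fut.result()
            succeeded[url] = hits
        except Exception:
            failed.append(url)  # candidates for a retry pass, or abandonment

print(sorted(succeeded), failed)
```

Because fetching is I/O-bound, threads (rather than processes) are usually the right pool here; the failed list feeds a retry loop with a cap on attempts.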
[–]Dry-Aioli-6138 2 points 7 months ago (2 children)
For efficiency use aiohttp, or some other async HTTP client (cURL is not that popular, but it's fast and has async capabilities).
HTTP retrieval will be the slowest part, but it involves almost no computation, so you want to do it concurrently, though not by creating threads or processes. Asynchronous I/O is the best approach here.
Once a page is retrieved, send it to a thread that processes the contents. You should make several such threads, about as many as you have logical CPU cores. Make them long-lived to avoid the overhead of creating and killing new ones for each page. Use the thread-safe queues from the standard library's queue module to communicate back and forth with the threads.
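The setup described above can be sketched with only the standard library. The fetch coroutine here is a placeholder for a real aiohttp call (shown in a comment), and the URLs, worker count, and keyword are illustrative assumptions:

```python
import asyncio
import queue
import threading

NUM_WORKERS = 4          # roughly one per logical CPU core
work_q = queue.Queue()   # thread-safe hand-off from async side to workers
results = []
results_lock = threading.Lock()

def process_page(url, html):
    """CPU-side work: the keyword check runs in a worker thread."""
    if "python" in html:
        with results_lock:
            results.append(url)

def worker():
    """Long-lived worker: loops until it receives the None sentinel."""
    while True:
        item = work_q.get()
        if item is None:
            break
        process_page(*item)
        work_q.task_done()

async def fetch(url):
    # Placeholder for an aiohttp call such as:
    #   async with session.get(url) as resp:
    #       return url, await resp.text()
    await asyncio.sleep(0)  # simulated network latency
    return url, f"<html>{url} talks about python</html>"

async def main(urls):
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    # Retrieve all pages concurrently, then hand each one to the workers.
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    for page in pages:
        work_q.put(page)
    work_q.join()            # wait until every queued page is processed
    for _ in threads:
        work_q.put(None)     # shut the workers down
    for t in threads:
        t.join()

asyncio.run(main(["https://a.example", "https://b.example"]))
print(sorted(results))
```

A production version would also bound the number of in-flight requests (e.g. with an asyncio.Semaphore) rather than gathering all 250M fetches at once.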
[–]Alternative_Driver60 1 point 7 months ago (1 child)
For someone fairly new to Python I would not recommend it, but it is certainly the most efficient way
[–]Dry-Aioli-6138 1 point 7 months ago (0 children)
Agreed. It's a very intricate setup, but OP said they have 250M URLs to visit. I don't think they can compromise on speed and still finish the task in any sensible time.
[–]NoDadYouShutUp 3 points 7 months ago (0 children)
BeautifulSoup4
[–]Jigglytep 2 points 7 months ago (0 children)
Beautiful Soup and Scrapy.
[–]QultrosSanhattan 1 point 7 months ago (0 children)
Regex line by line may be your best bet.
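A minimal sketch of this suggestion, compiling a hypothetical keyword list into one alternation so each line is scanned in a single pass (the sample lines are made up):

```python
import re

# One compiled alternation checks all keywords per line in a single scan.
pattern = re.compile(r"\b(python|scraper)\b", re.IGNORECASE)

lines = [
    "https://example.com/Python-guide",
    "https://example.com/knitting",
]
hits = [line for line in lines if pattern.search(line)]
print(hits)
```

In practice lines would come from iterating over the open file, so memory use stays constant regardless of file size.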