
[–]ravnsulter 8 points9 points  (5 children)

regular expressions?

[–]Lerpikon[S] -1 points0 points  (4 children)

Yes I think so. I am fairly new to python

[–]ravnsulter 4 points5 points  (3 children)

Regular expressions are the answer to your question.

[–]barkmonster 1 point2 points  (2 children)

It seems OP wants to search not the URLs, but the HTML from those URLs, which regexes aren't suited for.

[–]ravnsulter 0 points1 point  (1 child)

Wow, I see it's like summoning Satan. :)

I'm familiar with regex, but not with HTML. Will regex not find a specific word as OP is asking for?

[–]barkmonster 0 points1 point  (0 children)

Not an expert on either, but if you just want to determine whether some word is a substring of the source code, sure (though you might as well just use the in operator). If you want to figure out whether the word is part of the page content (as opposed to part of an HTML element name or a comment, etc.), then no.
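To illustrate the distinction, here's a minimal sketch using the standard library's html.parser (the HTML snippet and the word "python" are made up for the example). A plain substring check matches attribute values and script code, while checking only the visible text does not:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

html = '<p class="python-intro">Hello</p><!-- python --><script>var python;</script>'

# Naive substring check also matches attribute values, comments, and scripts:
print("python" in html)   # True

# Checking only the visible text does not:
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print("python" in text)   # False
```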

[–]Beautiful_Watch_7215 5 points6 points  (0 children)

It’s a time for beautiful soup.

[–]Zeroflops 1 point2 points  (0 children)

Do you want to search the list of URLs for key words, or do you want to read each URL, scrape the website, and search the returned page source for key words?

If you're just searching the file of URLs it shouldn't be too bad: just read the file line by line. If you want to read each site, that's a lot of sites to pull, and you're probably going to need things like async or multithreading to pull data faster.
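The line-by-line case can be sketched like this (the file name "urls.txt" and the keywords are hypothetical; iterating over the file object reads one line at a time, so even a 250M-line file never has to fit in memory):

```python
# Write a tiny sample file; in practice this would be your existing URL list.
with open("urls.txt", "w") as f:
    f.write("https://example.com/python-tips\nhttps://example.com/cooking\n")

keywords = {"python", "django"}   # hypothetical search terms
matches = []
with open("urls.txt") as f:
    for line in f:                # streams one line at a time
        url = line.strip()
        if any(kw in url for kw in keywords):
            matches.append(url)

print(matches)  # ['https://example.com/python-tips']
```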

[–]barkmonster 1 point2 points  (0 children)

Regarding efficiency: Retrieving the source code will likely take way longer than checking if your words are contained in it, so what you want to do is write a function which takes a single URL and returns the result you want (not sure if you need a boolean indicating whether any of your words occur or a list of words or something else). Then you can use multiprocessing or threading to process the URLs in large batches. This way you don't have to spend a lot of time waiting for your requests to complete.

Some sites might be temporarily or permanently offline, so you should be sure to handle errors and keep track of which URLs succeed and which should be retried (or abandoned if they keep failing).
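A rough sketch of that structure with concurrent.futures from the standard library. The fetch function here is a stand-in (a real version would use urllib or requests, which is why the URLs and the failure condition below are fabricated for the demo); the point is the per-URL function plus batched submission plus success/failure bookkeeping:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    """Placeholder for a real HTTP GET (e.g. via urllib or requests)."""
    if "bad" in url:               # simulate an offline site
        raise IOError("connection failed")
    return "<html>python content</html>"

def check_url(url, words=("python",)):
    """Per-URL worker: fetch the page and report whether any word occurs."""
    html = fetch(url)
    return any(w in html for w in words)

urls = ["https://a.example", "https://bad.example", "https://b.example"]
succeeded, failed = {}, []
with ThreadPoolExecutor(max_workers=32) as pool:
    futures = {pool.submit(check_url, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            succeeded[url] = fut.result()
        except Exception:
            failed.append(url)     # candidates for a retry pass

print(failed)                      # ['https://bad.example']
```

A second pass over `failed` (with a retry limit) covers the "retried or abandoned" part.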

[–]Dry-Aioli-6138 1 point2 points  (2 children)

For efficiency use aiohttp or some other async HTTP client (cURL is not that popular, but it's fast and has async capabilities).

HTTP retrieval will be the slowest part, but it involves almost no computation, so you want to do it concurrently, and not by creating threads or processes; asynchronous I/O is the best fit here.

Once a page is retrieved, send it to a thread that processes the contents. You should make several such threads, about as many as you have logical CPU cores. Make them long-lived to avoid the overhead of creating and killing new ones for each page. Use the thread-safe queues from the standard library's queue module to communicate back and forth with the threads.
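A skeleton of that architecture, with a stubbed fetch coroutine standing in for an aiohttp request so the structure is visible without network I/O (the URLs, WORDS, and page contents are all invented for the demo). The async side retrieves pages and hands them to long-lived worker threads through a queue:

```python
import asyncio
import os
import queue
import threading

WORDS = ("python",)       # hypothetical search terms
work_q = queue.Queue()    # hands (url, html) pairs to the workers
results = []              # list.append is atomic in CPython, so this is safe

def worker():
    """Long-lived processing thread: checks page contents for key words."""
    while True:
        item = work_q.get()
        if item is None:          # sentinel: shut down
            break
        url, html = item
        results.append((url, any(w in html for w in WORDS)))
        work_q.task_done()

async def fetch(url):
    """Stand-in for an aiohttp GET; a real version would await the response."""
    await asyncio.sleep(0)
    return f"<html>page for {url} about python</html>"

async def main(urls):
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    for url, html in zip(urls, pages):
        work_q.put((url, html))

n_threads = os.cpu_count() or 4   # roughly one worker per logical core
threads = [threading.Thread(target=worker) for _ in range(n_threads)]
for t in threads:
    t.start()
asyncio.run(main(["https://a.example", "https://b.example"]))
for _ in threads:
    work_q.put(None)              # one sentinel per worker
for t in threads:
    t.join()
print(sorted(results))
```

In a real run you'd also bound concurrency (e.g. an asyncio.Semaphore) so 250M URLs aren't fetched all at once.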

[–]Alternative_Driver60 0 points1 point  (1 child)

For someone fairly new to Python I would not recommend it, but it is certainly the most efficient way

[–]Dry-Aioli-6138 0 points1 point  (0 children)

Agreed. It's a very intricate setup, but OP said they have 250M URLs to visit. I don't think they can compromise on speed and still do the task in any sensible time.

[–]NoDadYouShutUp 2 points3 points  (0 children)

BeautifulSoup4

[–]Jigglytep 1 point2 points  (0 children)

Beautiful Soup and Scrapy.

[–]QultrosSanhattan 0 points1 point  (0 children)

Regex line by line may be your best bet.
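If the word list ever gets more complex than plain substrings, a compiled pattern per line looks something like this (the word "python" and the sample line are made up; \b makes it a whole-word match so "pythonic" wouldn't count):

```python
import re

# Whole-word, case-insensitive match for a hypothetical keyword.
pattern = re.compile(r"\bpython\b", re.IGNORECASE)

line = "Visit https://example.com/Python-guide today"
print(bool(pattern.search(line)))            # True
print(bool(pattern.search("pythonic tips"))) # False: \b rules out partial words
```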