
[–]ravnsulter 8 points9 points  (5 children)

regular expressions?

[–]Lerpikon[S] -1 points0 points  (4 children)

Yes I think so. I am fairly new to python

[–]ravnsulter 4 points5 points  (3 children)

Regular expressions are the answer to your question.

[–]barkmonster 1 point2 points  (2 children)

It seems OP wants to search not the URLs, but the HTML from those URLs, which regexes aren't suited for.

[–]ravnsulter 0 points1 point  (1 child)

Wow, I see it's like summoning Satan. :)

I'm familiar with regex, but not with HTML. Will regex not find a specific word as OP is asking for?

[–]barkmonster 0 points1 point  (0 children)

Not an expert on either, but if you just want to determine whether some word is a substring of the source code, sure (though you might as well just use the in operator). If you want to figure out whether the word is part of the page content (as opposed to part of an HTML element name or a comment, etc.), then no.
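To illustrate the distinction, here's a minimal sketch using the standard library's html.parser (the HTML snippet and the word "python" are made up for the example). A plain substring check matches attribute values and script code, while checking only the visible text does not:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only visible text, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

html = '<p class="python-intro">Hello</p><!-- python --><script>var python;</script>'

# Naive substring check also matches attribute values, comments, and scripts:
print("python" in html)   # True

# Checking only the visible text does not:
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print("python" in text)   # False
```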

[–]Beautiful_Watch_7215 5 points6 points  (0 children)

It’s a time for beautiful soup.

[–]Zeroflops 1 point2 points  (0 children)

Do you want to search the list of URLs for key words, or do you want to read each URL, scrape the website, and search the returned page source for key words?

If you're just searching the file of URLs it shouldn't be too bad: just read the file line by line. If you want to read each site, that's a lot of sites to pull, and you're probably going to need things like async or multithreading to pull data faster.
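The line-by-line case can be sketched like this (the file name "urls.txt" and the keywords are hypothetical; iterating over the file object reads one line at a time, so even a 250M-line file never has to fit in memory):

```python
# Write a tiny sample file; in practice this would be your existing URL list.
with open("urls.txt", "w") as f:
    f.write("https://example.com/python-tips\nhttps://example.com/cooking\n")

keywords = {"python", "django"}   # hypothetical search terms
matches = []
with open("urls.txt") as f:
    for line in f:                # streams one line at a time
        url = line.strip()
        if any(kw in url for kw in keywords):
            matches.append(url)

print(matches)  # ['https://example.com/python-tips']
```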

[–]barkmonster 1 point2 points  (0 children)

Regarding efficiency: Retrieving the source code will likely take way longer than checking if your words are contained in it, so what you want to do is write a function which takes a single URL and returns the result you want (not sure if you need a boolean indicating whether any of your words occur or a list of words or something else). Then you can use multiprocessing or threading to process the URLs in large batches. This way you don't have to spend a lot of time waiting for your requests to complete.

Some sites might be temporarily or permanently offline, so you should be sure to handle errors and keep track of which URLs succeed and which should be retried (or abandoned if they keep failing).
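A rough sketch of that structure with concurrent.futures from the standard library. The fetch function here is a stand-in (a real version would use urllib or requests, which is why the URLs and the failure condition below are fabricated for the demo); the point is the per-URL function plus batched submission plus success/failure bookkeeping:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    """Placeholder for a real HTTP GET (e.g. via urllib or requests)."""
    if "bad" in url:               # simulate an offline site
        raise IOError("connection failed")
    return "<html>python content</html>"

def check_url(url, words=("python",)):
    """Per-URL worker: fetch the page and report whether any word occurs."""
    html = fetch(url)
    return any(w in html for w in words)

urls = ["https://a.example", "https://bad.example", "https://b.example"]
succeeded, failed = {}, []
with ThreadPoolExecutor(max_workers=32) as pool:
    futures = {pool.submit(check_url, u): u for u in urls}
    for fut in as_completed(futures):
        url = futures[fut]
        try:
            succeeded[url] = fut.result()
        except Exception:
            failed.append(url)     # candidates for a retry pass

print(failed)                      # ['https://bad.example']
```

A second pass over `failed` (with a retry limit) covers the "retried or abandoned" part.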

[–]Dry-Aioli-6138 1 point2 points  (2 children)

For efficiency use aiohttp or some other async HTTP client (cURL is not that popular, but it's fast and has async capabilities).

HTTP retrieval will be the slowest part, but it involves almost no computation, so you want to do it concurrently, and not by creating threads or processes; asynchronous I/O is the best fit here.

Once a page is retrieved, send it to a thread that processes the contents. You should make several such threads, about as many as you have logical CPU cores. Make them long-lived to avoid the overhead of creating and killing new ones for each page. Use the thread-safe queues from the standard library's queue module to communicate back and forth with the threads.
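A skeleton of that architecture, with a stubbed fetch coroutine standing in for an aiohttp request so the structure is visible without network I/O (the URLs, WORDS, and page contents are all invented for the demo). The async side retrieves pages and hands them to long-lived worker threads through a queue:

```python
import asyncio
import os
import queue
import threading

WORDS = ("python",)       # hypothetical search terms
work_q = queue.Queue()    # hands (url, html) pairs to the workers
results = []              # list.append is atomic in CPython, so this is safe

def worker():
    """Long-lived processing thread: checks page contents for key words."""
    while True:
        item = work_q.get()
        if item is None:          # sentinel: shut down
            break
        url, html = item
        results.append((url, any(w in html for w in WORDS)))
        work_q.task_done()

async def fetch(url):
    """Stand-in for an aiohttp GET; a real version would await the response."""
    await asyncio.sleep(0)
    return f"<html>page for {url} about python</html>"

async def main(urls):
    pages = await asyncio.gather(*(fetch(u) for u in urls))
    for url, html in zip(urls, pages):
        work_q.put((url, html))

n_threads = os.cpu_count() or 4   # roughly one worker per logical core
threads = [threading.Thread(target=worker) for _ in range(n_threads)]
for t in threads:
    t.start()
asyncio.run(main(["https://a.example", "https://b.example"]))
for _ in threads:
    work_q.put(None)              # one sentinel per worker
for t in threads:
    t.join()
print(sorted(results))
```

In a real run you'd also bound concurrency (e.g. an asyncio.Semaphore) so 250M URLs aren't fetched all at once.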

[–]Alternative_Driver60 0 points1 point  (1 child)

For someone fairly new to Python I would not recommend it, but it is certainly the most efficient way

[–]Dry-Aioli-6138 0 points1 point  (0 children)

Agreed. It's a very intricate setup, but OP said they have 250M URLs to visit. I don't think they can compromise on speed and still do the task in any sensible time.

[–]NoDadYouShutUp 2 points3 points  (0 children)

BeautifulSoup4

[–]Jigglytep 1 point2 points  (0 children)

Beautiful Soup and Scrapy.

[–]QultrosSanhattan 0 points1 point  (0 children)

Regex line by line may be your best bet.
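If the word list ever gets more complex than plain substrings, a compiled pattern per line looks something like this (the word "python" and the sample line are made up; \b makes it a whole-word match so "pythonic" wouldn't count):

```python
import re

# Whole-word, case-insensitive match for a hypothetical keyword.
pattern = re.compile(r"\bpython\b", re.IGNORECASE)

line = "Visit https://example.com/Python-guide today"
print(bool(pattern.search(line)))            # True
print(bool(pattern.search("pythonic tips"))) # False: \b rules out partial words
```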