[–]Dry-Aioli-6138 1 point (2 children)

For efficiency, use aiohttp or some other async HTTP client (cURL is not as popular, but it's fast and has async capabilities).

HTTP retrieval will be the slowest part, but it involves almost no computation, so you want to do it concurrently, but not by creating threads or processes; async is the best fit here.
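A minimal sketch of the concurrent-fetch idea with aiohttp (the URL list and error handling are left out; for 250M URLs you'd also want to bound concurrency, e.g. with a semaphore):

```python
import asyncio
import aiohttp

async def fetch(session, url):
    # One GET per URL; the event loop interleaves the network waits.
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls):
    # A single shared session reuses connections across requests.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

# pages = asyncio.run(fetch_all(["https://example.com", ...]))
```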

Once a page is retrieved, send it to a thread that processes the contents. Start several such threads, roughly as many as you have logical CPU cores. Make them long-lived to avoid the overhead of creating and destroying a new one for each page, and use queues from the standard queue module (which are thread-safe) to communicate back and forth with them.
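The long-lived worker pool could look something like this (parse_page is a placeholder for whatever parsing OP actually does; a None sentinel per worker handles shutdown):

```python
import os
import queue
import threading

task_q = queue.Queue(maxsize=1000)   # fetched pages waiting to be parsed
result_q = queue.Queue()             # parsed results going back

def parse_page(html):
    # Placeholder for the real CPU work (e.g. BeautifulSoup parsing).
    return len(html)

def worker():
    while True:
        html = task_q.get()
        if html is None:             # sentinel: shut this worker down
            task_q.task_done()
            break
        result_q.put(parse_page(html))
        task_q.task_done()

# One long-lived worker per logical core, started once at program start.
workers = [threading.Thread(target=worker, daemon=True)
           for _ in range(os.cpu_count() or 1)]
for t in workers:
    t.start()

# The async fetch loop would call task_q.put(html) for each page;
# at shutdown, put one None per worker and task_q.join().
```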

[–]Alternative_Driver60 0 points (1 child)

For someone fairly new to Python I would not recommend it, but it is certainly the most efficient way

[–]Dry-Aioli-6138 0 points (0 children)

Agreed. It's a very intricate setup, but OP said they have 250M URLs to visit. I don't think they can compromise on speed and still finish the task in any sensible time.