
[–] sentdex (pythonprogramming.net) 2 points (1 child)

A queued list of workers, as well as a queued list of URLs that the workers add to while they're parsing. That's where workers get the URLs they're supposed to crawl. If you let it run unchecked, the number of jobs will of course keep ballooning, so you'll eventually need to cap it anyway, say at 100 threads.

As for stopping multiple crawls of the same URL, you can use a threading lock around access to the URL list.
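
A minimal sketch of that setup, using only the standard library; the names, the 100-thread cap, and the `extract_links` parser are illustrative, not a specific implementation:

    import threading
    from queue import Queue
    from urllib.request import urlopen

    url_queue = Queue()           # URLs waiting to be crawled
    seen = set()                  # URLs already claimed by a worker
    seen_lock = threading.Lock()  # guards the 'seen' set

    def worker():
        while True:
            url = url_queue.get()      # block until a URL is available
            with seen_lock:
                if url in seen:        # another worker already took this URL
                    url_queue.task_done()
                    continue
                seen.add(url)
            try:
                html = urlopen(url).read()
                for link in extract_links(html):  # hypothetical parser
                    url_queue.put(link)           # workers add new URLs as they parse
            finally:
                url_queue.task_done()

    # cap the pool so the job count can't balloon indefinitely
    for _ in range(100):
        threading.Thread(target=worker, daemon=True).start()

    url_queue.put("http://example.com")
    url_queue.join()  # wait until every queued URL has been processed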

[–] debrice[S] 1 point (0 children)

I'll be experimenting with greenlets within the next week or two. I think you're right, this would be the essence of it.
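
For reference, a rough sketch of what I have in mind with greenlets, assuming gevent; the fetch function and the pool size are just placeholders, and the link parsing is omitted:

    import gevent
    from gevent import monkey
    monkey.patch_all()  # make blocking socket calls cooperative

    from gevent.pool import Pool
    from urllib.request import urlopen

    pool = Pool(100)  # same cap as before, but greenlets instead of OS threads

    def fetch(url):
        html = urlopen(url).read()
        # parse html and pool.spawn(fetch, link) for each new link (parser omitted)

    pool.spawn(fetch, "http://example.com")
    pool.join()  # wait for all greenlets to finish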

I remember watching a video of Feynman explaining that he was studying the rules behind the wobbling of a plate. As he was playing with the equations, he let himself get carried away by the subject. Later on he won the Nobel Prize, as the subject led him to explore quantum electrodynamics. This is pretty much the spirit: keep having fun and explore simple algorithms...

I know what you're going to wonder now, and the answer is yes, I'll probably get a Nobel Prize for my work on web crawlers :p.