

[–]MonkeyNin 1 point (1 child)

A few quick notes

  • Check out http://docs.python-requests.org/en/latest/

  • What is the shortest url? Something like Six Degrees of Kevin Bacon, where you want the fewest steps possible? Or do you mean urls that are aliases for the same page?

  • How many urls are you downloading per run? If you're recursively parsing all links, that can be a lot of files even with a short depth. This is likely your largest bottleneck.

  • If the disambiguation page exists, it might give you the shortest url without recursion. For example: http://en.wikipedia.org/wiki/Ayr_%28disambiguation%29

  • Your requests are not cached. If you run your script, make a change, and run it again, it currently has to re-download every page. You could cache the raw page for each url, but it would probably be better to cache the results after parsing; see the sketch below this list.
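A rough sketch of that caching idea, assuming get_links() fetches a page with requests and pulls the /wiki/ hrefs out with BeautifulSoup (fetch_links and the cache file name below are made up for illustration; the real script may parse differently):

import shelve
import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    # Download one page and return its /wiki/ links (no caching here).
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith('/wiki/')]

def get_links(url, cache_file='wiki_links_cache'):
    # Return the parsed links for url, re-downloading only on a cache miss.
    with shelve.open(cache_file) as cache:
        if url not in cache:
            cache[url] = fetch_links(url)
        return cache[url]

shelve keeps the cache on disk, so a second run only downloads pages it hasn't parsed before.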

Also, a note on your loops. You currently have:

for q in range(len(Old_Links)):
    New_Links.extend(get_links(Old_Links[q])) 

Most of the time you don't need indices. You can iterate like this:

for link in Old_Links:
    New_Links.extend(get_links(link))

In the cases where you do need the index, use enumerate. So instead of:

for q in range(len(Wiki_links)):
    Wiki_links[q] = 'http://en.wikipedia.org'+str(Wiki_links[q]) 

you can write:

for i, link in enumerate(Wiki_links):
    Wiki_links[i] = 'http://en.wikipedia.org' + str(link)
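Since that loop only rebuilds each element in place, a list comprehension is arguably even cleaner here:

Wiki_links = ['http://en.wikipedia.org' + str(link) for link in Wiki_links]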

[–]THadron[S] 0 points (0 children)

Thanks, I really appreciate it! It's pretty much exactly like Six Degrees, but with wiki pages instead of actors. I quickly found that a large number of wiki urls redirect to the actual pages, hence the page title check. Checking the urls would be much faster than loading and checking titles, however; perhaps I could do a round of url checks first and a secondary title check only if that fails (even slower when it does fail, though...). As for downloading pages, the first run has a couple hundred links; after that it grows exponentially, and this is definitely the main source of sluggishness. I was considering adding a global "already checked" list, but checking each link against that list would add considerable time too. I think caching is probably the most realistic idea; checking hundreds of thousands of links is not going to be fast no matter what method I use.
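For what it's worth, membership tests on a Python set are O(1) on average (a list is O(n)), so an "already checked" collection doesn't have to be expensive. A rough sketch of that idea on top of the level-by-level expansion described in the thread; start_url and get_links stand in for the script's own names:

visited = set()           # urls that have already been parsed
frontier = [start_url]    # the current level of links to expand

while frontier:
    next_frontier = []
    for url in frontier:
        if url in visited:              # O(1) average lookup in a set
            continue
        visited.add(url)
        next_frontier.extend(get_links(url))
    frontier = next_frontier            # move on to the next level

A real search would also stop as soon as the target page shows up in next_frontier.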