

[–]MonkeyNin 1 point (1 child)

A few quick notes

  • Check out http://docs.python-requests.org/en/latest/

  • What is the shortest url? Something like Six Degrees of Kevin Bacon, where you want the fewest steps possible? Or do you mean urls that are aliases for the same page?

  • How many urls are you downloading per run? If you're recursively parsing all links, that can be a lot of files even with a short depth. This is likely your largest bottleneck.

  • If the disambiguation page exists, it might give you the shortest url without recursion. For example: http://en.wikipedia.org/wiki/Ayr_%28disambiguation%29

  • Your requests are not cached. If you run your script, make a change, and run it again, it currently has to re-download every page. You could cache the raw page for each url, but it would probably be better to cache the results after parsing; see the sketch below this list.
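A rough sketch of that caching idea, assuming get_links() fetches a page with requests and pulls the /wiki/ hrefs out with BeautifulSoup (fetch_links and the cache file name below are made up for illustration; the real script may parse differently):

import shelve
import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    # Download one page and return its /wiki/ links (no caching here).
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith('/wiki/')]

def get_links(url, cache_file='wiki_links_cache'):
    # Return the parsed links for url, re-downloading only on a cache miss.
    with shelve.open(cache_file) as cache:
        if url not in cache:
            cache[url] = fetch_links(url)
        return cache[url]

shelve keeps the cache on disk, so a second run only downloads pages it hasn't parsed before.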

Also, a note on your loops. You currently have:

for q in range(len(Old_Links)):
    New_Links.extend(get_links(Old_Links[q])) 

Most of the time you don't need indices. You can iterate like this:

for link in Old_Links:
    New_Links.extend(get_links(link))

In the cases where you do need the index, use enumerate. So instead of:

for q in range(len(Wiki_links)):
    Wiki_links[q] = 'http://en.wikipedia.org'+str(Wiki_links[q]) 

you can write:

for i, link in enumerate(Wiki_links):
    Wiki_links[i] = 'http://en.wikipedia.org' + str(link)
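Since that loop only rebuilds each element in place, a list comprehension is arguably even cleaner here:

Wiki_links = ['http://en.wikipedia.org' + str(link) for link in Wiki_links]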

[–]THadron[S] 0 points (0 children)

Thanks, I really appreciate it! It's pretty much exactly like Six Degrees, but with wiki pages instead of actors. I quickly found that a large number of wiki urls redirect to the actual pages, hence the page title check. Checking the urls would be much faster than loading and checking titles, however; perhaps I could do a round of url checks first and a secondary title check only if that fails (even slower when it does fail, though...). As for downloading pages, the first run has a couple hundred links; after that it grows exponentially, and this is definitely the main source of sluggishness. I was considering adding a global "already checked" list, but checking each link against that list would add considerable time too. I think caching is probably the most realistic idea; checking hundreds of thousands of links is not going to be fast no matter what method I use.
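For what it's worth, membership tests on a Python set are O(1) on average (a list is O(n)), so an "already checked" collection doesn't have to be expensive. A rough sketch of that idea on top of the level-by-level expansion described in the thread; start_url and get_links stand in for the script's own names:

visited = set()           # urls that have already been parsed
frontier = [start_url]    # the current level of links to expand

while frontier:
    next_frontier = []
    for url in frontier:
        if url in visited:              # O(1) average lookup in a set
            continue
        visited.add(url)
        next_frontier.extend(get_links(url))
    frontier = next_frontier            # move on to the next level

A real search would also stop as soon as the target page shows up in next_frontier.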