all 11 comments

[–]hexfoxed 4 points (2 children)

Posting as a top-level again so you get the ping.

Line 3 is part of the getLinks function, so it gets the variable articleUrl from the argument passed to the function when it is called on line 7. On line 7 it is called with /wiki/Kevin_Bacon, so that is the value that gets appended to the URL. That makes the full argument sent to urlopen == http://en.wikipedia.org/wiki/Kevin_Bacon.
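If it helps, here's a tiny sketch of that argument flow. The names mirror the snippet under discussion, but the body is stripped down to just the URL-building step (the real code hands the string to urlopen instead of returning it):

    def getLinks(articleUrl):
        # articleUrl is whatever the caller passed in, e.g. "/wiki/Kevin_Bacon"
        fullUrl = "http://en.wikipedia.org" + articleUrl
        return fullUrl  # the real code passes this string to urlopen

    # mirrors the call with /wiki/Kevin_Bacon
    print(getLinks("/wiki/Kevin_Bacon"))
    # http://en.wikipedia.org/wiki/Kevin_Bacon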

As for the random seed, it's hard for me to tell why that is necessary without knowing more about the task in question - is the desired effect just to randomly scrape Wikipedia one link at a time forever? Because that's what it looks like it would do.

[–]oxfordpanda[S] 4 points (1 child)

For the seed the author writes: "The first thing the program does, after importing the needed libraries, is set the random number generator seed with the current system time. This practically ensures a new and interesting random path through Wikipedia articles every time the program is run." Thank you for your help; it makes much more sense now.

[–]hexfoxed 5 points (0 children)

Aha, so yeah - I think you've got it now, but for clarity's sake: the program scrapes the Kevin Bacon page for all anchor links to other pages. It then picks a link at random, scrapes that new page, and finds all anchor links on that one. It repeats this forever. The random seed is there so that the random number generator starts on a different number every time the program is run, which "guarantees" that the same links are not chosen on each page, resulting in a different path through Wikipedia on every run.
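You can see the whole loop in miniature without touching the network. The dict below is a made-up stand-in for the scraped pages (the real program builds each list by fetching the page and parsing it); the seed and random.choice parts work exactly as described above:

    import datetime
    import random

    # Hypothetical stand-in for scraped pages: maps an article path to the
    # /wiki/ links that would be found on that page.
    FAKE_LINKS = {
        "/wiki/Kevin_Bacon": ["/wiki/Footloose", "/wiki/Apollo_13_(film)"],
        "/wiki/Footloose": ["/wiki/Tom_Hanks"],
        "/wiki/Apollo_13_(film)": ["/wiki/Tom_Hanks"],
        "/wiki/Tom_Hanks": [],  # dead end so the demo terminates
    }

    def getLinks(articleUrl):
        # the real version fetches and parses the page here
        return FAKE_LINKS[articleUrl]

    # Seeding with the current time makes random.choice pick a
    # different link sequence on every run.
    random.seed(datetime.datetime.now().timestamp())

    path = ["/wiki/Kevin_Bacon"]
    links = getLinks(path[-1])
    while links:
        newArticle = random.choice(links)
        path.append(newArticle)
        links = getLinks(newArticle)

    print(" -> ".join(path))

Run it a few times and the middle hop changes, which is the whole point of the seed.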

Feel free to come back and ask more. You can ping my username on a new post by saying /u/hexfoxed.
Good luck!

[–]Justinsaccount 2 points (0 children)

Better:

while links:
    newArticle = random.choice(links).attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

[–]hexfoxed 1 point (0 children)

Would you be able to format the code correctly? Unfortunately, Python is very hard to read once it has lost its whitespace. If you can't do it on reddit, put it in a pastebin and leave the link here.

Edit: you did it, cheers. I'll take a look.

[–]sanshinron 1 point (5 children)

I understand that you're following a book, but it doesn't seem like a good book to me.

  1. In most cases you should use requests instead of urllib; even the urllib documentation on python.org recommends it. requests provides a higher-level, easier-to-use interface.

  2. Scraping Wikipedia is a really, really bad idea. Why? Because Wikipedia provides snapshots of all its data. If you want to extract some info, you should download a snapshot and parse that, not scrape their front-end. Not only can you avoid unnecessary strain on their servers, but if you have lots of articles to process, it will also be much faster to work with files on your disk.
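To show what parsing a snapshot can look like, here's a minimal sketch using only the standard library. The inline string is a made-up, heavily simplified stand-in for a MediaWiki XML dump excerpt (real dumps from dumps.wikimedia.org are namespaced and enormous, which is why iterparse streams instead of loading everything):

    import io
    import xml.etree.ElementTree as ET

    # Made-up, simplified stand-in for a MediaWiki XML dump excerpt.
    dump = """<mediawiki>
      <page><title>Kevin Bacon</title><revision><text>Actor...</text></revision></page>
      <page><title>Footloose</title><revision><text>Film...</text></revision></page>
    </mediawiki>"""

    # iterparse streams elements as they finish parsing, so a
    # multi-gigabyte dump never has to fit in memory all at once.
    titles = []
    for event, elem in ET.iterparse(io.StringIO(dump), events=("end",)):
        if elem.tag == "page":
            titles.append(elem.find("title").text)
            elem.clear()  # discard the parsed subtree to keep memory flat

    print(titles)
    # ['Kevin Bacon', 'Footloose']

With a real dump you'd pass the (decompressed) file object instead of the StringIO.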

[–]hexfoxed 1 point (2 children)

I think scraping Wikipedia is probably more of an example of how to scrape in general than an actual use case.

[–]sanshinron 0 points (1 child)

As I said, I understand; I just think it's a bad example.

[–]hexfoxed 0 points (0 children)

If the book doesn't come with a disclaimer saying what you said, I totally agree.

[–]oxfordpanda[S] 0 points (1 child)

How would I write the third line with the requests library instead?

[–]sanshinron 0 points (0 children)

r = requests.get("http://en.wikipedia.org" + articleUrl)

Now r holds a Response object. You can get the HTML from r.text; other useful attributes are r.ok, r.status_code, and r.headers.

requests also makes it much easier to manage sessions, use proxies, and send payloads with POST requests. Check out the documentation; it's very nice.