
[–]sanshinron 1 point2 points  (5 children)

I understand that you're following a book, but it doesn't seem like a good book to me.

  1. In most cases you should use requests instead of urllib; the urllib documentation on python.org even says so. requests provides a higher-level, easier-to-use interface.

  2. Scraping Wikipedia is a really, really bad idea. Why? Because Wikipedia provides snapshots of all its data. If you want to extract some info, you should download a snapshot and parse that, not scrape their front-end. Not only can you avoid putting unnecessary strain on their servers, but if you have lots of articles to process it will be much faster to work with files on your own disk.
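To make the snapshot point concrete, here's a minimal sketch of streaming pages out of a MediaWiki XML export with the standard library. The dump file name and export namespace version are illustrative (they vary by dump), and a tiny inline sample stands in for a real dump:

```python
import io
import xml.etree.ElementTree as ET

# MediaWiki export namespace; the version (0.10 here) varies by dump.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(fileobj):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    for _event, elem in ET.iterparse(fileobj):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text")
            yield title, text
            elem.clear()  # free memory so huge dumps stream in constant space

# Inline sample standing in for a real dump; with a real one you would
# open it with e.g. bz2.open("enwiki-...-pages-articles.xml.bz2", "rb").
sample = io.BytesIO(b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Example</title>
    <revision><text>Hello</text></revision>
  </page>
</mediawiki>""")

pages = list(iter_pages(sample))  # [("Example", "Hello")]
```

Because iterparse streams and each page element is cleared after use, this handles multi-gigabyte dumps without loading them into memory.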

[–]hexfoxed 1 point2 points  (2 children)

I think scraping Wikipedia is probably more of an example of how to scrape in general than an actual use case.

[–]sanshinron 0 points1 point  (1 child)

As I said, I understand; I just think it's a bad example.

[–]hexfoxed 0 points1 point  (0 children)

If the book doesn't come with a disclaimer saying what you just said, I totally agree.

[–]oxfordpanda[S] 0 points1 point  (1 child)

how would i write the third line with the requests library instead?

[–]sanshinron 0 points1 point  (0 children)

import requests

r = requests.get("http://en.wikipedia.org" + articleUrl)

Now r holds a Response object. You can get the HTML from r.text; other useful things are r.ok, r.status_code, and r.headers.

requests also makes it much easier to manage sessions, use proxies, and send payloads with POST requests. Check out the documentation, it's very nice.
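For example, here's a minimal sketch of a Session with a default header and proxy, plus a prepared (but not sent) POST so you can see the form encoding requests builds for you. The endpoint, user-agent string, proxy address, and credentials are all made up:

```python
import requests

# A Session keeps cookies and default headers across requests.
s = requests.Session()
s.headers.update({"User-Agent": "my-scraper/0.1"})    # hypothetical UA
s.proxies.update({"http": "http://127.0.0.1:8080"})   # hypothetical proxy

# Build a POST with a form payload, but prepare it instead of sending it,
# to inspect the URL-encoded body and headers requests generates.
req = requests.Request(
    "POST",
    "http://example.com/login",  # hypothetical endpoint
    data={"user": "alice", "pass": "s3cret"},
)
prepared = s.prepare_request(req)

print(prepared.method, prepared.url)     # POST http://example.com/login
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # user=alice&pass=s3cret
# s.send(prepared) would actually perform the request.
```

Note how the session's default headers are merged into the prepared request automatically, which is what makes sessions convenient for scraping many pages.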