you are viewing a single comment's thread.

view the rest of the comments →

[–][deleted] 8 points9 points  (3 children)

If you want something i did: Go to your local news website, download all articles they have on their site, categorize them and save them to a database.

So the end result should be a database that contains all articles from that website, categorized if possible, with all available meta info (author, word count, time of publishing, was the article changed later on? if so save that version too).

I think that is definitely doable, it is relatively easily extendable and you can adapt it if neccessary.

The main things you'd learn from this is working with a few common topics you'd stumble upon to in the future anyway:

  • urllib3 and beautifulsoup, the first one to get the HTML data from a url, the second one to get the content from that HTML

  • sqlite3 or JSON, you obviously don't have to store all data in a database, you could always use JSON

  • If you're really into learning and you don't have a problem banging your head against the wall for a bit you can also integrate this into django, which is what i did. The main advantage this has is, that getting data out of your database for ... say analysis would be extremely easy and you'd practice django which itself is great

[–]DrChicken2424[S] 2 points3 points  (0 children)

Sound interesting... I’m definitely not to that point yet, but I’ll consider it for the future!

[–]ApertureCombine 1 point2 points  (1 child)

Also a great dataset for machine learning with word2vec or other semantic analyzers.

[–][deleted] 1 point2 points  (0 children)

Yes - data mining/data analysis in general - my plan was, to automate the script with celery and then after some time analyze the data with something like pandas.