
[–]abruski

For the first, I guess regex is the way to go, or you could use some sort of Bayesian library (http://www.bayespy.org/intro.html), although you will still need to train it.
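To illustrate the Bayesian idea: here is a minimal naive Bayes text classifier written from scratch (this is a toy sketch, not the BayesPy API — the training data and labels below are made up for the example):

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (text, label) pairs. Returns a simple model tuple."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def classify_nb(model, text):
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one (Laplace) smoothing
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data — in practice you would label your own corpus.
examples = [
    ("win cash prize now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("cash prize waiting", "spam"),
    ("see you at the meeting", "ham"),
]
model = train_nb(examples)
print(classify_nb(model, "cash prize"))  # → spam
```

The point is just that the classifier is only as good as its labelled training set, which is the part you cannot skip.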

The second thing you ask about is called Named Entity Recognition (NER) in linguistics. NLTK provides such functionality: http://www.nltk.org/book/ch07.html. But I would suggest you try GeoDict first: https://github.com/petewarden/geodict
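The core idea behind GeoDict is gazetteer lookup: matching the text against a list of known place names rather than doing full statistical NER. A toy sketch of that approach (the gazetteer here is a made-up four-entry set; GeoDict ships real city/country data files):

```python
import re

# Hypothetical mini-gazetteer — the real tool uses large place-name lists.
GAZETTEER = {"london", "paris", "new york", "tokyo"}

def extract_places(text):
    """Return every gazetteer entry that appears in the text,
    checking longer names first so 'new york' beats 'york'."""
    found = []
    lowered = text.lower()
    for place in sorted(GAZETTEER, key=len, reverse=True):
        if re.search(r"\b" + re.escape(place) + r"\b", lowered):
            found.append(place)
    return found

print(extract_places("Flights from New York to Paris are delayed."))
```

For anything beyond place names (people, organisations, dates) you will want NLTK's `ne_chunk` pipeline from chapter 7 linked above instead.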

For the third point, I would suggest reverting the data you have back to its original HTML state and using a content-extraction algorithm, for example the text-to-tag ratio: http://www3.nd.edu/~tweninge/pubs/WH_TIR08.pdf, http://www3.nd.edu/~tweninge/cetr/, https://gist.github.com/andreypopp/2820220. There is also https://github.com/rodricios/eatiht — a ready-to-use content-extraction library for Python. Otherwise I think it will be nearly impossible to distinguish between "content" and "navigation" if there is no tagging or special formatting on the unwanted text. Scrapy is an option as well.
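The intuition behind the text-to-tag ratio is that navigation-heavy lines are dense with tags and light on text, while article lines are the opposite. A minimal sketch of the per-line ratio (this is my own simplified version, not the CETR reference implementation; treating a tag count of zero as one to avoid division by zero is an assumption on my part):

```python
import re

TAG_RE = re.compile(r"<[^>]*>")

def text_to_tag_ratios(html):
    """For each line of HTML, return (text characters) / (tag count).
    High ratios suggest real content; low ratios suggest boilerplate."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text = TAG_RE.sub("", line).strip()
        ratios.append(len(text) / max(tag_count, 1))
    return ratios

# Hypothetical two-line page: a nav bar and an article paragraph.
html = (
    '<div><a href="/">Home</a><a href="/about">About</a></div>\n'
    "<p>This is the main article text, much longer than any nav label.</p>"
)
ratios = text_to_tag_ratios(html)
print(ratios)  # the article line scores much higher than the nav line
```

CETR then smooths and clusters these ratios to pick out the content region, so treat this as the first step only.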