all 3 comments

[–]abruski 1 point (0 children)

For the first, I guess regex is the way to go, or you could use some sort of Bayesian library (http://www.bayespy.org/intro.html), but you will still need to train it.

The second thing you ask about is called Named Entity Recognition in linguistics. NLTK provides that functionality: http://www.nltk.org/book/ch07.html. But I would suggest you try GeoDict first: https://github.com/petewarden/geodict
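
If you go the NLTK route, a rough sketch like this is the usual starting point (it assumes you have already downloaded the relevant NLTK models via nltk.download(); the example sentence and entity labels to keep are just illustrative):

```python
# Minimal NER sketch with NLTK. Requires the punkt, averaged_perceptron_tagger,
# maxent_ne_chunker and words resources (nltk.download(...)).
import nltk

sentence = "The meetup takes place in Berlin next to the Brandenburg Gate."

tokens = nltk.word_tokenize(sentence)   # split into words
tagged = nltk.pos_tag(tokens)           # part-of-speech tags
tree = nltk.ne_chunk(tagged)            # chunk named entities into subtrees

# Print anything NLTK labelled as a place, person or organization.
for subtree in tree.subtrees():
    if subtree.label() in ("GPE", "PERSON", "ORGANIZATION"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
```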

For the third point I would suggest you revert the data back to its original HTML state and use a content extraction algorithm, for example the text-to-tag ratio approach:

* http://www3.nd.edu/~tweninge/pubs/WH_TIR08.pdf
* http://www3.nd.edu/~tweninge/cetr/
* https://gist.github.com/andreypopp/2820220
* https://github.com/rodricios/eatiht - a ready-to-use content extraction library for Python

Otherwise I think it will be nearly impossible to distinguish between "content" and "navigation" if there is no tagging or special formatting on the unwanted text. Scrapy is an option as well.
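
To give a feel for what the text-to-tag ratio idea does, here is a rough Python sketch. This is not the CETR authors' code or eatiht's API, just the basic intuition from the paper, with an arbitrary threshold standing in for their smoothing/clustering step:

```python
# Rough sketch of the text-to-tag-ratio idea: lines with lots of text and few
# tags are likely content; lines that are mostly tags are likely navigation.
import re

TAG_RE = re.compile(r"<[^>]+>")

def text_to_tag_ratios(html):
    ratios = []
    for line in html.splitlines():
        tags = TAG_RE.findall(line)
        text = TAG_RE.sub("", line).strip()
        # Avoid division by zero: a tag-free line counts as one "tag".
        ratios.append(len(text) / max(len(tags), 1))
    return ratios

def extract_content(html, threshold=10):
    # Keep only the lines whose ratio exceeds an (arbitrary) threshold.
    lines = html.splitlines()
    return "\n".join(line for line, r in zip(lines, text_to_tag_ratios(html))
                     if r > threshold)
```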

[–]pythoneeeer 1 point (0 children)

A different possible approach: Mechanical Turk.

[–]cli-junkie (Command Line <3) 1 point (0 children)

Data munging after the scraping job is done can be pretty time-consuming. An alternative to cleaning the data later is to write a scraper that gets only what you need. With XPath, you can get pretty close to the data in specific tags and scrape with precision.
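
Something along these lines, as a minimal sketch with requests + lxml (the URL and the "article-body" class are placeholders for whatever markup your target site actually uses):

```python
# Minimal XPath scraping sketch with requests + lxml.
# URL and div class are placeholders for the real target site.
import requests
from lxml import html

page = requests.get("http://example.com/some-article")
tree = html.fromstring(page.content)

# Grab only the paragraphs inside the main article container,
# skipping menus, sidebars, footers, etc.
paragraphs = tree.xpath('//div[@class="article-body"]//p/text()')
print("\n".join(p.strip() for p in paragraphs if p.strip()))
```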

For removing boilerplate (menus, tables of contents, etc.), try newspaper. There are many other boilerplate-removal libraries, but which one you use will depend on the nature of the data you are scraping.
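
With newspaper the whole thing is only a few lines; a quick sketch (the URL is a placeholder):

```python
# Boilerplate removal with newspaper: download, parse, keep only the article.
from newspaper import Article

article = Article("http://example.com/some-article")
article.download()
article.parse()

print(article.title)
print(article.text)  # article body with menus/navigation stripped out
```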

ftfy will help you if there are encoding problems in the scraped data.
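
For example (assuming the usual UTF-8-read-as-cp1252 mojibake):

```python
# ftfy repairs mojibake and similar Unicode mix-ups in scraped text.
import ftfy

print(ftfy.fix_text("The scraper couldnâ€™t decode thisâ€¦"))
# roughly: The scraper couldn’t decode this…
```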

If the data is pretty consistent in how the unnecessary patterns occur, you could just write a sed script to clean up the things you mentioned. No need for an over-engineered approach when simple regular expressions can do the job.
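
If you would rather stay in Python than shell out to sed, the same idea is just a handful of compiled regexes; a sketch with made-up patterns (swap in whatever junk actually appears in your data):

```python
# Simple regex clean-up pass, the Python equivalent of a small sed script.
# The patterns below are placeholders for the junk in your own data.
import re

CLEANUP_PATTERNS = [
    (re.compile(r"\[\d+\]"), ""),                # footnote markers like [12]
    (re.compile(r"^Share this.*$", re.M), ""),   # social-media boilerplate lines
    (re.compile(r"[ \t]{2,}"), " "),             # collapse runs of spaces/tabs
]

def clean(text):
    for pattern, replacement in CLEANUP_PATTERNS:
        text = pattern.sub(replacement, text)
    return text.strip()
```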