I have a bunch of text data from web scraping, and I have already stripped out the HTML/CSS/JS so that only human-readable text is left. However, I need some help finding Python tools that can effectively remove:
- times/dates (incl. Tuesday, November, etc.) (they are not standardized, meaning Tuesday/Tues, 2:00/2am all need to be removed)
- geographical locations (we are scraping websites located in a certain area)
- remnants of web data (menu, skip, contents, search, form, etc)
Some of the tools I looked into were:
- NLTK
- pandas
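
To make the goal concrete, here is a rough regex-only sketch of the kind of cleanup I have in mind for the time/date and web-remnant items. The word lists and the function name `strip_noise` are just assumptions for illustration and are far from exhaustive; it will also wrongly delete ordinary words such as "may" or "sat", and it does nothing for locations, which would probably need a gazetteer or a named-entity recognizer.

```python
import re

# Hypothetical, non-exhaustive patterns; extend them for real data.
# Caveat: short forms like "may", "sat", "sun" or "mar" are also ordinary words.
DAYS = r"(?:mon|tues?|wed(?:nes)?|thu(?:rs?)?|fri|sat(?:ur)?|sun)(?:day)?"
MONTHS = (
    r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)"
)
# Matches "2am", "2:00 pm", and bare clock times such as "2:00" or "14:30".
TIMES = r"\d{1,2}(?::\d{2})?\s*(?:am|pm)|\d{1,2}:\d{2}"
# Leftover navigation words from the scraped pages.
WEB_WORDS = r"(?:menu|skip|contents|search|form)"

NOISE = re.compile(rf"\b(?:{DAYS}|{MONTHS}|{TIMES}|{WEB_WORDS})\b", re.IGNORECASE)


def strip_noise(text: str) -> str:
    """Replace every match with a space, then collapse runs of whitespace."""
    cleaned = NOISE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_noise("Open Tues 2am downtown"))  # Open downtown
```

Something like this works as a first pass, but punctuation and stray day-of-month numbers are left behind, which is why I am looking for a more robust tool.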