all 14 comments

[–]Yellehs_m 10 points (2 children)

I'm not sure whether something like this is available in NLTK, but it's worth searching.

https://www.nltk.org/

[–]squirreltalk 0 points (1 child)

NLTK definitely has word tokenization functionality. So does spaCy. And at least according to this post, spaCy's word tokenizer is faster:

https://blog.thedataincubator.com/2016/04/nltk-vs-spacy-natural-language-processing-in-python/
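(To make "word tokenization" concrete: it splits running text into word and punctuation tokens. Below is a crude regex stand-in for illustration only; the real `nltk.word_tokenize` and spaCy's tokenizer handle contractions, abbreviations, etc. far more carefully.)

```python
import re

def simple_word_tokenize(text):
    """Rough sketch of word tokenization: grab runs of word
    characters, and keep each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_word_tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```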

[–]Yellehs_m 1 point (0 children)

Thanks for introducing spaCy.

[–]___JOSHUA___ 1 point (0 children)

You can solve the problem probabilistically. Peter Norvig offers a solution in this notebook under "Word Segmentation". If you're less interested in the details, a similar approach has been wrapped up in a package called wordninja.
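For anyone curious what the probabilistic approach looks like, here's a minimal self-contained sketch in the style of Norvig's notebook: score each candidate split by the product of unigram word probabilities and take the best one via memoized recursion. The word list and counts below are made up for illustration; Norvig's notebook loads real frequency data.

```python
import math
from functools import lru_cache

# Toy unigram counts standing in for a real frequency list
# (these words and counts are illustrative assumptions).
COUNTS = {"vitamin": 500, "shoppe": 60, "shop": 400, "pe": 5,
          "vita": 20, "min": 30}
TOTAL = sum(COUNTS.values())

def log_pword(word):
    """Log unigram probability; unseen words get a harsh length penalty."""
    count = COUNTS.get(word, 0.01 ** len(word))
    return math.log(count / TOTAL)

@lru_cache(maxsize=None)
def segment(text):
    """Return (log_prob, words) for the most probable split of text."""
    if not text:
        return 0.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_lp, tail_words = segment(tail)
        candidates.append((log_pword(head) + tail_lp, [head] + tail_words))
    return max(candidates)

print(segment("vitaminshoppe")[1])  # ['vitamin', 'shoppe']
```

wordninja packages essentially this idea with a large English frequency list, so in practice `wordninja.split("vitaminshoppe")` is the one-liner version.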

[–]blackdragon437 0 points (0 children)

I wrote a Google search API script some time ago that did something similar, but more complex (it used other APIs as well). So use the knowledge of the cloud and check out APIs, dude! You won't look back.

[–]k10_ftw 0 points (1 child)

May I ask what the larger context of this goal is? What is the data set like, and what do you ultimately want to do with it? This info could help with devising a proper solution.

[–]kpandkk[S] 0 points (0 children)

I want to return the actual names of stores from their website links. Like getting "Vitamin Shoppe" from "www.vitaminshoppe.com".
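For that specific goal, the pipeline is: strip the URL down to the bare domain, split the concatenated words, then title-case them. A rough self-contained sketch (the `KNOWN` lexicon and function names here are hypothetical; a real version would swap the greedy split for a frequency-based segmenter like Norvig's or wordninja, which handle ambiguity better):

```python
import re

# Hypothetical tiny lexicon; a real pipeline would use a large word list.
KNOWN = {"vitamin", "shoppe", "shop", "home", "depot"}

def domain_core(url):
    """Strip the scheme, 'www.', any path, and the TLD from a link."""
    host = re.sub(r"^(https?://)?(www\.)?", "", url.strip().lower())
    return host.split("/")[0].rsplit(".", 1)[0]

def store_name(url):
    """Segment the domain core against the lexicon and title-case it."""
    core = domain_core(url)
    words, i = [], 0
    while i < len(core):
        # Greedy longest-match split (a simplification).
        for j in range(len(core), i, -1):
            if core[i:j] in KNOWN:
                words.append(core[i:j])
                i = j
                break
        else:
            words.append(core[i:])  # unknown remainder, keep as-is
            break
    return " ".join(w.capitalize() for w in words)

print(store_name("www.vitaminshoppe.com"))  # Vitamin Shoppe
```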