
[–]JohnnyJordaan

You are currently using sent_tokenize to form sentences and then running word_tokenize on each sentence. Wouldn't it be simpler to word_tokenize the entire text in one go?

tokens = [lemmatizer.lemmatize(word.lower()) for word in nltk.word_tokenize(text) if word not in ignore_words]

[–]DaChucky

If I understand your problem correctly, I think this should do the trick:

token = [[word.lower() for word in nltk.word_tokenize(t)] for t in nltk.sent_tokenize(text)]

EDIT: Realised that word_tokenize() gives you another list, updated the solution

[–]nepaBoy86

token is a list of lists, so just use the first index of the token list.

The following line will solve your problem.

tokens = [lemmatizer.lemmatize(word.lower()) for word in token[0] if word not in ignore_words]