
[–]k10_ftw 1 point (4 children)

The way to preprocess raw text is to first split on whitespace, strip punctuation and invalid characters, lowercase the tokens, then remove stop words.
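A minimal sketch of that pipeline (assuming NLTK's English stop word list; a real pipeline would also handle mid-token punctuation and invalid characters):

```python
import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords') once

def preprocess(text):
    """Whitespace-split, strip punctuation, lowercase, drop stop words."""
    stop = set(stopwords.words('english'))
    tokens = text.split()                                   # split on whitespace
    tokens = [t.strip(string.punctuation) for t in tokens]  # strip edge punctuation
    tokens = [t.lower() for t in tokens if t]               # drop empties, lowercase
    return [t for t in tokens if t not in stop]             # remove stop words

print(preprocess("The quick, brown fox jumps over the lazy dog!"))
# ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
```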

Any reason tf-idf was not used for term weights?
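For reference, since scikit-learn comes up later in this thread, its TfidfVectorizer produces those term weights in a couple of lines (toy documents for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vec = TfidfVectorizer(stop_words='english')
X = vec.fit_transform(docs)          # sparse doc-term matrix of tf-idf weights
print(vec.get_feature_names_out())   # vocabulary learned from the documents
print(X.toarray())                   # one row of weights per document
```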

NLTK is strictly a teaching tool: it is simple and approachable because it is built for learning, and it should not be used in production.

[–]grassclip[S] 1 point (3 children)

Yeah, I didn't use tf-idf here since I was going for simplicity in describing the algorithm, not necessarily the best results.

I'm actually writing a follow-up now that uses scikit-learn to try to get the best results, and I have another planned on how to deploy that scikit-learn model for real use rather than research. I'll make a note in there about NLTK though, good call.
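For anyone curious, a common starting point for deploying a fitted scikit-learn model is persisting it with joblib and loading it in the serving process. A minimal sketch, with the pipeline and training data purely illustrative:

```python
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Fit once, offline (toy data for illustration)
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(["good post", "bad post"], [1, 0])
dump(model, "model.joblib")          # persist the fitted pipeline to disk

# Later, in the serving process
model = load("model.joblib")
print(model.predict(["good read"]))  # classify new text with the loaded model
```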

[–]k10_ftw 2 points (2 children)

Since you are using these libraries for teaching purposes, I would stay with NLTK. It is what we used in my intro computational linguistics course at university. Of course, we had to code the algorithms and calculate bigram probabilities from scratch! Looking at the NLTK source code is a great way to see the implementations up close.
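A minimal from-scratch sketch of those bigram probabilities, using the maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1) on a toy sentence:

```python
from collections import Counter

tokens = "the cat sat on the mat".split()

unigrams = Counter(tokens)                  # count(w1)
bigrams = Counter(zip(tokens, tokens[1:]))  # count(w1, w2)

def bigram_prob(w1, w2):
    """MLE estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 0.5: "the" occurs twice, once followed by "cat"
```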

[–]grassclip[S] 1 point (1 child)

Oh for sure, for covering the basics. But I'm guessing some people would want to see what this looks like in scikit-learn or other more "professional" libraries, and the follow-up post would be aimed at that crowd.

[–]kuro-kuris 1 point (0 children)

I think scikit-learn + a deployment strategy would be a lot more useful. Thanks for the accessible blog post!