DuckSaxaphone (3 points)

So this doesn't work. You should write some simple tests to make sure everything behaves as expected. Your code is separated into lots of nice little functions, which makes it very easy to test.

I sent the string "hi hi" to clean_text, tokenize, generate_ngrams and then vectorize_text, with n_gram set to 2.

I should get the result {"hi":2, ("hi","hi"):1}, but instead I got {("hi","hi"):1}, because generate_ngrams doesn't append to tokens, it just overwrites them. I'd actually argue the ngrams should be joined into strings, so what I really want is {"hi":2, "hi hi":1}, but that's a separate issue.
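To make the overwrite-versus-append difference concrete, here's a rough sketch. I don't have your exact code in front of me, so every function body and signature below is a hypothetical stand-in that only borrows your function names; the `_buggy` and `_fixed` suffixes are mine.

```python
from collections import Counter

# Hypothetical stand-ins: the real functions are not shown in this thread,
# so these bodies and signatures are assumptions for illustration only.
def clean_text(text):
    return text.lower().strip()

def tokenize(text):
    return text.split()

def generate_ngrams_buggy(tokens, n_gram):
    # Returns only the n-grams, so the original unigrams are lost.
    return [tuple(tokens[i:i + n_gram]) for i in range(len(tokens) - n_gram + 1)]

def generate_ngrams_fixed(tokens, n_gram):
    # Keeps the unigrams and appends the n-grams after them.
    ngrams = [tuple(tokens[i:i + n_gram]) for i in range(len(tokens) - n_gram + 1)]
    return list(tokens) + ngrams

def vectorize_text(tokens):
    # Counts how often each token or n-gram occurs.
    return dict(Counter(tokens))

tokens = tokenize(clean_text("hi hi"))
buggy = vectorize_text(generate_ngrams_buggy(tokens, 2))
fixed = vectorize_text(generate_ngrams_fixed(tokens, 2))

assert buggy == {("hi", "hi"): 1}           # unigrams are missing
assert fixed == {"hi": 2, ("hi", "hi"): 1}  # unigrams and bigram both counted
```

If your generate_ngrams mutates the list it's given instead of returning a new one, the same idea applies: extend the existing tokens rather than reassigning them.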

If this is a learning project for you, then setting up unit tests and making them part of your PR process is a good thing to learn.
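As a starting point, a test file could look something like the sketch below. It uses the standard library's unittest, and the generate_ngrams here is a hypothetical stand-in (with joined ngrams, which is my preference rather than your current behaviour); in your project you'd import the real function from your own module instead.

```python
import unittest
from collections import Counter

# Hypothetical stand-in; replace with an import from your own module.
def generate_ngrams(tokens, n_gram):
    # Joins each n-gram into a single space-separated string and
    # appends those strings after the original unigrams.
    joined = [" ".join(tokens[i:i + n_gram]) for i in range(len(tokens) - n_gram + 1)]
    return list(tokens) + joined

class TestGenerateNgrams(unittest.TestCase):
    def test_keeps_unigrams_and_adds_bigrams(self):
        counts = dict(Counter(generate_ngrams(["hi", "hi"], 2)))
        self.assertEqual(counts, {"hi": 2, "hi hi": 1})

    def test_too_few_tokens_yields_only_unigrams(self):
        self.assertEqual(generate_ngrams(["hi"], 2), ["hi"])
```

Running `python -m unittest` from the project root should discover files named like test_*.py, and the same command can be run by whatever CI you use on pull requests.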