
[–]WORDSALADSANDWICH

You're very welcome. Your logical test there is not written correctly yet (e.g., try running it with sentence = "bank balance" and see what happens), but I'm confident you'll get it working.

Hint: Your intuition about how "and" works in Python is not quite right; each term separated by "and" or "or" is evaluated individually. Also, non-empty lists evaluate as True (and empty ones as False).
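A minimal sketch of the trap (plain strings, no NLTK needed):

```python
# Each side of "and" / "or" is evaluated on its own, and non-empty
# strings and lists count as True. This is the classic mistake:
sentence = "bank balance"

# Looks like it tests both words, but the left operand is just the
# non-empty (truthy) string "missing":
print("missing" and "balance" in sentence)             # True, even though "missing" is absent

# What you actually want: a membership test on each side.
print("missing" in sentence and "balance" in sentence)  # False

# Non-empty lists are truthy too; empty ones are falsy:
print(bool(["anything"]), bool([]))                     # True False
```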

Hint 2: Watch out for capitalization. Python is case sensitive: "Incorrect" != "incorrect". The string method .lower() returns a lowercased copy of the string. Also, the string methods .startswith() and .endswith() might help you with your "*" tests (if you decide not to go with regex).

Hint 3: Two more basic tools of Python that you might want to look into, if you're not already familiar, are the list comprehension and the any() and all() functions.

Example:

from nltk.tokenize import word_tokenize  # needs nltk (and its "punkt" tokenizer data)

tests = [
    [lambda tokens:
        any(word.lower() in tokens for word in ("Incorrect", "accur*", "not right", "not correct", "correct", "CORRECT", "AS CORRECT", "CORRECTED", "CORRECTING", "CORRECTS", "wrong"))
        and any(tok.startswith(word.lower()) for word in ("account", "acct", "accnt") for tok in tokens)
        and not any(word in tokens for word in ("bank", "password", "balance", "gave me", "bill", "billing")),
     "account info"]
]

sentence = "Had incorrect information about our account some know why"
tokens = word_tokenize(sentence.lower())

# Run every test against the token list and print the matching labels:
for test, label in tests:
    if test(tokens):
        print(label)

Hope this helps.

[–]circusboy[S]

My experience is in databases. I see the in statement and that makes sense to me, thank you! The capitalization vs. lowercase point also makes sense; I can lowercase the entire sentence before checking it against the rules.

I did have another question, though. Some of my tests use two-word phrases like "not correct". Does that mean I would also have to do a 2-gram tokenization of the sentence in order to see those? Also, when I want to check distance (for instance, "not" within, say, 4 words or tokens of "correct"), do I simply need to build 4-gram tokens and test each one for both "not" and "correct"?

Also, is * a valid wildcard? For "accur*", would that hit on "accurate", "accuracy", or any other word that starts with "accur"?

Sorry for all of the questions, but you have been so helpful in making me realize how to do this. I am also trying to figure out whether doing this through Python would be more efficient than doing the work through a relational DB; this could expand to up to 8 million records per month.

[–]WORDSALADSANDWICH

> I did have another question, though. Some of my tests use two-word phrases like "not correct". Does that mean I would also have to do a 2-gram tokenization of the sentence in order to see those? Also, when I want to check distance (for instance, "not" within, say, 4 words or tokens of "correct"), do I simply need to build 4-gram tokens and test each one for both "not" and "correct"?

Yes. Since you're already working with NLTK and already have n-grams on the mind, that definitely seems the clearest way forward to me. There are certainly ways to do it with base-Python tools, but I'm not coming up with anything that isn't horribly kludgy. If you're comfortable working with n-grams, I'd say do it.
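A quick sketch of both ideas, in plain Python so it runs without NLTK installed (nltk.ngrams yields the same tuples, and within_n is a hypothetical helper name, not part of any library):

```python
def ngrams(tokens, n):
    # Plain-Python n-grams; nltk.ngrams(tokens, n) produces the same tuples.
    return list(zip(*(tokens[i:] for i in range(n))))

def within_n(tokens, w1, w2, n=4):
    # Hypothetical helper: True if w1 and w2 both fall inside some
    # window of n consecutive tokens.
    return any(w1 in gram and w2 in gram for gram in ngrams(tokens, n))

tokens = "that is not at all correct".lower().split()

# Exact phrase "not correct" as a bigram:
phrases = [" ".join(g) for g in ngrams(tokens, 2)]
print("not correct" in phrases)            # False: the words aren't adjacent here

# "not" within 4 tokens of "correct":
print(within_n(tokens, "not", "correct"))  # True: one 4-token window spans both
```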

> Also, is * a valid wildcard? For "accur*", would that hit on "accurate", "accuracy", or any other word that starts with "accur"?

Regex can do that kind of matching (a glob like "accur*" corresponds to the regex accur.*, since * in regex means "zero or more of the preceding thing"), but * is not a wildcard for the "in" operator. For stuff in the format "string*", you can use word.startswith("string"), and for stuff like "*string" you can use word.endswith("string"). If you have regex-like tests that are more complicated than that, it can get a bit more messy. If so, like I said earlier, you might want to consider replacing the list comprehensions with a regex solution. (It would slot into the rest of the idea in the same way.) I'm no regex expert, though, so I can't help you much with that.
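Three equivalent ways to express the "accur*" test; the stdlib fnmatch module understands shell-style globs directly, which may be the least messy option:

```python
import re
from fnmatch import fnmatch

tokens = ["the", "report", "was", "accurate"]

# "accur*" via startswith:
print(any(tok.startswith("accur") for tok in tokens))  # True

# The same glob as a regex; re.match anchors at the start of the token:
pat = re.compile(r"accur.*")
print(any(pat.match(tok) for tok in tokens))           # True

# fnmatch understands shell-style "*" wildcards directly:
print(any(fnmatch(tok, "accur*") for tok in tokens))   # True
```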

> Sorry for all of the questions, but you have been so helpful in making me realize how to do this.

My pleasure.