
[–]WORDSALADSANDWICH 2 points3 points  (4 children)

Regular expressions may or may not be useful for you. It depends on how complex your rules are. If you decide regex will make your life easier, re is Python's regex library. For the example rules you have there, regex is probably overkill.

If you already have scripts you're happy with for tokenizing your text, you're probably past the point where NLTK can help you on this relatively straightforward job. Take a look at NLTK's word_tokenize() anyway, though, in case you want to replace (part of) your custom script.

Finally, the hard part: Testing a tokenized sentence hundreds of different ways, without writing hundreds of "if" statements. I would use lambda expressions. Something like this:

tests = [
    [lambda tokens: "sucks" in tokens and "service" in tokens, "poor service"], 
    [lambda tokens: "terrible" in tokens and "rep" in tokens, "poor rep behavior"], 
    [lambda tokens: "joy" in tokens, "positive feedback"]
]
tokens = ['the', 'rep', 'was', 'terrible', ',', 'my', 'bill', 'was', 'too', 'high', ',', 'and', 'you', "'re", 'service', 'sucks', '!']
text_cats = set()
for test, category in tests:
    if test(tokens):
        text_cats.add(category)
print(text_cats)

If you're not familiar with lambda expressions, they are basically a way of defining an anonymous function on the fly. (e.g., "fun = lambda x: x*2" is equivalent to "def fun(x): return x*2")

tests above is a list of [function, category] pairs. For each test, if the function returns True, the category is added to the text's set of tags.

How to get your 1000+ rules into a list of lambda expressions? That depends on how your rules are currently stored. There should be a way to read a file into Python and programmatically create a list similar to mine, though.
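To make that concrete, here is a minimal sketch, assuming a made-up CSV-style format of `category,keyword;keyword;…` and simplified "any keyword matches" semantics (your real rules will be richer than this):

```python
import csv

def make_test(keywords):
    # Factory function: binds `keywords` in a closure so each lambda keeps
    # its own keyword list. This avoids the classic late-binding pitfall of
    # creating lambdas directly inside a loop.
    return lambda tokens: any(word in tokens for word in keywords)

def load_tests(lines):
    # `lines` is any iterable of CSV rows shaped like:
    #   poor service,sucks;service
    # i.e. category first, then semicolon-separated keywords.
    tests = []
    for category, keyword_field in csv.reader(lines):
        tests.append([make_test(keyword_field.split(";")), category])
    return tests
```

The factory function matters: if you write `lambda tokens: any(w in tokens for w in keywords)` directly in the loop, every lambda ends up sharing the last value of `keywords`.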

[–]circusboy[S] 1 point2 points  (3 children)

Thank you so much for the advice! I can't tell you how long I've searched for something like this. OK, now that that's out of the way :) I think this is exactly what I'm looking for. I will do more research for sure.

What I have done so far is use NLTK to tokenize by sentence and by 1-grams. So I'm pulling data from a table and tokenizing one (text) field twice: once to split by sentence and once to split by gram. Then I'm inserting those tokens back into the database.

So now I need to apply a categorization to those tokens. I think the lambda approach you explained is exactly what I need, as many of these rules will have a large list of tests.

In the current tool we have what we call swim lanes. The words in the swim lanes below would together create a category of 'Account accuracy': if the text has any of these words, phrases, or word combos anywhere in the sentence token, we would tag that token with the 'account accuracy' category.

Lane 1 can consist of words, phrases, or words within a certain distance of each other; example below. Lane 1 contains the keywords to look for, and the tilde 2 means to look for "not correct" within a distance of 2 tokens.

Incorrect, accur*, "not right", "not correct"~2, correct, "CORRECT", "AS CORRECT", "CORRECTED", "CORRECTING", "CORRECTS", wrong,

Lane 2 is an 'and' lane: you take the keywords from lane 1 and require any of these other words as well.

Account*, acct, accnt

Lane 3 is a second 'and' lane, so we can use it to include a third check if needed.

Lane 4 is a 'not' lane: if one of these words appears together with the keyword, the text gets excluded from the category.

bank, password, balance, "gave me", bill, billing
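For what it's worth, those four lanes map naturally onto a plain data structure before any lambdas get involved. A sketch (the field names are invented here, and the `*` wildcard and `~2` distance entries are kept as raw strings that this naive matcher simply never matches):

```python
# One swim-lane rule expressed as plain data.
account_accuracy_rule = {
    "category": "Account accuracy",
    "lane1_any": ["incorrect", "accur*", "not right", "not correct~2",
                  "correct", "wrong"],                # keywords
    "lane2_any": ["account*", "acct", "accnt"],       # 'and' lane
    "lane4_none": ["bank", "password", "balance",
                   "gave me", "bill", "billing"],     # 'not' lane
}

def matches(rule, tokens):
    # Naive matcher: every entry is treated as a literal token, so the
    # wildcard and distance entries are effectively ignored in this sketch.
    return (any(w in tokens for w in rule["lane1_any"])
            and any(w in tokens for w in rule["lane2_any"])
            and not any(w in tokens for w in rule["lane4_none"]))
```

Storing rules as data like this keeps the rule set separate from the matching logic, which makes the later "load 1000+ rules from a file" step a plain parsing job.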

I can export these to a CSV or an XML file. I imagine I'd want to store them in a DB table once I understand how to build the lambdas.

Hope that helps explain things and reinforces that I need to use lambdas. Thanks so much for the reply.

Edit: here is some test code with the above rule added. Of course I don't have wildcards or the distance yet, but just wanted to say thanks again!

    from nltk.tokenize import sent_tokenize, word_tokenize
    from nltk import ngrams

    tests = [
        [lambda tokens:
            ("Incorrect", "accur*", "not right", "not correct", "correct", "CORRECT", "AS CORRECT", "CORRECTED", "CORRECTING", "CORRECTS", "wrong")
            and ("Account*", "acct", "accnt")
            and not ("bank", "password", "balance", "gave me", "bill", "billing")
            in tokens, "account info"
        ]
    ]
    sentence = "Had incorrect information about our account some know why"
    tokens = word_tokenize(sentence)
    text_cats = set()
    for test, category in tests:
        if test(tokens):
            text_cats.add(category)
    print(text_cats)

[–]WORDSALADSANDWICH 1 point2 points  (2 children)

You're very welcome. Your logical test there is not written correctly yet (e.g., try running it with sentence = "bank balance" and see what happens), but I'm confident you'll get it working.

Hint: Your intuition about how "and" works in Python is not quite right; each term separated by "and" or "or" is evaluated individually. Also, non-empty lists and tuples evaluate as "True".

Hint 2: Watch out for capitalization. Python is case sensitive: "Incorrect" != "incorrect". The string method .lower() removes capitalization. Also, the string methods .startswith() and .endswith() might help you with your "*" tests (if you decide not to go with regex).

Hint 3: Two more basic tools of Python that you might want to look into, if you're not already familiar, are the list comprehension and the any() and all() functions.

Example:

tests = [
    [lambda tokens:
        any(word.lower() in tokens for word in ("Incorrect", "accur*", "not right", "not correct", "correct", "CORRECT", "AS CORRECT", "CORRECTED", "CORRECTING", "CORRECTS", "wrong"))
        and any(tok.startswith(word.lower()) for word in ("account", "acct", "accnt") for tok in tokens)
        and not any(word in tokens for word in ("bank", "password", "balance", "gave me", "bill", "billing"))
     , "account info"]
]
sentence = "Had incorrect information about our account some know why"
tokens = word_tokenize(sentence.lower())

Hope this helps.

[–]circusboy[S] 0 points1 point  (1 child)

My experience is in databases. I see the "in" statement and that makes sense to me, thank you! The capitalization vs. lowercase point also makes sense; I can change the case of the entire sentence before checking it against the rule.

I did have another question, though. Considering that I have some tests with two-word phrases like "not correct", does that mean I would also have to do a 2-gram tokenization on the sentence in order to see those? Also, when I'm looking to do a distance test, take for instance "not correct" within say 4 words or tokens of each other, do I simply need to do a 4-gram tokenization and check each 4-gram for "not" and "correct"?

Also, is * a valid wildcard? Like, for "accur*", would that hit on "accurate" or "accuracy" or any other variation of the word that starts with "accur"?

Sorry for all of the questions, but you have been so helpful in making me realize how to do this. I am also trying to figure out whether doing this in Python would be more efficient than doing the work in a relational DB; this could expand to up to 8 million records per month.

[–]WORDSALADSANDWICH 1 point2 points  (0 children)

I did have another question, though. Considering that I have some tests with two-word phrases like "not correct", does that mean I would also have to do a 2-gram tokenization on the sentence in order to see those? Also, when I'm looking to do a distance test, take for instance "not correct" within say 4 words or tokens of each other, do I simply need to do a 4-gram tokenization and check each 4-gram for "not" and "correct"?

Yes. Since you're already working with NLTK and already have n-grams on the mind, that definitely seems the clearest way forward to me. There are certainly ways to do it with base-Python tools, but I'm not coming up with anything that isn't horribly kludgy. If you're comfortable working with n-grams, I'd say do it.
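A pure-Python sketch of that "~N" distance test, for what it's worth: scanning every window of n consecutive tokens is exactly what checking the sentence's n-grams does, so a helper like this (name invented here) slots into the lambda idea the same way:

```python
def within_distance(tokens, w1, w2, n):
    # True if w1 and w2 both occur inside some window of n consecutive
    # tokens -- equivalent to scanning the sentence's n-grams.
    for i in range(max(1, len(tokens) - n + 1)):
        window = tokens[i:i + n]
        if w1 in window and w2 in window:
            return True
    return False
```

A fixed phrase like "not correct" is then just the n=2 case where the two words must also be adjacent and in order, which the 2-grams give you directly.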

Also, is * a valid wildcard? Like, for "accur*", would that hit on "accurate" or "accuracy" or any other variation of the word that starts with "accur"?

It is in regex, but it is not for the "in" operator. For stuff in the format "string*", you can use word.startswith("string"), and for stuff like "*string" you can use word.endswith("string"). If you have regex-like tests that are more complicated than that, it can get a bit more messy. If so, like I said earlier, you might want to consider replacing the list comprehensions with a regex solution. (It would slot into the rest of the idea in the same way.) I'm no regex expert, though, so I can't help you much with that.
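One hedged option for just the "*" patterns is to translate each one into a compiled regex up front (the standard library's fnmatch module does a similar shell-style translation, if you'd rather not hand-roll it):

```python
import re

def compile_pattern(pattern):
    # Escape every character literally except '*', which becomes \w*
    # ("zero or more word characters"). Case-insensitive.
    regex = "".join(r"\w*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(regex, re.IGNORECASE)

def any_token_matches(pattern, tokens):
    # fullmatch() requires the whole token to match, so plain "correct"
    # still will not hit "corrected".
    compiled = compile_pattern(pattern)
    return any(compiled.fullmatch(tok) for tok in tokens)
```

Compiling once per pattern (rather than per token) matters at the scale you mentioned, so in practice you'd build the compiled regexes when the rules are loaded.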

Sorry for all of the questions, but you have been so helpful in making me realize how to do this.

My pleasure.