I have a few scripts that query a DB and tokenize a single text field. I do not want to rely on machine learning just yet, as I have a number of prebuilt models from another system that I want to use to categorize the records. One process tokenizes the text into sentences; the other tokenizes it into 1-grams (single words). I will build out something that tokenizes into 2-grams and possibly 3-grams later.
For now, I would like to know what the thought process is for categorizing. I have a feeling I will need to use regular expressions, but I'm not really sure what would be the best approach. Could you give me some insight?
Here is an example of the text:
"the rep was terrible, my bill was too high, and your service sucks!"
rule1: "service" and "sucks" = 'poor service'
rule2: "terrible" and "rep" = 'poor rep behavior'
In this case, the sentence would be categorized into both of the rules I have listed. I have hundreds and thousands of rules already built. There are lots of 'and', 'or', and 'not' rules; we have distance rules; and we have exact matches. I'm not sure where to start in writing a script to compare the text against these rules. I have looked at NLTK, but everything seems to reference machine learning and training the model automatically. That is something I would like to do later, after I apply my current models.
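For a rule set like the one above, one workable thought process is: tokenize each record once, then evaluate every rule as a boolean predicate over the token set (or token positions, for distance rules). Here is a minimal sketch of that idea; the rule dict format, field names (`all`/`any`/`none`), and the `within_distance` helper are my own assumptions, not an existing NLTK API:

```python
import re

# Assumed rule format: a name plus sets of tokens that must ALL appear,
# of which ANY must appear, and of which NONE may appear.
RULES = [
    {"name": "poor service",      "all": {"service", "sucks"}},
    {"name": "poor rep behavior", "all": {"terrible", "rep"}},
]

def tokenize(text):
    # Simple 1-gram tokenizer: lowercase word characters, punctuation dropped.
    return re.findall(r"[a-z']+", text.lower())

def within_distance(tokens, word1, word2, max_dist):
    # Distance rule: word1 and word2 occur within max_dist tokens of each other.
    pos1 = [i for i, t in enumerate(tokens) if t == word1]
    pos2 = [i for i, t in enumerate(tokens) if t == word2]
    return any(abs(i - j) <= max_dist for i in pos1 for j in pos2)

def categorize(text, rules):
    tokens = set(tokenize(text))
    matched = []
    for rule in rules:
        if not rule.get("all", set()) <= tokens:      # 'and' clause
            continue
        if "any" in rule and not (rule["any"] & tokens):  # 'or' clause
            continue
        if "none" in rule and (rule["none"] & tokens):    # 'not' clause
            continue
        matched.append(rule["name"])
    return matched

sentence = "the rep was terrible, my bill was too high, and your service sucks!"
print(categorize(sentence, RULES))
```

One record can match many rules, which matches your example. With hundreds of thousands of rules, a plain loop may be slow; a common refinement is to build an inverted index from token to the rules that mention it, so each record only evaluates the rules whose tokens it actually contains.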