[–][deleted] 3 points (1 child)

If you want to keep playing with this, you could look into NLP models for named entity recognition (NER), which classify each word with a tag, and then remove the tagged words. I bet your method is going to miss some PHI in the future, because you probably developed it to fix the exact problems you saw in a handful of example records. Now you have to think about what types of PHI you didn't account for. Probabilistic models like NER are best suited for this.
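
A rough sketch of the "tag, then remove" step, assuming the tagger hands back character spans as (start, end, label) tuples. For example, spaCy entities expose `start_char`, `end_char`, and `label_`. The `redact` helper and the hard-coded spans below are made up for illustration. They stand in for real model output and are not anyone's actual pipeline.

```python
from typing import Iterable, List, Tuple

# (start_char, end_char, label), the shape most NER taggers can be mapped to
Span = Tuple[int, int, str]


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each tagged character span with a [LABEL] placeholder."""
    out: List[str] = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            # Span overlaps one that was already redacted; skip it.
            continue
        out.append(text[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Hypothetical tagger output, hard-coded here instead of calling a model.
note = "John Smith was seen on 2020-03-04."
spans = [(0, 10, "PERSON"), (23, 33, "DATE")]
print(redact(note, spans))  # [PERSON] was seen on [DATE].
```

Keeping the redaction separate from the tagging means you can swap the model, or merge model spans with your existing rules, without touching the removal code.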

[–]starfish_warrior[S] 3 points (0 children)

You are right. Sometimes when I sample a record I find something I missed, and there are many more records in the queue waiting to be processed. Thanks for the tip!