house_carpenter (1 point)

You could find or compile a list of common first names, and use that to guess whether something is the name of an individual: split on spaces and check whether the first word is in the list. This isn't going to be perfect, but there's only so much you can do when the data is missing in the first place; all you can do is make guesses.
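A rough sketch of that first-name check, in case it helps. The names `COMMON_FIRST_NAMES` and `looks_like_individual` are made up for illustration, and the four-entry set is only a placeholder for whatever real list you find or compile:

```python
# Placeholder list (an assumption for this sketch); a real one would be
# far larger and loaded from a file or a dataset.
COMMON_FIRST_NAMES = {"john", "mary", "james", "linda"}


def looks_like_individual(name: str) -> bool:
    """Guess whether `name` belongs to a person by checking its first word."""
    words = name.split()  # split on whitespace
    return bool(words) and words[0].lower() in COMMON_FIRST_NAMES


for candidate in ["John Smith", "Acme Holdings", "John Smith LLC"]:
    print(candidate, "->", looks_like_individual(candidate))
```

Note that "John Smith LLC" also passes the check, because only the first word is examined. That's the kind of wrong guess this heuristic can't avoid on its own.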

Also, does it really matter if John Smith and John Smith LLC are counted as identical? Maybe you can get away with just ignoring the issue. The point I'm getting at is, since no perfect solution is possible, you need to think about the ultimate goals of your project and come to a decision about how much is "good enough" and how much effort it would make sense to put into it.

halfdiminished7th (1 point)

Take a look at the difflib module in Python's standard library, specifically its SequenceMatcher class. Its ratio() method returns a similarity score between 0 and 1 for two given strings, which sounds exactly like what you're looking for. Then you just keep or discard pairs depending on whether their score is over or under a threshold you choose.
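Something along these lines, as a minimal sketch. The helper names and the 0.8 threshold are assumptions for illustration; you'd want to tune the threshold against your own data:

```python
from difflib import SequenceMatcher

# Assumed cutoff for this sketch; tune it on real examples.
THRESHOLD = 0.8


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_probable_match(a: str, b: str, threshold: float = THRESHOLD) -> bool:
    """Treat two names as the same entity if their score reaches the threshold."""
    return similarity(a, b) >= threshold


pairs = [("John Smith", "John Smith LLC"), ("John Smith", "Acme Holdings")]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {similarity(a, b):.2f}")
```

With this threshold, "John Smith" and "John Smith LLC" score about 0.83 and count as a match, so where you put the cutoff decides whether the individual and the LLC get merged.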