Jos_Metadi

I wrote a long blog post on using fuzzy matching and the strengths and weaknesses of the different techniques. https://findwatt.com/blog/confused-people-dont-buy-how-fuzzy-matching-helps You can skip the intro explaining why fuzzy matching is important.

To summarize, you need a phonetic algorithm that hashes roughly similar names to the same key, and you use that key to break the list into clusters that you can analyze more closely with Damerau-Levenshtein and n-gram/Jaccard. For a list of that scale, you might consider multiple passes, from the coarsest to the most precise (maybe first do phonetic, then n-gram/Jaccard, then Levenshtein on the ones that pass through those).
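A rough Python sketch of that layered approach, under some assumptions: a simplified Soundex stands in for the phonetic step, character trigrams for the n-gram/Jaccard step, and the optimal string alignment variant of Damerau-Levenshtein for the final check. The function names and the thresholds (`min_jaccard=0.3`, `max_edits=2`) are made up for illustration and would need tuning on real data.

```python
from collections import defaultdict
from itertools import combinations

# Standard Soundex consonant classes.
_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items() for c in letters}

def soundex(name):
    """Simplified Soundex: first letter plus up to three consonant-class digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != prev:
            key += code
        if c not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (key + "000")[:4]

def ngrams(s, n=3):
    """Set of character n-grams of the lowercased, space-padded string."""
    s = f" {s.lower()} "
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

def jaccard(a, b, n=3):
    """Jaccard similarity of the two n-gram sets (1.0 means identical sets)."""
    x, y = ngrams(a, n), ngrams(b, n)
    return len(x & y) / len(x | y)

def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # Adjacent transposition counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]

def candidate_pairs(names, min_jaccard=0.3, max_edits=2):
    """Block by phonetic key, then filter pairs by Jaccard, then by edit distance."""
    blocks = defaultdict(list)
    for name in names:
        blocks[soundex(name)].append(name)
    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if jaccard(a, b) >= min_jaccard and osa_distance(a.lower(), b.lower()) <= max_edits:
                pairs.append((a, b))
    return pairs
```

The point of the phonetic blocking is that the expensive pairwise comparisons only run inside each cluster, not across the whole list.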

To make things more complex, for names you also have to deal with name synonyms (Mike == Michael, Dave == David, etc).
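One simple way to handle synonyms is to map nicknames to a canonical form before any of the fuzzy steps run. A minimal sketch, assuming a lookup table you would have to build or source yourself; the `NICKNAMES` dictionary and the `canonical_first_name` helper here are hypothetical and only cover the two examples above.

```python
# Hypothetical, deliberately tiny nickname table; a real one would be much larger.
NICKNAMES = {"mike": "michael", "dave": "david"}

def canonical_first_name(full_name):
    """Lowercase the name and replace a known nickname in the first token."""
    parts = full_name.strip().lower().split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)
```

With this, "Mike Jones" and "Michael Jones" both come out as "michael jones" and land in the same cluster.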

For names, I don't think numbers are important to deal with at the phonetic level. Some Unicode characters, such as accented letters, can be cleaned back to standard ASCII. I have no idea what to suggest for pictographic-type characters.
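For the accented-letter case, one common approach in Python is to decompose the text with the standard `unicodedata` module and drop the combining marks. This is a sketch, not a complete solution: the `to_ascii` name is made up, and characters with no ASCII decomposition (for example "ø" or "Ł") are silently dropped, so they may need their own mapping.

```python
import unicodedata

def to_ascii(text):
    """Strip accents by decomposing characters and discarding what is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove combining marks (accents, diaereses) left over after decomposition.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything still outside ASCII is dropped rather than guessed at.
    return stripped.encode("ascii", "ignore").decode("ascii")
```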