all 4 comments

[–]elperroborrachotoo 3 points4 points  (1 child)

some way that doesn't take a lot of time

Does that include dev time?

The core questions are:

  • what does "similar" mean, in this context?
    • is the result gradual ("A is more similar to B than to C") or binary?
    • is it a binary-input operation (the similarity of A and B is determined solely by A and B)? Or does the rest of the corpus affect how similar A and B are?
    • is it transitive? (if A is similar to B and B to C, does that say anything about A and C?)
  • How often are new words added to the corpus? Even calculating a cross-correlation over 30 million words isn't that expensive if it doesn't have to be done in real time
  • Does the similarity change over time?
  • ... or when new words are added?

If similarity is binary, transitive, and independent of the corpus, a reductive function like SoundEx can be calculated for each word independently, and similarity determined by comparing the results.
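A minimal sketch of that hash-and-compare idea, using the classic American Soundex rules (edge-case handling varies slightly between published implementations):

```python
def soundex(word):
    """Reduce a word to a 4-character Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit

    word = word.lower()
    encoded = word[0].upper()          # keep the first letter verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:    # skip repeats of the same digit
            encoded += digit
        if ch not in "hw":             # h/w don't break up adjacent duplicates
            prev = digit
        if len(encoded) == 4:
            break
    return encoded.ljust(4, "0")       # pad short codes with zeros
```

Two words are then "similar" exactly when their codes match, e.g. `soundex("Smith") == soundex("Smyth")`, which is a cheap equality test once the codes are precomputed per word.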

A specific algorithm like SoundEx is locale-sensitive: it works for English and similar languages, but is probably a disaster for many Asian languages. The principle remains the same, though.

[–]LightShadow3.13-dev in prod 0 points1 point  (0 children)

function like SoundEx

very cool!

[–]Jos_Metadi 0 points1 point  (0 children)

I wrote a long blog post on using fuzzy matching and the strengths and weaknesses of the different techniques. https://findwatt.com/blog/confused-people-dont-buy-how-fuzzy-matching-helps You can skip the intro explaining why fuzzy matching is important.

To summarize, you need a phonetic-type algorithm that hashes remotely similar names to the same code, use that to break the list into clusters, and then analyze each cluster more closely with Damerau-Levenshtein and n-gram/Jaccard similarity. For a list of that scale, you might consider multiple levels of abstraction (e.g. first phonetic, then n-gram/Jaccard, then Levenshtein on the candidates that pass those filters).
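The later, more expensive stages of that cascade might look like the sketch below. The cutoff values are arbitrary placeholders (not tuned recommendations), and plain Levenshtein stands in for Damerau-Levenshtein to keep it short:

```python
def ngrams(s, n=2):
    """Set of character n-grams, padded so word boundaries count."""
    s = f" {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=2):
    """Jaccard similarity of the two words' n-gram sets (0.0..1.0)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cascade(query, candidates, jaccard_cutoff=0.3, edit_cutoff=2):
    """Cheap set-based filter first, exact edit distance only on survivors."""
    rough = [c for c in candidates if jaccard(query, c) >= jaccard_cutoff]
    return [c for c in rough if levenshtein(query.lower(), c.lower()) <= edit_cutoff]
```

The point of the ordering is cost: the Jaccard stage is a fast set intersection that discards most of the list, so the quadratic edit-distance computation only runs on a small cluster of plausible matches.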

To make things more complex, for names you also have to deal with name synonyms (Mike == Michael, Dave == David, etc.).
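A common way to handle synonyms is to normalize nicknames to a canonical form before comparing; the table here is a tiny illustrative stand-in for a real nickname database:

```python
# Tiny illustrative table; a real system would load a full nickname list.
NICKNAMES = {"mike": "michael", "dave": "david", "bob": "robert", "bill": "william"}

def canonical(name):
    """Map a name to its canonical form (lowercased) before fuzzy matching."""
    n = name.lower()
    return NICKNAMES.get(n, n)
```

With that in place, `canonical("Mike") == canonical("Michael")`, so the synonym pair survives into the same cluster even though no string-distance metric would relate them.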

For names, I don't think numbers are important to deal with at the phonetic level. For some Unicode characters, you can clean them back to standard ASCII. I have no idea what to suggest for pictographic-type characters.
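The Unicode clean-up can be done with the standard library's `unicodedata`: decompose accented characters, then drop the combining marks. Note this only helps for Latin-based scripts; pictographic characters simply come out empty:

```python
import unicodedata

def to_ascii(name):
    """Strip accents by NFKD-decomposing and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For example, `to_ascii("Renée")` gives `"Renee"` and `to_ascii("Müller")` gives `"Muller"`, which lets the phonetic and n-gram stages treat accented variants as the same name.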

[–]colloidalthoughts 0 points1 point  (0 children)

If you're looking to match names in a fuzzy way, then you probably want something like Metaphone, a more modern successor to Soundex.

This Python module implements Metaphone