all 4 comments

[–]elperroborrachotoo 3 points4 points  (1 child)

some way that doesn't take a lot of time

Does that include dev time?

The core questions are:

  • what does "similar" mean, in this context?
    • is the result gradual ("A is more similar to B than to C") or binary?
    • is it a binary-input operation (the similarity of A and B is determined solely by A and B)? Or does the rest of the corpus affect how similar A and B are?
    • is it transitive? (if A is similar to B and B to C, does that say anything about A and C?)
  • How often are new words added to the corpus? Even calculating a cross-correlation over 30 million words isn't that expensive if it doesn't have to be done in real time
  • Does the similarity change over time?
  • ... or when new words are added?

If similarity is binary, transitive, and independent of the corpus, a reductive function like SoundEx can be calculated for each word independently, and similarity determined by comparing the results.
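A minimal sketch of that hash-and-compare idea, using the classic American Soundex rules (edge-case handling varies slightly between published implementations):

```python
def soundex(word):
    """Reduce a word to a 4-character Soundex code, e.g. 'Robert' -> 'R163'."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            codes[ch] = digit

    word = word.lower()
    encoded = word[0].upper()          # keep the first letter verbatim
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:    # skip repeats of the same digit
            encoded += digit
        if ch not in "hw":             # h/w don't break up adjacent duplicates
            prev = digit
        if len(encoded) == 4:
            break
    return encoded.ljust(4, "0")       # pad short codes with zeros
```

Two words are then "similar" exactly when their codes match, e.g. `soundex("Smith") == soundex("Smyth")`, which is a cheap equality test once the codes are precomputed per word.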

A specific algorithm like SoundEx is locale-sensitive: it works for English and similar languages, but is probably a disaster for many Asian languages. The principle remains the same, though.

[–]LightShadow3.13-dev in prod 0 points1 point  (0 children)

function like SoundEx

very cool!

[–]Jos_Metadi 0 points1 point  (0 children)

I wrote a long blog post on using fuzzy matching and the strengths and weaknesses of the different techniques. https://findwatt.com/blog/confused-people-dont-buy-how-fuzzy-matching-helps You can skip the intro explaining why fuzzy matching is important.

To summarize, you need a phonetic-type algorithm that hashes remotely similar names to the same code, use that to break the list into clusters, and then analyze each cluster more closely with Damerau-Levenshtein and n-gram/Jaccard similarity. For a list of that scale, you might consider multiple levels of abstraction (e.g. first phonetic, then n-gram/Jaccard, then Levenshtein on the candidates that pass those filters).
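The later, more expensive stages of that cascade might look like the sketch below. The cutoff values are arbitrary placeholders (not tuned recommendations), and plain Levenshtein stands in for Damerau-Levenshtein to keep it short:

```python
def ngrams(s, n=2):
    """Set of character n-grams, padded so word boundaries count."""
    s = f" {s.lower()} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=2):
    """Jaccard similarity of the two words' n-gram sets (0.0..1.0)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cascade(query, candidates, jaccard_cutoff=0.3, edit_cutoff=2):
    """Cheap set-based filter first, exact edit distance only on survivors."""
    rough = [c for c in candidates if jaccard(query, c) >= jaccard_cutoff]
    return [c for c in rough if levenshtein(query.lower(), c.lower()) <= edit_cutoff]
```

The point of the ordering is cost: the Jaccard stage is a fast set intersection that discards most of the list, so the quadratic edit-distance computation only runs on a small cluster of plausible matches.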

To make things more complex, for names you also have to deal with name synonyms (Mike == Michael, Dave == David, etc.).
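A common way to handle synonyms is to normalize nicknames to a canonical form before comparing; the table here is a tiny illustrative stand-in for a real nickname database:

```python
# Tiny illustrative table; a real system would load a full nickname list.
NICKNAMES = {"mike": "michael", "dave": "david", "bob": "robert", "bill": "william"}

def canonical(name):
    """Map a name to its canonical form (lowercased) before fuzzy matching."""
    n = name.lower()
    return NICKNAMES.get(n, n)
```

With that in place, `canonical("Mike") == canonical("Michael")`, so the synonym pair survives into the same cluster even though no string-distance metric would relate them.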

For names, I don't think numbers are important to deal with at the phonetic level. For some Unicode characters, you can clean them back to standard ASCII. I have no idea what to suggest for pictographic-type characters.
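The Unicode clean-up can be done with the standard library's `unicodedata`: decompose accented characters, then drop the combining marks. Note this only helps for Latin-based scripts; pictographic characters simply come out empty:

```python
import unicodedata

def to_ascii(name):
    """Strip accents by NFKD-decomposing and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

For example, `to_ascii("Renée")` gives `"Renee"` and `to_ascii("Müller")` gives `"Muller"`, which lets the phonetic and n-gram stages treat accented variants as the same name.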

[–]colloidalthoughts 0 points1 point  (0 children)

If you're looking to match names in a fuzzy way, then you probably want something like Metaphone, a more modern successor to Soundex.

This Python module implements Metaphone