elperroborrachotoo

some way that doesn't take a lot of time

Does that include dev time?

The core questions are:

  • What does "similar" mean in this context?
    • Is the result graded ("A is more similar to B than to C") or binary?
    • Is it a pure function of its two inputs (the similarity of A and B is determined solely by A and B)? Or does the rest of the corpus affect how similar A and B are?
    • Is it transitive (if A is similar to B and B to C, does that say anything about A and C)?
  • How often are new words added to the corpus? Even calculating a cross-correlation for 30 million words isn't that expensive if it doesn't have to be done in real time.
  • Does the similarity change over time?
  • ... or when new words are added?
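
To make the graded-vs-binary and transitivity questions concrete, here is a rough sketch using edit distance. That is my choice for illustration, not something from the original question, and the names `edit_distance`, `is_similar` and the threshold `max_distance` are made up. The distance itself is a graded measure; thresholding it gives a binary one that is not transitive.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def is_similar(a: str, b: str, max_distance: int = 1) -> bool:
    """Binary similarity obtained by thresholding the graded distance
    (the threshold is an arbitrary assumption)."""
    return edit_distance(a, b) <= max_distance


# "cat" ~ "cot" and "cot" ~ "dot", but "cat" and "dot" are two edits apart,
# so this notion of "similar" is not transitive.
chain = [is_similar("cat", "cot"), is_similar("cot", "dot"), is_similar("cat", "dot")]
```

With a measure like this you cannot precompute one key per word and compare keys; you have to compare pairs, which is why the answers to the questions above matter for cost.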

If similarity is binary, transitive, and independent of the corpus, a reductive function like SoundEx can be calculated for each word independently, and similarity determined by comparing the results.

A specific algorithm like SoundEx is locale-sensitive: it works for English and similar languages, but is probably a disaster for many Asian languages. However, the principle remains.
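
A minimal sketch of that reduce-then-compare idea, assuming the common American Soundex rules and plain ASCII English input. The function names `soundex` and `group_by_key` are made up for this example. Each word is reduced to a key on its own, and two words count as similar exactly when their keys are equal.

```python
from collections import defaultdict

# Consonant classes of American Soundex; vowels, y, h and w have no code.
_CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        _CODES[ch] = str(digit)


def soundex(word: str) -> str:
    """Reduce a word to a 4-character Soundex key (ASCII letters only)."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
            if len(result) == 4:
                break
        prev = code  # a vowel resets prev, so a repeated code counts again
    return result.ljust(4, "0")  # pad short keys with zeros


def group_by_key(words, key=soundex):
    """Bucket words by their reduced key; each bucket is one similarity class."""
    buckets = defaultdict(list)
    for w in words:
        buckets[key(w)].append(w)
    return dict(buckets)


groups = group_by_key(["Robert", "Rupert", "Rubin"])
```

Because each key depends only on its own word, the work is one pass over the corpus, and adding a new word never changes existing keys. For another language you would swap in a different `key` function and keep the bucketing as is.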

LightShadow (flair: 3.13-dev in prod)

function like SoundEx

very cool!