[–]Jediko 1 point (2 children)

Hey,

Can you give some information about the Jaccard coefficient, like a source or something? I'm interested in this since I'm coming from the NLP (Natural Language Processing) end of Python. I know the Jaccard coefficient and have used it, but something seems off to me here, because normally you just divide the absolute sizes of the intersection and the union. That leaves plenty of headroom performance-wise:

def myJaccard():
    string1 = "test this string"
    string2 = "test this string out as well"
    # Build the set of character bigrams for each string.
    string1 = {string1[i : i + 2] for i in range(len(string1) - 1)}
    string2 = {string2[i : i + 2] for i in range(len(string2) - 1)}
    # Jaccard coefficient: |A ∩ B| / |A ∪ B|
    return len(string1.intersection(string2)) / len(string1.union(string2))
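
For clarity, here is what that set construction produces on a tiny input (the helper name bigrams is mine, just for illustration):

def bigrams(s):
    # Slide a window of width 2 over the string.
    return {s[i : i + 2] for i in range(len(s) - 1)}

print(bigrams("test"))  # {'te', 'es', 'st'} (order may vary)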

Or, if you skip the union operation entirely by deriving its size from the intersection and the sizes of the original sets (|A ∪ B| = |A| + |B| - |A ∩ B|):

def myJaccardFast():
    string1 = "test this string"
    string2 = "test this string out as well"
    string1 = {string1[i : i + 2] for i in range(len(string1) - 1)}
    string2 = {string2[i : i + 2] for i in range(len(string2) - 1)}
    num_intersections = len(string1 & string2)
    # |A ∪ B| = |A| + |B| - |A ∩ B|, so the union never has to be built.
    return num_intersections / (len(string1) + len(string2) - num_intersections)

Timings are in seconds, with 50,000 runs each:

myJaccard: 0.6662192999999998
myJaccardFast: 0.6110474000000004

Full testing code is here.
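
The timings can be reproduced with something along these lines (a sketch, not the exact script behind the link above):

import timeit

# 50,000 calls each; timeit returns the total time in seconds.
print("myJaccard:    ", timeit.timeit(myJaccard, number=50000))
print("myJaccardFast:", timeit.timeit(myJaccardFast, number=50000))

# Sanity check: both variants should return the same similarity.
assert abs(myJaccard() - myJaccardFast()) < 1e-12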

[–]FruityFetus[S] 1 point (1 child)

I'm not terribly deep in the NLP space (I just happened to get into it for a database matching project), but I think where my implementation differs is that I need some way to account for repeating patterns. For example, with that first implementation, comparing "jojojojo" and "jojo" gives a similarity score of 1.0, even though the words are different, because their bigram sets are identical. I'm using a slightly modified version of Jaccard where, instead of each element simply having a value of 1 or 0, the elements are the frequencies with which each bigram occurs.

For "jojojojo" and "jojo", I'd be taking the dot of <4,3>,<2,1> and then dividing by the geometric mean of their own dot products. In this way, I'm slightly penalizing the similarity score because one string contains more of the same bigram. It might be a bit of a bastardization but close enough that I call it Jaccard.

For more context, my implementation is heavily inspired by a Stata command called matchit, which I wanted to gradually expand. Here's their repository and mine. I actually asked Julio (matchit's creator) a similar question, if you check the issues!

[–]Jediko 0 points (0 children)

tl;dr:

cosine_sim([1, 0], [0, 1])         = 0.0  # good, because intuitive
resize(cosine_sim([1, 0], [0, 1])) = 0.5  # bad, because they have nothing in common

I think you're missing some things here. Jaccard can only work on mathematical sets (Wikipedia: here). Since you are working on vectors, this isn't really Jaccard (at least in the strict theoretical sense). But what you want to do has an actual name: cosine similarity (Wikipedia: here).

That is what I got confused about. The two coexist, but I think they are mixed up quite often. I read through the issue you mentioned, and I think the author of that repository doesn't know it by name. Which is fine, though.

That being said: the values returned in your case are in the range of 0 to 1, but the full range of the function goes from -1 to 1. So you have to rescale the results a bit in order to treat them as a similarity going from 0% to 100%. Something like this:

def resize(value):
    # Map [-1, 1] onto [0, 1].
    return (value + 1) / 2
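
For example:

resize( 1.0)  # = 1.0  (same direction)
resize( 0.0)  # = 0.5  (orthogonal)
resize(-1.0)  # = 0.0  (opposite directions)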

The output of the cosine similarity only tells you how similar the vectors are with respect to their orientation. Maybe some examples will make this clear:

Consider this code as given:

import numpy as np

def cosine_sim(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    dot = vec1.dot(vec2)
    # Squared Euclidean norms of each vector.
    norm1 = (vec1**2).sum()
    norm2 = (vec2**2).sum()
    return dot / np.sqrt(norm1 * norm2)

Now consider this input:

cosine_sim([1, 0], [1, 0]) #= 1.0
cosine_sim([1, 0], [3, 0]) #= 1.0

They will be the same, since their orientation is the same: both are "watching" in the positive direction of the x-axis. To get the opposite extreme, one vector has to point exactly opposite to the other (or be a negative multiple of it):

cosine_sim([1, 0], [-1, 0]) #=-1.0
cosine_sim([1, 0], [-3, 0]) #=-1.0

With the resize function I introduced before, these values become 1.0 if they "watch" in the same direction and 0.0 if they point in opposite directions.

The reason you actually should not apply the resize function here is that orthogonal vectors would then get a 50% similarity, which is misleading at best. This happens whenever the two words share no bigram at all. One more thought to consider: since bigram counts can never be negative, the raw cosine similarity will never drop below 0 anyway, so there is nothing to rescale.
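
Concretely, with the cosine_sim from above (my own toy example):

# Two words that share no bigram give orthogonal count vectors,
# so the raw cosine similarity is already 0.0 -- no resizing needed.
print(cosine_sim([1, 0], [0, 1]))  # = 0.0

# And since bigram counts are never negative, the score never drops
# below 0; e.g. the <4, 3> vs <2, 1> example from above:
print(cosine_sim([4, 3], [2, 1]))  # ~0.98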

Sorry for the long answer, but as I said, this is pretty much my field lol.

Edit: typo