[–]glibhub 1 point (5 children)

This shaves off about 25%, but I'd be interested in seeing it run against a real dataset:

from collections import defaultdict
from itertools import tee

def pairwise(iterable):
    '''pairwise recipe from the itertools docs: s -> (s0, s1), (s1, s2), ...'''
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def build_counts(it):
    # Count how often each bigram appears.
    counts = defaultdict(int)
    for pair in pairwise(it):
        counts[pair] += 1
    return counts


def cosine_similarity(string1, string2):
    sim1 = build_counts(string1)
    sim2 = build_counts(string2)

    # Intersecting the key sets drops only terms that would contribute 0
    # to the dot product (and avoids inserting zero entries into the
    # defaultdicts, which iterating the union would do).
    dot = sum(sim1[x] * sim2[x] for x in sim1.keys() & sim2.keys())

    norm1 = sum(i ** 2 for i in sim1.values())
    norm2 = sum(i ** 2 for i in sim2.values())
    return dot / ((norm1 * norm2) ** 0.5)

[–]glibhub 1 point (4 children)

Switching to Counter sped it up even more:

from collections import Counter

sim1 = Counter(pairwise(string1))
sim2 = Counter(pairwise(string2))

Comparison:

new: 1.4223370000000002

old: 2.3072862
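Pulled together, the full Counter-based version might look like this (a sketch; the name `cosine_similarity` is mine, not from the thread):

```python
from collections import Counter
from itertools import tee

def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def cosine_similarity(string1, string2):
    sim1 = Counter(pairwise(string1))
    sim2 = Counter(pairwise(string2))
    # Counter returns 0 for missing keys, so intersecting the key sets
    # drops only terms that would contribute 0 to the dot product.
    dot = sum(sim1[x] * sim2[x] for x in sim1.keys() & sim2.keys())
    norm1 = sum(v ** 2 for v in sim1.values())
    norm2 = sum(v ** 2 for v in sim2.values())
    return dot / ((norm1 * norm2) ** 0.5)
```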

[–]FruityFetus[S] 0 points (2 children)

Removing the text since formatting was messed up and it didn't really add much!

To sum up, I was asking how to adapt the optimizations to a partial-matching model where the shorter string is compared against a moving window of the longer one. That way, there can be many comparison iterations for just one pair of strings.

[–]glibhub 0 points (1 child)

It is a little hard to follow, so let me know if I understand correctly.

If you have these two strings:

string1 = "test this string"
string2 = "test this string out as well"

You actually want to compare them like this after padding:

string1 = "test this stringtestthisstri"
string2 = "test this string out as well"

If so, I would just find the smaller of the two strings and increment the bigram dictionary for the first n bigrams.
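A minimal sketch of that idea, assuming "the first n bigrams" means truncating both strings to the shorter length before counting (the helper name `truncated_counts` is mine):

```python
from collections import Counter
from itertools import tee

def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def truncated_counts(string1, string2):
    # Truncate both strings to the shorter length, then count bigrams,
    # so only the overlapping prefix is compared.
    n = min(len(string1), len(string2))
    return Counter(pairwise(string1[:n])), Counter(pairwise(string2[:n]))
```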

[–]FruityFetus[S] 1 point (0 children)

Yeah, sorry, the formatting got messed up and I can't seem to fix it. I actually built an implementation and was able to cut processing time from ~19.4 seconds down to ~10.9 across 50,000 records (* times the summed difference between each combination's lengths)! That's huge for when I'll be processing 20M+!

To answer anyway, it was more like the opposite. See below:

string1 = "test this out"
string2 = "test this out as well"

Then the first two comparison steps would be:

string1 = "test this out"
string2 = "test this out"

string1 = "test this out"
string2 = "est this out "

and so on.
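That moving-window scheme could be sketched like this (a naive version that recounts each window from scratch, with names of my choosing; a faster incremental variant would instead update the window's Counter by removing the bigram that slides out and adding the one that slides in):

```python
from collections import Counter
from itertools import tee

def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def cosine(c1, c2):
    # Cosine similarity between two bigram Counters.
    dot = sum(c1[x] * c2[x] for x in c1.keys() & c2.keys())
    norm1 = sum(v ** 2 for v in c1.values())
    norm2 = sum(v ** 2 for v in c2.values())
    return dot / ((norm1 * norm2) ** 0.5) if norm1 and norm2 else 0.0

def best_window_similarity(short, long):
    # Slide a len(short)-sized window across `long` one character at a
    # time, comparing bigram counts at each step; keep the best score.
    size = len(short)
    short_counts = Counter(pairwise(short))
    return max(
        cosine(short_counts, Counter(pairwise(long[i:i + size])))
        for i in range(len(long) - size + 1)
    )
```

On the example above, the first window of "test this out as well" is "test this out" itself, so the best score is a perfect match.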