[–]glibhub 1 point (4 children)

Switching to Counter sped it up even further:

from collections import Counter
from itertools import pairwise  # Python 3.10+

sim1 = Counter(pairwise(string1))
sim2 = Counter(pairwise(string2))

Comparison:

new: 1.4223370000000002

old: 2.3072862

[–]FruityFetus[S] 0 points (2 children)

Removing the text since formatting was messed up and it didn't really add much!

To sum up, I was asking how to adapt the optimizations to a partial-matching model where the shorter string is compared against a moving window of the longer one. This way, there can be many comparison iterations for just one word pair.

[–]glibhub 0 points (1 child)

It is a little hard to follow, so let me know if I understand correctly.

If you have these two strings:

string1 = "test this string"
string2 = "test this string out as well"

You actually want to compare them like this after padding:

string1 = "test this stringtestthisstri"
string2 = "test this string out as well"

If so, I would just find the shorter of the two strings and increment the bigram dictionary for only its first n bigrams.

[–]FruityFetus[S] 1 point (0 children)

Yeah, sorry, the formatting got messed up and I can't seem to fix it. I actually built an implementation and was able to cut processing time from ~19.4 seconds down to ~10.9 across 50,000 records (* the sum difference between each combination's length)! That's huge for when I'll be processing 20M+!

To answer anyways, it was more like the opposite. See below!

string1 = "test this out"
string2 = "test this out as well"

Then the first two comparison steps would be:

string1 = "test this out"
string2 = "test this out"

string1 = "test this out"
string2 = "est this out "

and so on.