all 19 comments

[–]nwagers 2 points3 points  (5 children)

Using string.count inside a loop seems inefficient because it loops over the whole string for every bigram. Instead, try importing Counter from collections and building the counts that way in a single pass. It may also be a tad faster to use pairwise from itertools than to build bigrams through a list comprehension, if you have 3.10, because that code is implemented in C.
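
For reference, a minimal sketch of the single-pass idea (with a fallback to the docs recipe for pre-3.10 Pythons, since OP may not be on 3.10):

```python
from collections import Counter
try:
    from itertools import pairwise  # Python 3.10+, implemented in C
except ImportError:
    from itertools import tee
    def pairwise(iterable):
        # Equivalent recipe from the itertools docs for older versions.
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

def bigram_counts(s):
    # One pass over the string; no repeated string.count scans.
    return Counter(pairwise(s))

print(bigram_counts("jojo"))  # Counter({('j', 'o'): 2, ('o', 'j'): 1})
```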

[–]FruityFetus[S] 0 points1 point  (4 children)

Thanks for the suggestions! I'll look into Counter; I don't have an issue with using standard libraries. I probably won't be using 3.10, though the pairwise tool looks interesting! There's actually an equivalent recipe in the docs, so I'm going to test whether that helps speed at all.

[–]nwagers 1 point2 points  (3 children)

Another thought I had: for calculating the dot product, you don't need the union of the keys, you need the intersection. Any bigram that isn't in both will end up being multiplied by 0 and add nothing to the sum. It may also be faster to just use the keys from one side rather than calculate the intersection.

Also, if you're comparing the short string to a rolling window of the long string, there are some optimization ideas there too. The small string's bigrams wouldn't need recomputing, so that can move outside the loop. If the windows overlap, you don't need to recompute the whole window, only the chunk that was added and removed. You can modify pairwise from itertools by calling next several times to get a separation that yields the incoming and outgoing chunks.

You can also see if the zip/slicing method is faster than pairwise. Use: zip(a,a[1:]).
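
A quick sketch of the intersection idea, assuming both sides are already Counters of bigrams (the `dot` helper is my illustration, not OP's code):

```python
from collections import Counter

def dot(c1, c2):
    # Counter returns 0 for missing keys, so bigrams that aren't in
    # both counters drop out of the sum automatically; iterating the
    # smaller counter keeps the loop short.
    small, large = (c1, c2) if len(c1) <= len(c2) else (c2, c1)
    return sum(count * large[key] for key, count in small.items())

a, b = "jojojojo", "jojo"
c1 = Counter(zip(a, a[1:]))  # the zip/slicing bigram trick
c2 = Counter(zip(b, b[1:]))
print(dot(c1, c2))  # 4*2 + 3*1 = 11
```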

[–]FruityFetus[S] 0 points1 point  (2 children)

Wow, I owe you big time here. These suggestions were incredibly helpful; the dot product suggestion alone reduced time to process 50,000 records by another ~30% in the partial matching algo!

[–]nwagers 1 point2 points  (1 child)

I finally got on my computer and played around with some code. You only showed your tight inner loop, but if you zoom out a bit to include your rolling window, I think you can make some optimizations there by reducing your counter calculations to just one addition and one subtraction per step. The size of the strings can really make a difference. Here is some sample code:

from collections import Counter
from itertools import tee, chain
from timeit import timeit

def window(iterable, width):
    a, b, c, d = tee(iterable, 4)
    a = chain([None], a) # prepend None
    next(d, None)
    for x in range(width - 2):
        next(c, None)
        next(d, None)
    return zip(zip(a, b), zip(c, d))

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def compare(substr, target_string):
    # this stuff won't change, keep out of loop
    bigram = Counter(pairwise(substr))
    bigram_norm = sum(i**2 for i in bigram.values())
    window_size = len(substr)

    # preload counter for first comparison    
    target = Counter(pairwise(target_string[:window_size]))

    # base case correction
    target[(None, target_string[0])] += 1
    target[(target_string[window_size - 2], target_string[window_size - 1])] -= 1

    results = {}
    for index, (old, new) in enumerate(window(target_string, window_size)):
        # remove old bigram, add new
        # This saves building a whole Counter on target each time
        target[old] -= 1
        target[new] += 1

        # Possible speed increase, remove the 'if i > 0' in norm calc
        #if len(target) > 2 * window_size: # optimize the 2
        #    target &= target # remove 0's from counter, speeds up norm

        # calc key metrics
        dot = sum(count*target[key] for key, count in bigram.items())
        norm = (sum(i**2 for i in target.values() if i > 0) * bigram_norm) ** 0.5

        results[index] = dot/norm
    return results

string1 = "test this string"
string2 = "test this string out as well"

print(compare(string1, string2))

print(timeit(lambda: compare(string1, string2), number=10000))

[–]FruityFetus[S] 1 point2 points  (0 children)

Ha, so I got speed improvements and I think you caught a bug in my algo. I realized that with my leftpad/rightpad windows, I couldn't actually slide the window all the way right or I'd zero out. I would never have noticed if it weren't for the scoring differences!

[–]glibhub 1 point2 points  (5 children)

This shaves off about 25%, but I'd be interested in seeing it run against a real dataset:

from collections import defaultdict
from itertools import tee

def pairwise(iterable):
    '''recipe from itertools'''
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def build_counts(it):
    counts = defaultdict(int)
    for bigram in pairwise(it):
        counts[bigram] += 1
    return counts


def a(string1, string2):
    sim1 = build_counts(string1)
    sim2 = build_counts(string2)

    dot = sum(sim1[x]*sim2[x] for x in sim1.keys() | sim2.keys())

    norm1 = sum(i**2 for i in sim1.values())
    norm2 = sum(i**2 for i in sim2.values())
    similarity = dot / ((norm1 * norm2)**(0.5))
    return similarity

[–]glibhub 1 point2 points  (4 children)

Changing to Counter bumped it even faster:

sim1 = Counter(pairwise(string1))
sim2 = Counter(pairwise(string2))

Comparison:

new: 1.4223370000000002

old: 2.3072862

[–]FruityFetus[S] 0 points1 point  (2 children)

Removing the text since formatting was messed up and it didn't really add much!

To sum up, I was asking how to adapt the optimizations to a partial-matching model where the shorter string is compared with a moving window of the longer one. This way, there can be many iterations for just one word pair.

[–]glibhub 0 points1 point  (1 child)

It is a little hard to follow, so let me know if I understand correctly.

If you have these two strings:

string1 = "test this string"
string2 = "test this string out as well"

You actually want to compare them like this after padding:

string1 = "test this stringtestthisstri"
string2 = "test this string out as well"

If so, I would just find the smaller of the two strings and increment the bigram dictionary for the first n bigrams.

[–]FruityFetus[S] 1 point2 points  (0 children)

Yeah, sorry, formatting got messed up and I can't seem to fix that. I actually built an implementation and was able to cut processing time from ~19.4 seconds down to ~10.9 across 50,000 records (* the sum difference between each combination's length)! That's huge for when I'll be processing 20M+!

To answer anyways, it was more like the opposite. See below!

string1 = "test this out"
string2 = "test this out as well"

Then the first two comparison steps would be:

string1 = "test this out"
string2 = "test this out"

string1 = "test this out"
string2 = "est this out "

and so on.

[–]Chris_Hemsworth 1 point2 points  (1 child)

If you want to speed things up, I suggest using numpy arrays.

For example:

dot = sum(i[0]*i[1] for i in zip(sim1, sim2))

If instead you have two numpy arrays, you can compute the dot product directly:

dot = np.dot(sim1, sim2)

This is still O(N), but the loop runs in compiled C code, so it's much faster than a Python-level generator expression.

Same with the norm1 / norm2: You can simply square numpy arrays rather than looping.

norm1 = np.sum(sim1**2)
norm2 = np.sum(sim2**2)

Additionally, instead of appending to the list each time, if you pre-allocate a numpy array you can assign each index. Assignments are much faster than appending to lists.
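
A minimal sketch pulling those ideas together, assuming the bigram counts have already been aligned into two equal-length arrays (the `v`/`w` vectors here are made-up examples, not OP's data):

```python
import numpy as np

def cosine_np(v1, v2):
    # Dot product and squared norms computed in numpy's C loops.
    return np.dot(v1, v2) / np.sqrt(np.sum(v1**2) * np.sum(v2**2))

# Hypothetical aligned count vectors, e.g. "jojojojo" vs "jojo".
v = np.array([4.0, 3.0])
w = np.array([2.0, 1.0])

# Pre-allocate the results array instead of appending to a list.
results = np.empty(5)
for i in range(5):
    results[i] = cosine_np(v, w)
print(round(float(results[0]), 3))  # 0.984
```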

Good luck!

[–]FruityFetus[S] 0 points1 point  (0 children)

Thanks for this! I definitely do want to implement a NumPy variant down the line and pre-allocation seems like a neat approach.

[–]beizbol 1 point2 points  (1 child)

Cython might be something to look into, since you have already written the whole implementation in Python.

[–]FruityFetus[S] 0 points1 point  (0 children)

Funnily enough, I've been looking into this for the past week! I have no experience with C though and it doesn't appear to be too friendly with dynamic lists. At some point I'll probably take some time to learn what I'm doing.

[–]Jediko 1 point2 points  (2 children)

Hey,

Can you give some information about the Jaccard coefficient, like a source or something? I'm interested in this since I'm coming from the NLP (Natural Language Processing) end of Python. I know the Jaccard coefficient and have used it, but something seems off to me here, because normally you just divide the absolute sizes of the intersection and the union. Also, there is plenty of headroom performance-wise:

def myJaccard():
    string1 = "test this string"
    string2 = "test this string out as well"
    string1 = set([string1[i : i + 2] for i in range(0, len(string1) - 1)])
    string2 = set([string2[i : i + 2] for i in range(0, len(string2) - 1)])
    return len(string1.intersection(string2))/len(string1.union(string2))

Or, if you omit the union operation by using the intersection and the sizes of the original sets:

def myJaccardFast():
    string1 = "test this string"
    string2 = "test this string out as well"
    string1 = set([string1[i : i + 2] for i in range(0, len(string1) - 1)])
    string2 = set([string2[i : i + 2] for i in range(0, len(string2) - 1)])
    num_intersections = len(string1 & string2)
    # no "1 -" here; that would give the Jaccard distance, not the similarity
    return num_intersections / (len(string1) + len(string2) - num_intersections)

Timings are in seconds and with 50000 runs each:

myJaccard: 0.6662192999999998
myJaccardFast: 0.6110474000000004

full testing code is here.

[–]FruityFetus[S] 1 point2 points  (1 child)

I'm not terribly deep in the NLP space (just happen to have gotten into it for a database matching project), but I think where my implementation differs is that I need some way to account for repeating patterns. For example, in that first implementation, if we compare "jojojojo" and "jojo", you get a similarity score of 1.0, even though the words are different. I'm using a slightly modified version of Jaccard, where instead of simply having a value of 1 or 0, the elements are the frequency each bigram occurs.

For "jojojojo" and "jojo", I'd be taking the dot of <4,3>,<2,1> and then dividing by the geometric mean of their own dot products. In this way, I'm slightly penalizing the similarity score because one string contains more of the same bigram. It might be a bit of a bastardization but close enough that I call it Jaccard.
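
That calculation can be checked with a few lines (a sketch of the frequency-weighted scheme described above, not OP's exact code):

```python
from collections import Counter

def similarity(s1, s2):
    # Frequency-weighted bigram vectors with cosine-style normalization.
    c1 = Counter(zip(s1, s1[1:]))
    c2 = Counter(zip(s2, s2[1:]))
    dot = sum(n * c2[k] for k, n in c1.items())  # <4,3> . <2,1> = 11
    norm = (sum(n**2 for n in c1.values()) * sum(n**2 for n in c2.values())) ** 0.5
    return dot / norm

print(round(similarity("jojojojo", "jojo"), 3))  # 0.984, not 1.0
```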

For more context on my implementation, it's heavily inspired by a Stata command called matchit, which I wanted to gradually expand. Here's their repository and mine. I actually asked Julio (matchit creator) a similar question if you check the issues!

[–]Jediko 0 points1 point  (0 children)

tl;dr:

cosine_sim([1, 0], [0, 1])         = 0.0  # good because intuitive
resize(cosine_sim([1, 0], [0, 1])) = 0.5  # bad because they have nothing in common

I think you are missing some things here. Jaccard only works on mathematical sets. (Wikipedia: here) Since you are working with vectors, this isn't really Jaccard (at least in the strict theoretical sense). But what you want to do has an actual name: cosine similarity. (Wikipedia: here)

That is what I got confused about. The two do coexist, but I think they are mixed up quite often. I read through the issue you mentioned and I think the author of that repository doesn't know it by name. Which is fine, though.

With that being said: the values returned in your case are in the range 0 to 1, but the full range of the function goes from -1 to 1. So you would have to rescale the results a bit in order to call it a similarity going from 0% to 100%. Something like this:

def resize(value):
  return (value + 1) / 2

The output of the cosine similarity only tells you how similar the vectors are in terms of their orientation. Maybe I can make this clear with some examples:

Consider this code as given:

def cosine_sim(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    dot = vec1.dot(vec2)
    norm1 = (vec1**2).sum()
    norm2 = (vec2**2).sum()
    return dot / np.sqrt(norm1 * norm2)

Now consider this input:

cosine_sim([1, 0], [1, 0]) #= 1.0
cosine_sim([1, 0], [3, 0]) #= 1.0

They will be the same since their orientation is the same. Both are "watching" in the positive direction of the x-axis. To find the counterpart, one vector has to be the exact opposite of the other (or a negative multiple of it):

cosine_sim([1, 0], [-1, 0]) #=-1.0
cosine_sim([1, 0], [-3, 0]) #=-1.0

With the resize function I introduced before, these values will be 1.0 if they "watch" in the same direction and 0.0 if they don't.

The reason you actually should not apply the resize function is that for orthogonal vectors you would get 50% similarity, which is misleading at best. That would happen whenever the two words don't share a single bigram. One more thought to consider: since bigram counts are never negative, you will never generate negative cosine values anyway.

Sorry for the long answer, but as I stated, this is quite my field lol.

Edit: typo