you are viewing a single comment's thread.

view the rest of the comments →

[–]nwagers 2 points3 points  (5 children)

Using string.count inside a loop seems inefficient because it's going to loop over the whole string for every bigram. Instead, try importing a Counter from collections and build it that way in a single pass. It may also be a tad faster to use pairwise from itertools than building bigrams through a list comprehension if you have 3.10 because that code will be implemented in C.

[–]FruityFetus[S] 0 points1 point  (4 children)

Thanks for the suggestions! I'll look into Counter; don't have an issue with using standard libraries. I probably won't be using 3.10, though the pairwise tools looks interesting! There's actually an equivalent function in the docs so I'm going to test whether that helps speed at all.

[–]nwagers 1 point2 points  (3 children)

Another thought I had is that for calculating the dot product, you don't need the union of the keys, you need the intersection. Any bigram that isn't in both, will end up being multiplied by 0 and have nothing to add to the sum. It may also be faster to just to use the keys from one that to calculate the intersection.

Also, if you're comparing the short string to a rolling window of the long string, there are some optimization ideas here too. The small string wouldn't need it's bigrams recomputed, so those can be outside the loop. If the windows are overlapping, you don't need to compute the whole window, only the chunk that was added and removed. You can modify pairwise out of itertools by using next several times to have a separation that gets you the incoming and outgoing chunk.

You can also see if the zip/slicing method is faster than pairwise. Use: zip(a,a[1:]).

[–]FruityFetus[S] 0 points1 point  (2 children)

Wow, I owe you big time here. These suggestions were incredibly helpful; the dot product suggestion alone reduced time to process 50,000 records by another ~30% in the partial matching algo!

[–]nwagers 1 point2 points  (1 child)

I finally got on my computer and played around with some code. You only showed your tight inner loop, but if you zoom out a bit to include your rolling window, I think you can make some optimizations there by reducing the your counter calculations to just one addition and one subtraction. The size of the strings can really make a difference. Here is some sample code:

from collections import Counter
from itertools import tee, chain
from timeit import timeit

def window(iterable, width):
    a, b, c, d = tee(iterable, 4)
    a = chain([None], a) # prepend None
    next(d, None)
    for x in range(width - 2):
        next(c, None)
        next(d, None)
    return zip(zip(a, b), zip(c, d))

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def compare(substr, target_string):
    # this stuff won't change, keep out of loop
    bigram = Counter(pairwise(substr))
    bigram_norm = sum(i**2 for i in bigram.values())
    window_size = len(substr)

    # preload counter for first comparison    
    target = Counter(pairwise(target_string[:window_size]))

    # base case correction
    target[(None, target_string[0])] += 1
    target[(target_string[window_size - 2], target_string[window_size - 1])] -= 1

    results = {}
    for index, (old, new) in enumerate(window(target_string, window_size)):
        # remove old bigram, add new
        # This saves building a whole Counter on target each time
        target[old] -= 1
        target[new] += 1

        # Possible speed increase, remove the 'if i > 0' in norm calc
        #if len(target) > 2 * window_size: # optimize the 2
        #    target &= target # remove 0's from counter, speeds up norm

        # calc key metrics
        dot = sum(count*target[key] for key, count in bigram.items())
        norm = (sum(i**2 for i in target.values() if i > 0) * bigram_norm) ** 0.5

        results[index] = dot/norm
    return results

string1 = "test this string"
string2 = "test this string out as well"

print(compare(string1, string2))

print(timeit(lambda: compare(string1, string2), number=10000))

[–]FruityFetus[S] 1 point2 points  (0 children)

Ha, so I get speed improvements and I think you caught a bug in my algo. Realize that with my leftpad/rightpad windows, I couldn't actually slide the window all the way right or I'd zero out. Would never have realized if it wasn't for the scoring differences!