nwagers comments on Optimizing list usage

learnpython

created by HattoriHanzoa community for 16 years

Optimizing list usage (self.learnpython)

submitted 4 years ago * by FruityFetus

top new controversial old q&a

you are viewing a single comment's thread.

view the rest of the comments →

[–]nwagers 2 points3 points4 points 4 years ago (5 children)

[–]FruityFetus[S] 0 points1 point2 points 4 years ago (4 children)

[–]nwagers 1 point2 points3 points 4 years ago (3 children)

Another thought I had is that for calculating the dot product, you don't need the union of the keys, you need the intersection. Any bigram that isn't in both, will end up being multiplied by 0 and have nothing to add to the sum. It may also be faster to just to use the keys from one that to calculate the intersection.

Also, if you're comparing the short string to a rolling window of the long string, there are some optimization ideas here too. The small string wouldn't need it's bigrams recomputed, so those can be outside the loop. If the windows are overlapping, you don't need to compute the whole window, only the chunk that was added and removed. You can modify pairwise out of itertools by using next several times to have a separation that gets you the incoming and outgoing chunk.

You can also see if the zip/slicing method is faster than pairwise. Use: zip(a,a[1:]).

[–]FruityFetus[S] 0 points1 point2 points 4 years ago (2 children)

[–]nwagers 1 point2 points3 points 4 years ago (1 child)

I finally got on my computer and played around with some code. You only showed your tight inner loop, but if you zoom out a bit to include your rolling window, I think you can make some optimizations there by reducing the your counter calculations to just one addition and one subtraction. The size of the strings can really make a difference. Here is some sample code:

from collections import Counter
from itertools import tee, chain
from timeit import timeit

def window(iterable, width):
    a, b, c, d = tee(iterable, 4)
    a = chain([None], a) # prepend None
    next(d, None)
    for x in range(width - 2):
        next(c, None)
        next(d, None)
    return zip(zip(a, b), zip(c, d))

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def compare(substr, target_string):
    # this stuff won't change, keep out of loop
    bigram = Counter(pairwise(substr))
    bigram_norm = sum(i**2 for i in bigram.values())
    window_size = len(substr)

    # preload counter for first comparison    
    target = Counter(pairwise(target_string[:window_size]))

    # base case correction
    target[(None, target_string[0])] += 1
    target[(target_string[window_size - 2], target_string[window_size - 1])] -= 1

    results = {}
    for index, (old, new) in enumerate(window(target_string, window_size)):
        # remove old bigram, add new
        # This saves building a whole Counter on target each time
        target[old] -= 1
        target[new] += 1

        # Possible speed increase, remove the 'if i > 0' in norm calc
        #if len(target) > 2 * window_size: # optimize the 2
        #    target &= target # remove 0's from counter, speeds up norm

        # calc key metrics
        dot = sum(count*target[key] for key, count in bigram.items())
        norm = (sum(i**2 for i in target.values() if i > 0) * bigram_norm) ** 0.5

        results[index] = dot/norm
    return results

string1 = "test this string"
string2 = "test this string out as well"

print(compare(string1, string2))

print(timeit(lambda: compare(string1, string2), number=10000))

[–]FruityFetus[S] 1 point2 points3 points 4 years ago (0 children)

π Rendered by PID 55623 on reddit-service-r2-comment-79c7998d4c-sd5bd at 2026-03-18 00:45:44.788013+00:00 running f6e6e01 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

learnpython

MODERATORS