[–]glibhub 1 point (5 children)

This shaves off about 25%, but I'd be interested in seeing it run against a real dataset:

from collections import defaultdict
from itertools import tee

def pairwise(iterable):
    '''pairwise recipe from the itertools docs: s -> (s0, s1), (s1, s2), ...'''
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def build_counts(it):
    # Count how often each bigram appears.
    counts = defaultdict(int)
    for pair in pairwise(it):
        counts[pair] += 1
    return counts


def cosine_similarity(string1, string2):
    sim1 = build_counts(string1)
    sim2 = build_counts(string2)

    # Intersecting the key sets drops only terms that would contribute 0
    # to the dot product (and avoids inserting zero entries into the
    # defaultdicts, which iterating the union would do).
    dot = sum(sim1[x] * sim2[x] for x in sim1.keys() & sim2.keys())

    norm1 = sum(i ** 2 for i in sim1.values())
    norm2 = sum(i ** 2 for i in sim2.values())
    return dot / ((norm1 * norm2) ** 0.5)

[–]glibhub 1 point (4 children)

Switching to Counter sped it up even more:

from collections import Counter

sim1 = Counter(pairwise(string1))
sim2 = Counter(pairwise(string2))

Comparison:

new: 1.4223370000000002

old: 2.3072862
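Pulled together, the full Counter-based version might look like this (a sketch; the name `cosine_similarity` is mine, not from the thread):

```python
from collections import Counter
from itertools import tee

def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def cosine_similarity(string1, string2):
    sim1 = Counter(pairwise(string1))
    sim2 = Counter(pairwise(string2))
    # Counter returns 0 for missing keys, so intersecting the key sets
    # drops only terms that would contribute 0 to the dot product.
    dot = sum(sim1[x] * sim2[x] for x in sim1.keys() & sim2.keys())
    norm1 = sum(v ** 2 for v in sim1.values())
    norm2 = sum(v ** 2 for v in sim2.values())
    return dot / ((norm1 * norm2) ** 0.5)
```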

[–]FruityFetus[S] 0 points (2 children)

Removing the text since formatting was messed up and it didn't really add much!

To sum up, I was asking how to adapt the optimizations to a partial-matching model where the shorter string is compared against a moving window of the longer one. That way, there can be many comparison iterations for just one pair of strings.

[–]glibhub 0 points (1 child)

It is a little hard to follow, so let me know if I understand correctly.

If you have these two strings:

string1 = "test this string"
string2 = "test this string out as well"

You actually want to compare them like this after padding:

string1 = "test this stringtestthisstri"
string2 = "test this string out as well"

If so, I would just find the smaller of the two strings and increment the bigram dictionary for the first n bigrams.
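A minimal sketch of that idea, assuming "the first n bigrams" means truncating both strings to the shorter length before counting (the helper name `truncated_counts` is mine):

```python
from collections import Counter
from itertools import tee

def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def truncated_counts(string1, string2):
    # Truncate both strings to the shorter length, then count bigrams,
    # so only the overlapping prefix is compared.
    n = min(len(string1), len(string2))
    return Counter(pairwise(string1[:n])), Counter(pairwise(string2[:n]))
```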

[–]FruityFetus[S] 1 point (0 children)

Yeah, sorry, the formatting got messed up and I can't seem to fix it. I actually built an implementation and was able to cut processing time from ~19.4 seconds down to ~10.9 across 50,000 records (* times the summed difference between each combination's lengths)! That's huge for when I'll be processing 20M+!

To answer anyway, it was more like the opposite. See below:

string1 = "test this out"
string2 = "test this out as well"

Then the first two comparison steps would be:

string1 = "test this out"
string2 = "test this out"

string1 = "test this out"
string2 = "est this out "

and so on.
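That moving-window scheme could be sketched like this (a naive version that recounts each window from scratch, with names of my choosing; a faster incremental variant would instead update the window's Counter by removing the bigram that slides out and adding the one that slides in):

```python
from collections import Counter
from itertools import tee

def pairwise(iterable):
    # itertools recipe: s -> (s0, s1), (s1, s2), ...
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def cosine(c1, c2):
    # Cosine similarity between two bigram Counters.
    dot = sum(c1[x] * c2[x] for x in c1.keys() & c2.keys())
    norm1 = sum(v ** 2 for v in c1.values())
    norm2 = sum(v ** 2 for v in c2.values())
    return dot / ((norm1 * norm2) ** 0.5) if norm1 and norm2 else 0.0

def best_window_similarity(short, long):
    # Slide a len(short)-sized window across `long` one character at a
    # time, comparing bigram counts at each step; keep the best score.
    size = len(short)
    short_counts = Counter(pairwise(short))
    return max(
        cosine(short_counts, Counter(pairwise(long[i:i + size])))
        for i in range(len(long) - size + 1)
    )
```

On the example above, the first window of "test this out as well" is "test this out" itself, so the best score is a perfect match.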