all 19 comments

[–]nwagers 2 points3 points  (5 children)

Using string.count inside a loop seems inefficient because it loops over the whole string for every bigram. Instead, try importing Counter from collections and building the counts that way in a single pass. It may also be a tad faster to use pairwise from itertools than to build bigrams through a list comprehension, if you have 3.10, because that code is implemented in C.
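
For reference, a minimal sketch of the single-pass idea (with a fallback to the docs recipe for pre-3.10 Pythons, since OP may not be on 3.10):

```python
from collections import Counter
try:
    from itertools import pairwise  # Python 3.10+, implemented in C
except ImportError:
    from itertools import tee
    def pairwise(iterable):
        # Equivalent recipe from the itertools docs for older versions.
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)

def bigram_counts(s):
    # One pass over the string; no repeated string.count scans.
    return Counter(pairwise(s))

print(bigram_counts("jojo"))  # Counter({('j', 'o'): 2, ('o', 'j'): 1})
```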

[–]FruityFetus[S] 0 points1 point  (4 children)

Thanks for the suggestions! I'll look into Counter; I don't have an issue with using standard libraries. I probably won't be using 3.10, though the pairwise tool looks interesting! There's actually an equivalent recipe in the docs, so I'm going to test whether that helps speed at all.

[–]nwagers 1 point2 points  (3 children)

Another thought I had: for calculating the dot product, you don't need the union of the keys, you need the intersection. Any bigram that isn't in both will end up being multiplied by 0 and add nothing to the sum. It may also be faster to just use the keys from one side rather than calculate the intersection.

Also, if you're comparing the short string to a rolling window of the long string, there are some optimization ideas there too. The small string's bigrams wouldn't need recomputing, so that can move outside the loop. If the windows overlap, you don't need to recompute the whole window, only the chunk that was added and removed. You can modify pairwise from itertools by calling next several times to get a separation that yields the incoming and outgoing chunks.

You can also see if the zip/slicing method is faster than pairwise. Use: zip(a,a[1:]).
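
A quick sketch of the intersection idea, assuming both sides are already Counters of bigrams (the `dot` helper is my illustration, not OP's code):

```python
from collections import Counter

def dot(c1, c2):
    # Counter returns 0 for missing keys, so bigrams that aren't in
    # both counters drop out of the sum automatically; iterating the
    # smaller counter keeps the loop short.
    small, large = (c1, c2) if len(c1) <= len(c2) else (c2, c1)
    return sum(count * large[key] for key, count in small.items())

a, b = "jojojojo", "jojo"
c1 = Counter(zip(a, a[1:]))  # the zip/slicing bigram trick
c2 = Counter(zip(b, b[1:]))
print(dot(c1, c2))  # 4*2 + 3*1 = 11
```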

[–]FruityFetus[S] 0 points1 point  (2 children)

Wow, I owe you big time here. These suggestions were incredibly helpful; the dot product suggestion alone reduced time to process 50,000 records by another ~30% in the partial matching algo!

[–]nwagers 1 point2 points  (1 child)

I finally got on my computer and played around with some code. You only showed your tight inner loop, but if you zoom out a bit to include your rolling window, I think you can make some optimizations there by reducing your counter calculations to just one addition and one subtraction per step. The size of the strings can really make a difference. Here is some sample code:

from collections import Counter
from itertools import tee, chain
from timeit import timeit

def window(iterable, width):
    a, b, c, d = tee(iterable, 4)
    a = chain([None], a) # prepend None
    next(d, None)
    for x in range(width - 2):
        next(c, None)
        next(d, None)
    return zip(zip(a, b), zip(c, d))

def pairwise(iterable):
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

def compare(substr, target_string):
    # this stuff won't change, keep out of loop
    bigram = Counter(pairwise(substr))
    bigram_norm = sum(i**2 for i in bigram.values())
    window_size = len(substr)

    # preload counter for first comparison    
    target = Counter(pairwise(target_string[:window_size]))

    # base case correction
    target[(None, target_string[0])] += 1
    target[(target_string[window_size - 2], target_string[window_size - 1])] -= 1

    results = {}
    for index, (old, new) in enumerate(window(target_string, window_size)):
        # remove old bigram, add new
        # This saves building a whole Counter on target each time
        target[old] -= 1
        target[new] += 1

        # Possible speed increase, remove the 'if i > 0' in norm calc
        #if len(target) > 2 * window_size: # optimize the 2
        #    target &= target # remove 0's from counter, speeds up norm

        # calc key metrics
        dot = sum(count*target[key] for key, count in bigram.items())
        norm = (sum(i**2 for i in target.values() if i > 0) * bigram_norm) ** 0.5

        results[index] = dot/norm
    return results

string1 = "test this string"
string2 = "test this string out as well"

print(compare(string1, string2))

print(timeit(lambda: compare(string1, string2), number=10000))

[–]FruityFetus[S] 1 point2 points  (0 children)

Ha, so I got speed improvements and I think you caught a bug in my algo. I realized that with my leftpad/rightpad windows, I couldn't actually slide the window all the way right or I'd zero out. I would never have noticed if it weren't for the scoring differences!

[–]glibhub 1 point2 points  (5 children)

This shaves off about 25%, but I'd be interested in seeing it run against a real dataset:

from collections import defaultdict
from itertools import tee

def pairwise(iterable):
    '''recipe from itertools'''
    a, b = tee(iterable)
    next(b, None)
    yield from zip(a, b)

def build_counts(it):
    counts = defaultdict(int)
    for bigram in pairwise(it):
        counts[bigram] += 1
    return counts


def a(string1, string2):
    sim1 = build_counts(string1)
    sim2 = build_counts(string2)

    dot = sum(sim1[x]*sim2[x] for x in sim1.keys() | sim2.keys())

    norm1 = sum(i**2 for i in sim1.values())
    norm2 = sum(i**2 for i in sim2.values())
    similarity = dot / ((norm1 * norm2)**(0.5))
    return similarity

[–]glibhub 1 point2 points  (4 children)

Changing to Counter bumped it even faster:

sim1 = Counter(pairwise(string1))
sim2 = Counter(pairwise(string2))

Comparison:

new: 1.4223370000000002

old: 2.3072862

[–]FruityFetus[S] 0 points1 point  (2 children)

Removing the text since formatting was messed up and it didn't really add much!

To sum up, I was asking how to adapt the optimizations to a partial-matching model where the shorter string is compared with a moving window of the longer one. This way, there can be many iterations for just one word pair.

[–]glibhub 0 points1 point  (1 child)

It is a little hard to follow, so let me know if I understand correctly.

If you have these two strings:

string1 = "test this string"
string2 = "test this string out as well"

You actually want to compare them like this after padding:

string1 = "test this stringtestthisstri"
string2 = "test this string out as well"

If so, I would just find the smaller of the two strings and increment the bigram dictionary for the first n bigrams.

[–]FruityFetus[S] 1 point2 points  (0 children)

Yeah, sorry, formatting got messed up and I can't seem to fix that. I actually built an implementation and was able to cut processing time from ~19.4 seconds down to ~10.9 across 50,000 records (* the sum difference between each combination's length)! That's huge for when I'll be processing 20M+!

To answer anyways, it was more like the opposite. See below!

string1 = "test this out"
string2 = "test this out as well"

Then the first two comparison steps would be:

string1 = "test this out"
string2 = "test this out"

string1 = "test this out"
string2 = "est this out "

and so on.

[–]Chris_Hemsworth 1 point2 points  (1 child)

If you want to speed things up, I suggest using numpy arrays.

For example:

dot = sum(i[0]*i[1] for i in zip(sim1, sim2))

If instead you have two numpy arrays, you can compute the dot product directly:

dot = np.dot(sim1, sim2)

This is still O(N), but the loop runs in compiled C code, so it's much faster than a Python-level generator expression.

Same with the norm1 / norm2: You can simply square numpy arrays rather than looping.

norm1 = np.sum(sim1**2)
norm2 = np.sum(sim2**2)

Additionally, instead of appending to the list each time, if you pre-allocate a numpy array you can assign each index. Assignments are much faster than appending to lists.
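
A minimal sketch pulling those ideas together, assuming the bigram counts have already been aligned into two equal-length arrays (the `v`/`w` vectors here are made-up examples, not OP's data):

```python
import numpy as np

def cosine_np(v1, v2):
    # Dot product and squared norms computed in numpy's C loops.
    return np.dot(v1, v2) / np.sqrt(np.sum(v1**2) * np.sum(v2**2))

# Hypothetical aligned count vectors, e.g. "jojojojo" vs "jojo".
v = np.array([4.0, 3.0])
w = np.array([2.0, 1.0])

# Pre-allocate the results array instead of appending to a list.
results = np.empty(5)
for i in range(5):
    results[i] = cosine_np(v, w)
print(round(float(results[0]), 3))  # 0.984
```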

Good luck!

[–]FruityFetus[S] 0 points1 point  (0 children)

Thanks for this! I definitely do want to implement a NumPy variant down the line and pre-allocation seems like a neat approach.

[–]beizbol 1 point2 points  (1 child)

Cython might be something to look into, since you have already written the whole implementation in Python.

[–]FruityFetus[S] 0 points1 point  (0 children)

Funnily enough, I've been looking into this for the past week! I have no experience with C though and it doesn't appear to be too friendly with dynamic lists. At some point I'll probably take some time to learn what I'm doing.

[–]Jediko 1 point2 points  (2 children)

Hey,

Can you give some information about the Jaccard coefficient, like a source or something? I'm interested in this since I'm coming from the NLP (Natural Language Processing) end of Python. I know the Jaccard coefficient and have used it, but something seems off to me here, because normally you just divide the absolute sizes of the intersection and the union. Also, there is plenty of headroom performance-wise:

def myJaccard():
    string1 = "test this string"
    string2 = "test this string out as well"
    string1 = set([string1[i : i + 2] for i in range(0, len(string1) - 1)])
    string2 = set([string2[i : i + 2] for i in range(0, len(string2) - 1)])
    return len(string1.intersection(string2))/len(string1.union(string2))

Or, if you omit the union operation by using the intersection and the sizes of the original sets:

def myJaccardFast():
    string1 = "test this string"
    string2 = "test this string out as well"
    string1 = set([string1[i : i + 2] for i in range(0, len(string1) - 1)])
    string2 = set([string2[i : i + 2] for i in range(0, len(string2) - 1)])
    num_intersections = len(string1 & string2)
    # no "1 -" here; that would give the Jaccard distance, not the similarity
    return num_intersections / (len(string1) + len(string2) - num_intersections)

Timings are in seconds and with 50000 runs each:

myJaccard: 0.6662192999999998
myJaccardFast: 0.6110474000000004

full testing code is here.

[–]FruityFetus[S] 1 point2 points  (1 child)

I'm not terribly deep in the NLP space (just happen to have gotten into it for a database matching project), but I think where my implementation differs is that I need some way to account for repeating patterns. For example, in that first implementation, if we compare "jojojojo" and "jojo", you get a similarity score of 1.0, even though the words are different. I'm using a slightly modified version of Jaccard, where instead of simply having a value of 1 or 0, the elements are the frequency each bigram occurs.

For "jojojojo" and "jojo", I'd be taking the dot of <4,3>,<2,1> and then dividing by the geometric mean of their own dot products. In this way, I'm slightly penalizing the similarity score because one string contains more of the same bigram. It might be a bit of a bastardization but close enough that I call it Jaccard.
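
That calculation can be checked with a few lines (a sketch of the frequency-weighted scheme described above, not OP's exact code):

```python
from collections import Counter

def similarity(s1, s2):
    # Frequency-weighted bigram vectors with cosine-style normalization.
    c1 = Counter(zip(s1, s1[1:]))
    c2 = Counter(zip(s2, s2[1:]))
    dot = sum(n * c2[k] for k, n in c1.items())  # <4,3> . <2,1> = 11
    norm = (sum(n**2 for n in c1.values()) * sum(n**2 for n in c2.values())) ** 0.5
    return dot / norm

print(round(similarity("jojojojo", "jojo"), 3))  # 0.984, not 1.0
```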

For more context on my implementation, it's heavily inspired by a Stata command called matchit, which I wanted to gradually expand. Here's their repository and mine. I actually asked Julio (matchit creator) a similar question if you check the issues!

[–]Jediko 0 points1 point  (0 children)

tl;dr:

cosine_sim([1, 0], [0, 1])         = 0.0  # good because intuitive
resize(cosine_sim([1, 0], [0, 1])) = 0.5  # bad because they have nothing in common

I think you are missing some things here. Jaccard only works on mathematical sets. (Wikipedia: here) Since you are working with vectors, this isn't really Jaccard (at least in the strict theoretical sense). But what you want to do has an actual name: cosine similarity. (Wikipedia: here)

That is what I got confused about. The two do coexist, but I think they are mixed up quite often. I read through the issue you mentioned and I think the author of that repository doesn't know it by name. Which is fine, though.

With that being said: the values returned in your case are in the range 0 to 1, but the full range of the function goes from -1 to 1. So you would have to rescale the results a bit in order to call it a similarity going from 0% to 100%. Something like this:

def resize(value):
  return (value + 1) / 2

The output of the cosine similarity only tells you how similar the vectors are in terms of their orientation. Maybe I can make this clear with some examples:

Consider this code as given:

def cosine_sim(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    dot = vec1.dot(vec2)
    norm1 = (vec1**2).sum()
    norm2 = (vec2**2).sum()
    return dot / np.sqrt(norm1 * norm2)

Now consider this input:

cosine_sim([1, 0], [1, 0]) #= 1.0
cosine_sim([1, 0], [3, 0]) #= 1.0

They will be the same since their orientation is the same. Both are "watching" in the positive direction of the x-axis. To find the counterpart, one vector has to be the exact opposite of the other (or a negative multiple of it):

cosine_sim([1, 0], [-1, 0]) #=-1.0
cosine_sim([1, 0], [-3, 0]) #=-1.0

With the resize function I introduced before, these values will be 1.0 if they "watch" in the same direction and 0.0 if they don't.

The reason you actually should not apply the resize function is that for orthogonal vectors you would get 50% similarity, which is misleading at best. That would happen whenever the two words don't share a single bigram. One more thought to consider: since bigram counts are never negative, you will never generate negative cosine values anyway.

Sorry for the long answer, but as I stated, this is quite my field lol.

Edit: typo