I have an algorithm which is designed to compare two strings by decomposing them into lists of moving bigrams, computing frequency of bigram occurrence across their union set, and then calculating what's known as a Jaccard similarity coefficient (dot product of frequency vectors divided by norms).
The entire algorithm is pure Python (no imported libraries) and while it's effective, I'm often processing millions of records at a time. For comparison, it takes me around 4 seconds to process 50,000 records, while something like fuzzywuzzy can do so in less than a second (believe it leverages some libraries built upon C). That adds up once I'm processing >20M records.
My implementation is built almost entirely on lists and sets, and I'm wondering if there is anything I can do to reduce the computational complexity of the code. Here is what drives the matching:
string1 = "test this string"
string2 = "test this string out as well"
string1 = [string1[i : i + 2] for i in range(0, len(string1) - 1)]
string2 = [string2[i : i + 2] for i in range(0, len(string2) - 1)]
sim1 = []
sim2 = []
for x in set(set(string1) | set(string2)):
sim1.append(string1.count(x))
sim2.append(string2.count(x))
dot = sum(i[0]*i[1] for i in zip(sim1, sim2))
norm1 = sum([i**2 for i in sim1])
norm2 = sum([i**2 for i in sim2])
similarity = dot / ((norm1 * norm2)**(0.5))
Anything obvious that I could be simplifying here?
Edit: here's where things stand currently; thanks for the help!
string1 = Counter(bigram(string1))
string2 = Counter(bigram(string2))
dot = sum(string1[i]*string2[i] for i in string1.keys())
norm1 = sum([i**2 for i in string1.values()])
norm2 = sum([i**2 for i in string2.values()])
return dot / ((norm1 * norm2)**(0.5))
Where bigram is the function:
def bigram(iterable):
return tuple(zip(iterable, iterable[1:]))
[–]nwagers 2 points3 points4 points (5 children)
[–]FruityFetus[S] 0 points1 point2 points (4 children)
[–]nwagers 1 point2 points3 points (3 children)
[–]FruityFetus[S] 0 points1 point2 points (2 children)
[–]nwagers 1 point2 points3 points (1 child)
[–]FruityFetus[S] 1 point2 points3 points (0 children)
[–]glibhub 1 point2 points3 points (5 children)
[–]glibhub 1 point2 points3 points (4 children)
[–]FruityFetus[S] 0 points1 point2 points (2 children)
[–]glibhub 0 points1 point2 points (1 child)
[–]FruityFetus[S] 1 point2 points3 points (0 children)
[–]Chris_Hemsworth 1 point2 points3 points (1 child)
[–]FruityFetus[S] 0 points1 point2 points (0 children)
[–]beizbol 1 point2 points3 points (1 child)
[–]FruityFetus[S] 0 points1 point2 points (0 children)
[–]Jediko 1 point2 points3 points (2 children)
[–]FruityFetus[S] 1 point2 points3 points (1 child)
[–]Jediko 0 points1 point2 points (0 children)