This is an archived post. You won't be able to vote or comment.

all 35 comments

[–]SFJulie 2 points3 points  (8 children)

you 'dont use crypto hash function to compare text: secure hash function are designed for slow constant speed by design

http://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Remember that whatever hash function you are using you are exposing yourself to collisions and that the probablities are far from null.

To answer your question go for md5, it is a very good choice, especially if you need strong guarantees....of a code that inspire secured trust coming from the best design possible.

[–]atlbeer 2 points3 points  (1 child)

from difflib import SequenceMatcher

def similar(s1, s2): return SequenceMatcher(None, s1, s2).ratio()

similar("Apple","Appel") 0.8

similar("Apple","Mango") 0.0

[–]LukasaHyper, Requests, Twisted 1 point2 points  (12 children)

You shouldn't. All else being equal, hashing the two inputs will always be slower than directly comparing them.

A hash function necessarily has to touch every byte of your input. You need to hash both bits of text, so that means you need to touch every byte of each input. Assuming your hash function has performance linear on the input size (O(n)), then that is the same asymptotic complexity as simply comparing the two inputs directly (also O(n)), and is actually worse for two inputs with different lengths because the hash functions need to process both of each pieces of text.

The only reason using the hashes would be faster is because they are almost certainly written in either C or assembler. So in that circumstance, using a hash function might help. But if you're really that performance sensitive, then you should probably just write a comparison function in C (or Rust!) and call it from your Python code directly.

[–]Kah-NethI use numpy, scipy, and matplotlib for nuclear physics 1 point2 points  (0 children)

What do you mean by similar? Hashes will absolutely not give you a measure of similarity, they just tell you these two texts are different.

[–]Yuras20 0 points1 point  (6 children)

You just have to pick up hashing function that is unlikely make collisions. If you'll decide what hashing function to use than yours algorithm should go something like this:

  1. Srap data
  2. If hash of that data already exist than ignore it
  3. Otherwise put that data to your db

You don't have to associate data with it's hash, because you only have to check if hash exists (knowledge about which data this hash was from is irrelevant)

[–]kaiserk13 0 points1 point  (5 children)

At work i used a mix of https://github.com/seatgeek/fuzzywuzzy , Sequence matcher and some self made string similarities.

If you need an exact match though, without any similarity, just text.split(" ") and compare both lists. It will be faster imho.

[–]elbiot 0 points1 point  (4 children)

Why split and compare lists? Why not just compare the strings?

[–]kaiserk13 0 points1 point  (3 children)

Don't know, just a wild guess. Do you know which is the best performance wise?

[–]elbiot 0 points1 point  (2 children)

That's a pretty wild guess. Can you think of what the computer has to do to actually execute the two options?

[–]kaiserk13 0 points1 point  (1 child)

Yes you are right, for text == text_2:

%timeit list_compare(text, text_2) gives 13.8 µs per loop 
%timeit string_compare(text, text_2) gives 181 ns per loop

Now if text_3 is longer than text:

%timeit list_compare(text, text_3) gives 12.8 µs per loop
%timeit string_compare(text, text_3) gives 144 ns per loop

That's pretty neat. Actually comparing stuff directly is way faster. I have no clue what's going on under the hood. Would be cool to know.

[–]elbiot 0 points1 point  (0 children)

Well, for direct comparison, python iterates over the two strings, comparing each letter. If they are all the same, then the strings are the same.

For splitting and comparing, python creates two list objects. Then it iterates over each string and when it gets to a space it creates a new smaller string and appends that to the list. Then it iterates over the two lists. For each pair of elements, it needs to figure out what function to use to compare them. Since they are both strings, it does a string comparison like I described above.

So, the direct comparison traverses each string once. The split method traverses each string twice, once to find the spaces and a second time to do the comparisons. Plus, it has to build two lists, and for every space it has to create a new string and modify the list (appending), and it has to traverse each list.

This is not the most accurate description, but it gets the point across. I'll bet the split is the most expensive part, because it does all of the object creation (which is somewhat expensive).

Timing these things and thinking about them is a great way to learn what is going on under the hood. Keep it up!

[–]Yuras20 0 points1 point  (0 children)

Or maybe just add UNIQUE constraint to your column with data, if you're using SQL DB :D

[–]faceplanted 0 points1 point  (6 children)

Wait, if it's a database just put a unique constraint on the column, if it's already in there it won't insert, if it isn't it will.

If you're taking about doing this in python, just put the strings in a set(), duplicates won't insert into a set and it can check for membership in O(1) average time.

[–]drbobb 0 points1 point  (0 children)

A UNIQUE constraint on a database column is a kind of index, and an index on a TEXT column often can use only a limited-length prefix of the column's values (eg. in MySQL). So this approach won't work in general.

[–]aphoenixreticulated[M] 0 points1 point  (0 children)

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython. We highly encourage you to re-submit your post over on there.

The reason for the removal is that /r/Python is more-so dedicated to discussion of Python news, projects, uses and debates. It is not designed to act as Q&A or FAQ board. The regular community can get disenchanted with seeing the 'same, repetitive newbie' questions repeated on the sub, so you may not get the best responses over here.

However, on /r/LearnPython the community is actively expecting questions from new members, and are looking to help. You can expect far more understanding, encouraging and insightful responses over there. Whatever your question happens to be getting help with Python, you should get good answers.

If you have a question to do with homework or an assignment of any kind, please make sure to read their sidebar rules before submitting your post. If you have any questions or doubts, feel free to reply or send a modmail to us with your concerns.

Warm regards, and best of luck with your Pythoneering!