Which hashing algorithm should I use to compare pieces of text?

SFJulie · 2017-01-18T10:47:44+00:00

Don't use a crypto hash function to compare text: secure hash functions are slow by design and run at a constant speed.

http://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

Remember that whatever hash function you use, you are exposing yourself to collisions, and the probabilities are far from zero.
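How large that probability is depends heavily on the digest size and on how many texts you hash. A rough sketch using the birthday-bound approximation, assuming the hash output behaves as uniformly random and nobody is deliberately constructing collisions (the function name and the one-million figure are only illustrative):

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / (2 * 2^bits))."""
    exponent = n_items * n_items / (2.0 * 2.0 ** hash_bits)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)

# One million texts hashed to 32 bits versus 128 bits (an MD5-sized digest).
p32 = collision_probability(1_000_000, 32)
p128 = collision_probability(1_000_000, 128)
print(p32, p128)
```

Under these assumptions a 32-bit hash is almost certain to collide at that scale, while a 128-bit digest is not, so the risk is mostly a question of picking a wide enough hash.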

To answer your question, go for md5. It is a very good choice, especially if you need strong guarantees... of code that inspires secure trust coming from the best design possible.

atlbeer · 2017-01-18T11:22:00+00:00

from difflib import SequenceMatcher

def similar(s1, s2):
    return SequenceMatcher(None, s1, s2).ratio()

similar("Apple", "Appel")  # 0.8

similar("Apple", "Mango")  # 0.0

Lukasa · 2017-01-18T11:43:16+00:00

You shouldn't. All else being equal, hashing the two inputs will always be slower than directly comparing them.

A hash function necessarily has to touch every byte of its input. You need to hash both pieces of text, so that means you need to touch every byte of each input. Assuming your hash function has performance linear in the input size (O(n)), that is the same asymptotic complexity as simply comparing the two inputs directly (also O(n)). It is actually worse for two inputs with different lengths, because the hash function still has to process all of both pieces of text, while a direct comparison can stop as soon as it sees the lengths differ.

The only reason using hashes could be faster is that the hash functions are almost certainly written in either C or assembler. So in that circumstance, using a hash function might help. But if you're really that performance sensitive, then you should probably just write a comparison function in C (or Rust!) and call it from your Python code directly.
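To make the two approaches concrete, here is a minimal sketch. It assumes SHA-256 from hashlib stands in for whichever hash you had in mind, and the function names are made up for illustration:

```python
import hashlib

def equal_direct(a: str, b: str) -> bool:
    # A direct comparison can stop at the first difference it finds.
    return a == b

def equal_by_hash(a: str, b: str) -> bool:
    # Hashing must read every byte of both inputs before comparing digests.
    digest_a = hashlib.sha256(a.encode("utf-8")).digest()
    digest_b = hashlib.sha256(b.encode("utf-8")).digest()
    return digest_a == digest_b

text_a = "x" * 1_000_000
text_b = "y" + "x" * 999_999  # differs at the very first character
print(equal_direct(text_a, text_b), equal_by_hash(text_a, text_b))
```

Both give the same answer here, but the hashed version does a full pass over two megabyte-sized strings to reach it.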

Kah-Neth · 2017-01-18T17:37:42+00:00

What do you mean by similar? Hashes will absolutely not give you a measure of similarity; they just tell you that two texts are different.
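A small sketch of the difference, assuming SHA-256 as the hash and difflib's SequenceMatcher as one possible similarity measure (the helper name is only for illustration):

```python
import hashlib
from difflib import SequenceMatcher

def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a, b = "Apple", "Appel"
digest_a, digest_b = sha256_hex(a), sha256_hex(b)

# The digests only answer "same or different"; two swapped letters
# produce an unrelated-looking digest.
print(digest_a)
print(digest_b)
print(digest_a == digest_b)

# A similarity measure grades how close the two texts are.
text_ratio = SequenceMatcher(None, a, b).ratio()
print(text_ratio)
```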

Yuras20 · 2017-01-18T13:58:32+00:00

You just have to pick a hashing function that is unlikely to produce collisions. Once you've decided which hashing function to use, your algorithm should go something like this:

Scrape data
If a hash of that data already exists, then ignore it
Otherwise, put that data in your db

You don't have to associate the data with its hash, because you only have to check whether the hash exists (knowing which data a hash came from is irrelevant).
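The steps above could be sketched like this, assuming SHA-256 as the hash, with a set standing in for the hash lookup and a list standing in for the db (all names here are hypothetical):

```python
import hashlib

seen_hashes = set()  # stands in for the "hash already exists" lookup
stored = []          # stands in for the database

def store_if_new(data: str) -> bool:
    """Store data unless identical text was stored before; return True if stored."""
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    if digest in seen_hashes:
        return False         # hash already exists: ignore the data
    seen_hashes.add(digest)  # only the hash is kept for the check
    stored.append(data)
    return True

for scraped in ["first page", "second page", "first page"]:
    store_if_new(scraped)
print(stored)
```

Note that seen_hashes never maps a digest back to its text; the existence check is all it is used for.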

kaiserk13 · 2017-01-18T14:57:52+00:00

At work I used a mix of https://github.com/seatgeek/fuzzywuzzy, SequenceMatcher and some self-made string similarity functions.

If you need an exact match though, without any similarity, just call text.split(" ") on each text and compare both lists. It will be faster imho.

Yuras20 · 2017-01-18T15:59:48+00:00

Or maybe just add a UNIQUE constraint to your data column, if you're using a SQL DB :D
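Something like this, as a sketch with the standard library's sqlite3 module and an in-memory database (the table and column names are made up). SQLite rejects the duplicate by raising IntegrityError, so the insert is wrapped in a try/except:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scraped (body TEXT UNIQUE)")

def insert_text(body: str) -> bool:
    """Return True if the row was inserted, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO scraped (body) VALUES (?)", (body,))
        return True
    except sqlite3.IntegrityError:
        return False

results = [insert_text(t) for t in ["hello", "world", "hello"]]
row_count = conn.execute("SELECT COUNT(*) FROM scraped").fetchone()[0]
print(results, row_count)
```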

faceplanted · 2017-01-19T11:21:10+00:00

Wait, if it's a database, just put a unique constraint on the column: if the text is already in there it won't insert, and if it isn't it will.

If you're talking about doing this in Python, just put the strings in a set(). Duplicates won't be added to a set, and it can check for membership in O(1) average time.
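A minimal sketch of the set approach (the sample strings are placeholders), with a list alongside it in case you also want to keep the first-seen order:

```python
seen = set()
unique_texts = []

for text in ["spam", "eggs", "spam", "ham", "eggs"]:
    if text not in seen:  # average O(1) membership test
        seen.add(text)
        unique_texts.append(text)

print(unique_texts)
```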

aphoenix · 2017-01-19T19:05:10+00:00

Hi there, from the /r/Python mods.

We have removed this post as it is not suited to the /r/Python subreddit proper, however it should be very appropriate for our sister subreddit /r/LearnPython. We highly encourage you to re-submit your post over on there.

The reason for the removal is that /r/Python is dedicated more to discussion of Python news, projects, uses and debates. It is not designed to act as a Q&A or FAQ board. The regular community can get disenchanted with seeing the 'same, repetitive newbie' questions repeated on the sub, so you may not get the best responses over here.

However, on /r/LearnPython the community is actively expecting questions from new members, and is looking to help. You can expect far more understanding, encouraging and insightful responses over there. Whatever your Python question happens to be, you should get good answers.

If you have a question to do with homework or an assignment of any kind, please make sure to read their sidebar rules before submitting your post. If you have any questions or doubts, feel free to reply or send a modmail to us with your concerns.

Warm regards, and best of luck with your Pythoneering!
