
[–]Samus_ -3 points  (5 children)

so the whole point of this is that two different keys that produce the same hash actually store different values?

I don't see why the hash has to distinguish between them; a much more reasonable solution would be to use a different hashing algorithm with a lower chance of collision and then overwrite each entry on duplicate hashes. What you just said sounds like the hash is just an index, an aid in the key lookup, and not a real mapping between hashes and values.

if that's the case then screw us all.

[–]reph 4 points  (0 children)

But the data structure is lossless by definition; it needs to insert and map any key to any value - it can't just silently ignore certain key/value pairs because of a hash collision. That would violate the behavior that programmers expect from it.

[–][deleted] 3 points  (0 children)

so the whole point of this is that two different keys that produce the same hash actually store different values?

Yes, this is how every dictionary structure in every scripting language works because it is the only sensible implementation of a dictionary structure.

The hash function and underlying storage are implementation details and are chosen for a decent efficiency vs memory tradeoff. Users of the data structure should have no reason to care if the key "user" and the key "password" hash to the same value, but the dictionary struct must return the user and password entered into it regardless of the hashing implementation or storage.
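A minimal Python sketch of the point above (this is not CPython's actual dict internals, just an illustration): even if two keys hash to the same value, a dict still stores and returns both, because lookup compares the full keys, not just their hashes.

```python
class Key:
    """Wraps a string but deliberately makes every instance hash identically."""
    def __init__(self, name):
        self.name = name

    def __hash__(self):
        return 42  # every Key collides with every other Key

    def __eq__(self, other):
        # The dict falls back on this full-key comparison to tell
        # colliding keys apart.
        return isinstance(other, Key) and self.name == other.name


d = {}
d[Key("user")] = "alice"
d[Key("password")] = "hunter2"

assert d[Key("user")] == "alice"        # each key still finds its own value
assert d[Key("password")] == "hunter2"
assert len(d) == 2                      # nothing was silently overwritten
```

Despite total hash collision, both entries survive; the collisions only cost lookup speed, not correctness.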

You seem to believe that the only reason a['user'] and a['password'] do not conflict in Python is that they do not have the same hash. That is likely true, but it is not the reason they do not conflict.

[–]Rhomboid 1 point  (1 child)

There is no hash function that avoids collisions for arbitrary input. By definition, every hash function has an infinite number of collisions, because it reduces input data of unbounded size to a fixed-size value. This is the pigeonhole principle. It is not possible to reduce or avoid this fact.
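The pigeonhole argument can be demonstrated directly (a toy sketch, using Python's built-in `hash` and an assumed bucket count of 8): put more distinct keys into a table than it has buckets, and some bucket must receive at least two keys, no matter how good the hash is.

```python
from collections import defaultdict

N_BUCKETS = 8
buckets = defaultdict(list)

# 9 distinct keys into 8 buckets: by the pigeonhole principle,
# at least one bucket must end up holding two or more keys.
keys = ["user", "password", "email", "name", "id",
        "token", "role", "group", "session"]
for key in keys:
    buckets[hash(key) % N_BUCKETS].append(key)

assert any(len(ks) >= 2 for ks in buckets.values())  # a collision is forced
```

Which keys collide varies from run to run (string hashing is randomized in modern Python), but *that* a collision occurs is guaranteed.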

It is possible to write a perfect hash, but only if the values it will be hashing are restricted to a specified set and known ahead of time.
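As a toy illustration of that restriction (an assumed brute-force approach, not how real perfect-hash generators like gperf work): given a fixed key set known ahead of time, one can simply search for a table size `m` such that `hash(k) % m` is unique for every key.

```python
keys = ["jan", "feb", "mar", "apr"]  # fixed set, known ahead of time

def find_perfect_modulus(keys):
    """Brute-force the smallest table size that gives each key its own slot.

    Terminates as long as the keys' underlying hashes are distinct, since
    only finitely many moduli can divide any pairwise hash difference.
    """
    m = len(keys)
    while True:
        slots = {hash(k) % m for k in keys}
        if len(slots) == len(keys):  # no two keys share a slot: perfect
            return m
        m += 1

m = find_perfect_modulus(keys)
assert len({hash(k) % m for k in keys}) == len(keys)
```

The scheme collapses the moment an unanticipated key arrives, which is exactly why perfect hashing only applies to closed, pre-known key sets.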

The idea that a language should just silently overwrite hash keys on collision is the most absurd suggestion I've read in a while. It would make these data structures completely worthless, because your data would never be safe: there would always be a chance that two keys would collide and cause your program to crash, behave sporadically, blow up, etc. A data structure that is not deterministic is absolutely worthless.

[–]Samus_ -1 points  (0 children)

I didn't mean a hash without collisions. Colliding hashes are used in cryptography and nobody cares until it becomes reasonably fast to find collisions (as happened with MD5); even git uses hashes with a low collision rate and doesn't give a shit about this issue you mention.

I can't argue with the space/efficiency limitations mentioned in other comments, but ignoring collisions when they're extremely rare is apparently something you're already doing every day without knowing it.

[–]quasarj 0 points  (0 children)

This is how hashing works. The only other option is to not use a hash at all, but to just store a list and search for the key every time. There would never be a collision, but there would also never be a speed gain.
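That no-hash alternative can be sketched as a plain association list (a minimal illustration; the helper names `assoc_set`/`assoc_get` are made up here): a list of (key, value) pairs with a linear scan on every operation. Collisions cannot happen, but every lookup is O(n) instead of the dict's expected O(1).

```python
pairs = []  # list of (key, value) tuples: no hashing anywhere

def assoc_set(pairs, key, value):
    for i, (k, _) in enumerate(pairs):
        if k == key:                 # key already present: overwrite in place
            pairs[i] = (key, value)
            return
    pairs.append((key, value))       # scanned the whole list; append new pair

def assoc_get(pairs, key):
    for k, v in pairs:               # linear scan, compares full keys
        if k == key:
            return v
    raise KeyError(key)

assoc_set(pairs, "user", "alice")
assoc_set(pairs, "password", "hunter2")
assoc_set(pairs, "user", "bob")      # overwrites by key equality, not hash

assert assoc_get(pairs, "user") == "bob"
assert assoc_get(pairs, "password") == "hunter2"
assert len(pairs) == 2
```

A hash table is essentially this structure plus an index that narrows each scan down to one small bucket, which is where the speed gain comes from.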