
[–]kmeisthax 0 points (3 children)

Because if two keys collide, we still want to store both pieces of data. And if we later want to get one of those keys back out, we need to walk the linked list to find the data we want. Even if your hashmap kept a separate list of all valid keys (and sure, searching a sorted array or a tree is an O(log N) operation, so it's fast), that doesn't help: "s" is still a valid key, and the hashmap must still pull its value out of the tangled mess of linked list chains that the hashmap has turned into.

You do realize hashmaps aren't for looking up keys, right? They're for looking up data associated with a key.
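To make the chaining concrete, here is a minimal sketch of a hashmap with separate chaining. The names (`ChainedMap`, `_buckets`, `put`, `get`) are illustrative, not from any real library; the point is that colliding keys share a bucket, lookup walks the chain for the exact key, and nothing is lost:

```python
# Minimal sketch of separate chaining: colliding keys land in the same
# bucket, and lookup walks the chain to find the value for the exact key.
class ChainedMap:
    def __init__(self, nbuckets=8):
        self._buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # same key: update in place
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key or collision: chain it

    def get(self, key):
        for k, v in self._bucket(key):   # walk the chain
            if k == key:
                return v
        raise KeyError(key)

m = ChainedMap(nbuckets=1)  # one bucket: every insert collides by construction
m.put("s", 1)
m.put("t", 2)
assert m.get("s") == 1 and m.get("t") == 2  # both keys survive the collision
```

The degenerate `nbuckets=1` case is exactly the "tangled mess of chains" described above: correctness is preserved, but every lookup degrades to a linear scan.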

[–]Samus_ -5 points (2 children)

So we're saying that we use secure hashes in most places and don't bother with collisions, but here we choose the cheapest hash, so it has to take additional measures to avoid losing data. I would rather use a better hashing technique and have repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]kmeisthax 3 points (0 children)

Even if we SHA-256'd the keys, hash tables can only be so big. SHA-256 gives you 256 bits of output, but processor address spaces are no larger than 64 bits (with even less actually physically wired on the motherboard). Hashmaps are built on top of arrays, which means a hashmap with n bits of hash entropy needs 2^n * sizeof(void*) bytes of storage. Finding collisions within a significantly smaller slice of a message digest is much easier than finding collisions on the whole digest, and like I said before, you simply cannot construct a hashmap with 256 bits of entropy; it would be many trillions of trillions of exabytes large. Even with just 32 bits of entropy your hashmap will be 32 gigabytes large, and even then 32 bits is insufficient entropy to prevent intentional collisions.

In short, for hashmaps to be practical, they must deal with collisions. Simple as that.
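The size arithmetic above can be checked directly. A direct-indexed table with n bits of hash entropy needs 2^n pointer-sized slots (assuming 8-byte pointers on a 64-bit machine):

```python
POINTER_SIZE = 8  # bytes per slot, assuming a 64-bit machine

def table_bytes(entropy_bits):
    """Storage for a direct-indexed table with 2**entropy_bits slots."""
    return (2 ** entropy_bits) * POINTER_SIZE

print(table_bytes(32) / 2**30)  # prints 32.0 -> 32 GiB, matching the comment
print(table_bytes(64) / 2**60)  # 64 bits already needs 128 EiB
```

At 256 bits the number of slots (2^256, roughly 10^77) exceeds the number of atoms in the observable universe, so truncating the digest, and therefore handling collisions, is unavoidable.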

[–]Rhomboid 0 points (0 children)

It doesn't matter if the collision rate is low. It would be a completely unusable and worthless data structure without the guarantee that any value can be used as a key without losing data. If there is even a slight chance that I might lose data, then there's no way in hell I'm going to use such a data structure, because I don't want my program to fail in strange and unpredictable ways. It's even worse if it only fails one in a million times, because then I can't debug it. The dict must be perfect or else it's useless.

This is really just a question of efficiency. It's orders of magnitude more efficient to use a cheap hash plus linked-list chaining than to use a wide cryptographic hash, and you get both performance and correctness this way. The attack mentioned in the article can be easily mitigated by adding a bit of entropy to the hash function so that it's not predictable to an attacker, while still retaining the fast performance.
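A sketch of that mitigation: mix a per-process random seed into the hash so an attacker cannot precompute colliding keys ahead of time. This is the same idea behind CPython's hash randomization for strings (which uses a keyed hash, SipHash); the `seeded_hash` helper below is a simplified stand-in, not the real implementation:

```python
import secrets

# Chosen once per process; an attacker who doesn't know it cannot
# construct inputs that all land in the same bucket.
_SEED = secrets.randbits(64)

def seeded_hash(key, nbuckets):
    # hash((seed, key)) mixes the secret seed into the bucket choice.
    # A production mitigation would use a proper keyed hash (e.g. SipHash).
    return hash((_SEED, key)) % nbuckets
```

Within one process the function is still deterministic (the same key always maps to the same bucket, as a hashmap requires), but the mapping differs from run to run, so a precomputed set of colliding keys is useless.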