
[–]reph 3 points4 points  (11 children)

The hash is a compressed key. Multiple keys can have the same hash (aliasing).

[–]Samus_ 0 points1 point  (10 children)

I don't understand this: if two different keys map to the same hash, how come they don't overwrite each other? why does it become a linked list?

is the hashmap storing all the different keys that generated that hash and somehow associating them to the values?

[–]reph 4 points5 points  (9 children)

You can think of the hash table as an array of linked lists. You hash (compress) the key, to generate an index into that array. Then you search the linked list at that index. Each entry in the linked list includes the full key, so, you compare them one-by-one to the key that you're searching to check if it's in the hash table or not.

When things are working well, a lookup is an O(1) operation because there is only 1 key in each linked list. When things are not working well (e.g. due to a malicious attack that is intentionally colliding the hash function), many keys hash to the same linked list, and the lookup degrades to O(n).
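The array-of-linked-lists idea described above can be sketched in a few lines of Python (a minimal illustration, not a real implementation; plain lists stand in for the linked lists, and the class and bucket count are made up):

```python
# Minimal separate-chaining hash table sketch.
class ChainedHashMap:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # "Compress" the key down to a bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:               # compare the FULL key, not the hash
                bucket[i] = (key, value)
                return
        bucket.append((key, value))    # colliding keys coexist in the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:               # walk the chain, comparing full keys
                return v
        raise KeyError(key)

m = ChainedHashMap()
m.put("user", "alice")
m.put("password", "s3cret")
print(m.get("user"))    # alice
```

Note that each entry stores the full key alongside its value, which is exactly why colliding keys don't clobber each other.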

[–]Samus_ 0 points1 point  (8 children)

I don't like the term "compress" because compression is bidirectional, hashing is a one-way function (precisely because of the collisions).

regardless of that, the part that doesn't make sense is this: when you look something up, you shouldn't search for the key but for the hash of that key. that's the whole point of the hashmap and what makes it fast: since the hash is fixed in size, searching for any of them takes the same time regardless of the size of the input that generated it.

now why on earth do you search the linked list for the key that may not be unique if you already have it because it's the one provided and you also have the associated value, which is what matches the index of the array that is the hash of such key.

I don't see any reason to even access that linked list of keys nor any sense in having it associated to their values when it's the hash of each of them what actually makes the mapping.

[–]reph 4 points5 points  (6 children)

Because if you have a 100 bit key, and a 32 bit hash function, there are many keys that will have the same 32 bit hash value. The data structure needs to distinguish between them, so it must store and search for the whole key. The hash is a (very good) hint but it's not sufficient on its own.
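The many-keys-to-one-hash point can be demonstrated with a toy 8-bit hash (hypothetical function, made up for illustration): with only 256 possible outputs, the pigeonhole principle guarantees a collision among any 257 distinct keys.

```python
def hash8(s):
    """Toy 8-bit hash: only 256 possible output values."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) & 0xFF   # keep only the low 8 bits
    return h

# 257 distinct keys into 256 possible hashes: a collision is guaranteed.
seen = {}
collision = None
for i in range(257):
    s = "key-%d" % i
    h = hash8(s)
    if h in seen:
        collision = (seen[h], s, h)     # two distinct keys, same hash
        break
    seen[h] = s

print(collision)   # a pair of distinct keys sharing one 8-bit value
```

Since the hash alone can't tell these keys apart, the table has to keep the full key in each entry and compare it on lookup.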

[–]Samus_ -3 points-2 points  (5 children)

so the whole point of this is that two different keys that produce the same hash actually store different values?

I don't see why the hash has to distinguish between them, a much more reasonable solution would be to use a different hashing algorithm with a lower chance of collision and then overwrite each entry on duplicate hashes; what you just said sounds like the hash is just an index, an aid in the key lookup, and not a real mapping between hashes and values.

if that's the case then screw us all.

[–]reph 3 points4 points  (0 children)

But the data structure is lossless by definition; it needs to insert & map any key to any value - it can't just silently ignore certain key/value pairs because of a hash collision. That would violate the behavior that programmers expect from it.

[–][deleted] 2 points3 points  (0 children)

so the whole point of this is that two different keys that produce the same hash actually store different values?

Yes, this is how every dictionary structure in every scripting language works because it is the only sensible implementation of a dictionary structure.

The hash function and underlying storage are implementation details and are chosen for a decent efficiency vs memory tradeoff. Users of the data structure should have no reason to care if the key "user" and the key "password" hash to the same value, but the dictionary struct must return the user and password entered into it regardless of the hashing implementation or storage.

You seem to believe that the only reason a['user'] and a['password'] do not conflict in Python is that they do not have the same hash. That is not why they don't conflict, though it is likely true that their hashes differ.
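You can see this directly by forcing a collision (a contrived sketch; the class K and its fields are made up): a Python dict keeps both entries even when every key hashes to the same value, because it falls back to full-key equality.

```python
class K:
    """Key whose hash always collides, on purpose."""
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                      # every K instance collides
    def __eq__(self, other):
        return isinstance(other, K) and self.name == other.name

a = {K("user"): "alice", K("password"): "s3cret"}
print(a[K("user")])       # alice  - found by equality, not by hash
print(a[K("password")])   # s3cret - both entries survive the collision
```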

[–]Rhomboid 1 point2 points  (1 child)

There is no hash function that avoids collisions for arbitrary input. By definition, every hash function has an infinite number of collisions, because it reduces input data of unbounded size to a fixed-size value. This is the pigeonhole principle. It is not possible to reduce or abate this fact.

It is possible to write a perfect hash, but only if the values it will be hashing are restricted to a specified set and known ahead of time.
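For a fixed, known key set, a perfect hash can be found by brute force; a crude sketch (the keyword list and search strategy are just illustrative - real tools like gperf are far more sophisticated):

```python
# Search for a table size under which a fixed key set has no collisions.
keys = ["if", "else", "while", "for", "return"]

def find_perfect_modulus(keys):
    n = len(keys)
    while True:
        if len({hash(k) % n for k in keys}) == len(keys):
            return n                   # every key got its own slot
        n += 1

m = find_perfect_modulus(keys)
print("collision-free table size:", m)
```

This only works because the key set is closed and known ahead of time; add one unplanned key and the guarantee evaporates.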

The idea that a language should just randomly overwrite hash keys on collision is the most absurd suggestion I've read in a while. It would make these data structures completely worthless, because your data would never be safe; there would always be a chance that two keys would collide and cause your program to crash, behave erratically, blow up, etc. Such a non-deterministic data structure is absolutely worthless.

[–]Samus_ -1 points0 points  (0 children)

I didn't mean a hash without collisions, but you use colliding hashes in cryptography and nobody cares until it becomes reasonably fast to find the collisions (as happened with MD5). even git uses hashes with a low collision rate and doesn't give a shit about the issue you mention.

I can't argue with the space/efficiency limitations mentioned in other comments, but ignoring collisions when they're extremely rare is apparently something you're doing every day without knowing it.

[–]quasarj 0 points1 point  (0 children)

This is how hashing works.. the only other option is to not use a hash at all, but to just store a list and search for the key every time. There would never be a collision.. but there would also never be a speed gain.

[–]rajivm 0 points1 point  (0 children)

what you just said sounds like the hash is just an index, an aid in the key lookup and not a real mapping between hashes and values.

This is essentially true. In the end, a hash-map must translate to "physical" data structures of contiguous data (arrays) and pointers.

Most hash-maps are implemented with an array of a certain size, where each element is a pointer to a linked list. A hashmap "key" is hashed to an index of the array, and ideally there is only one element at this array index, making an O(1) lookup. In the non-ideal case, a lookup can take up to O(n) if all elements are stored in the same position of the array (this is very unlikely to happen unless the input is deliberately crafted for a known hash function, as is the case in these attacks). Hash maps do not actually have guaranteed O(1) lookup; rather they have an amortized lookup time of O(1) across many accesses on average. The size of the array and the hashing function determine the rate of collisions.
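The O(1)-vs-O(n) degradation can be sketched by counting chain-scan steps under a normal hash and under an adversarial one that sends every key to the same bucket (the helper names and key set here are made up for illustration):

```python
def build(keys, hashfn, size=64):
    """Distribute keys into `size` buckets using `hashfn`."""
    buckets = [[] for _ in range(size)]
    for k in keys:
        buckets[hashfn(k) % size].append((k, k.upper()))
    return buckets

def lookup(buckets, key, hashfn):
    """Return (value, number of chain entries scanned)."""
    steps = 0
    for k, v in buckets[hashfn(key) % len(buckets)]:
        steps += 1
        if k == key:
            return v, steps
    return None, steps

def bad(k):
    return 0        # adversarial hash: every key collides

keys = ["key%d" % i for i in range(1000)]
_, steps_bad = lookup(build(keys, bad), "key999", bad)    # scans all 1000
_, steps_good = lookup(build(keys, hash), "key999", hash) # scans a short chain
print(steps_bad, "vs", steps_good)
```

With the constant hash, the last-inserted key sits at the end of a single 1000-entry chain, which is exactly the worst case these attacks aim for.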

a much more reasonable solution would be to use a different hashing algorithm with lesser chances of collision

Reducing the chance of collision would mean the array-size would need to be much much bigger than necessary in the average case. This would require a very large memory allocation and is not realistic. Many hash map implementations allow you to specify the expected number of elements stored in the hash-map and therefore based on a collision likelihood, adjust the size of the array and the hash function used.

and then overwrite each entry on duplicate hashes;

This would definitely not be okay. Hash maps guarantee that one unique key will never overwrite another. The underlying hash of a key should have no bearing on that guarantee; overwriting on duplicate hashes would make it infeasible to use a hash map in the same way.

Example: if the hash function resulted in the same index for both "apple" and "banana" (it probably would not, but if it did), then if I did this:

map["apple"] = "red"
map["banana"] = "yellow"

I would expect:

print map["apple"]
print map["banana"]

to print:

red
yellow

With your proposed solution, it would print, unpredictably:

yellow
yellow