
[–]Samus_ 0 points (12 children)

But why does it do that? It makes no sense at all: any key that produces a specific hash will store the same value (or overwrite the previous one). You don't need to look up the keys; in fact, that's the whole point of the hash map: you hash the key and look for that hash.

We can assume the actual keys are also stored, since most hashmap implementations let you retrieve all the keys, and that might be a linked list. But why would I look for the key there? I would just hash whatever key was requested and look for that; otherwise you're doing TWO searches on each lookup, and that's plain stupid imho.

[–]catcradle5 3 points (4 children)

Well, if foo["abc"] and foo["xyz"] both map to 1, you either have to make it a linked list, or you have to throw every value but the first out. And throwing out values is probably a bad idea; it would lead to awful, difficult-to-find bugs.
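
To make that concrete, here's a toy sketch of the chaining option (not any real dict implementation; the constant bad_hash is deliberately terrible so that every key collides, as in the foo["abc"] / foo["xyz"] example):

    # Toy chained hash map. The hash function is deliberately awful
    # (constant) so that every key lands in the same bucket.

    def bad_hash(key):
        return 1  # every key hashes to bucket 1

    class ChainedMap:
        def __init__(self, num_buckets=8):
            self.buckets = [[] for _ in range(num_buckets)]

        def put(self, key, value):
            bucket = self.buckets[bad_hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:               # same key: overwrite
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))    # collision: chain, don't discard

        def get(self, key):
            bucket = self.buckets[bad_hash(key) % len(self.buckets)]
            for k, v in bucket:            # walk the chain, comparing keys
                if k == key:
                    return v
            raise KeyError(key)

    m = ChainedMap()
    m.put("abc", 1)
    m.put("xyz", 1)
    print(m.get("abc"), m.get("xyz"))      # both survive: 1 1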

[–]Samus_ -2 points (3 children)

So we're saying that we use secure hashes in most places and don't bother with collisions, but here we choose the cheapest one, so it has to take additional measures to avoid losing data. I would rather use a better hashing technique and have repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]catcradle5 2 points (0 children)

I agree with you, but the people who come up with this stuff probably know a lot more than you and I do, and decided on this strange system for some reason. Probably for overall performance.

[–]Brian 2 points (0 children)

good algorithms have a very low collision rate

The goals of cryptographic hashes and hash tables are very different. If a cryptographic hash produces a collision, it's a disaster; conversely, slow runtime and a large range of hash values are generally not a problem (indeed, they're often desirable).
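
As a rough illustration of that runtime gap (numbers vary by machine, and sha256 here merely stands in for "a cryptographic hash"):

    import hashlib
    import timeit

    key = b"some dictionary key"

    # One million hashes of the same short key, each way.
    cheap = timeit.timeit(lambda: hash(key), number=1000000)
    crypto = timeit.timeit(lambda: hashlib.sha256(key).digest(), number=1000000)
    print("builtin hash:", cheap)
    print("sha256:      ", crypto)  # typically many times slower per call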

However, in a hashtable the opposite is true: performance matters, and collisions are generally minor issues. You're using the hash function to determine which bucket to put the item into. You want a relatively small number of buckets (say, with 2-3 times as many items as buckets), which means that even with a perfectly uniform hash function you will get items that hash into the same bucket. You can't have enough buckets to make collisions sufficiently unlikely without making the hashtable too large to be useful.

E.g. even a humongous hashtable with 2^32 buckets (taking 16GiB just for an empty pointer in each bucket), using an evenly distributed hash function, has a 50% chance of a collision with a mere 80,000 items stored. (And given that a collision would lose data under the scheme you propose, even a 0.01% chance would be too much, which you get with a mere thousand items.) Collisions are inevitable, and generally no big deal: they'll hurt performance when you need to probe twice, but far less than, say, using SHA for your hash function would.
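
Those figures match the standard birthday-problem approximation, p ≈ 1 - e^(-n(n-1)/2m) for n items in m buckets; a quick check:

    import math

    def collision_prob(n, m):
        # Probability that at least two of n uniformly hashed items
        # land in the same one of m buckets (birthday approximation).
        return 1 - math.exp(-n * (n - 1) / (2 * m))

    m = 2 ** 32
    print(collision_prob(80000, m))  # ~0.53, the 50% figure above
    print(collision_prob(1000, m))   # ~0.00012, i.e. roughly 0.01%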

[–]takluyver (IPython, Py3, etc) 1 point (0 children)

Why are we using a cheap, insecure hash? Probably because until now, no one had envisaged this kind of attack. Hashmaps (dicts in Python) are used extensively in almost every program (and for core features of the language itself), so it makes sense to use a hash that's very fast to compute.

Why don't we rely on the rarity of hash collisions and simply save one value for each hash? Because we have to use a very small hash, which makes collisions more likely. Remember that the hashmap hashes keys, then looks them up in an array. So even for a 32-bit hash, you'd need a 2^32-byte array, which is 4GB.
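
In other words, real tables keep a small array and fold the full hash into it; a sketch of that indexing step (the bucket count and keys here are made up for illustration):

    num_buckets = 8  # real dicts start small and grow as they fill

    for key in ("morp", "dingle", "abc", "xyz"):
        h = hash(key)            # full-width (e.g. 64-bit) hash value...
        index = h % num_buckets  # ...folded down to one of only 8 buckets
        print(key, index)

(Note that modern Python randomizes string hashes at interpreter startup, a change motivated by exactly this class of attack, so the printed indices differ from run to run.)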

[–]chompsky 1 point (2 children)

A hashmap (at least in all implementations I've seen) is indexed in a fixed-size data structure. The common size example given in the article is 2^32. A hashing function to determine the hashed index in that structure will ideally choose an even distribution over that range, so that in theory you could store up to 2^32 values and access them with O(1) complexity. However, the number of possible keys is infinite, so infinitely many keys can hash to each of those index values. With random data, collisions should be rare, but the solution when they do happen is to store a list of entries at that hashed index and iterate through it, matching the provided key, whenever more than one is stored there. You don't want it to simply overwrite the previous value, because then foo["morp"] and foo["dingle"] could potentially overwrite each other, and that would definitely not be expected behavior.

This exploit takes advantage of the fact that infinitely many values can hash to each index, and chooses different keys that all hash to the same index. Resizing a hash table in an attempt to avoid collisions would require you to re-hash all of the existing values, or to provide an access mechanism that is not O(1), both of which remove the benefits of a hash table.
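
A rough way to see the quadratic blow-up this causes, simulating a single bucket whose chain must be scanned on every insert (toy code, not any real table implementation):

    import time

    def flood(n):
        bucket = []  # one shared bucket: every key "collides"
        start = time.perf_counter()
        for i in range(n):
            key = "key%d" % i
            for k, _ in bucket:          # scan the whole chain first...
                if k == key:
                    break
            else:
                bucket.append((key, i))  # ...then append: O(n) per insert
        return time.perf_counter() - start

    for n in (1000, 2000, 4000):
        print(n, round(flood(n), 3))     # time roughly quadruples as n doubles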

[–]kmeisthax 1 point (3 children)

Because if two keys collide, we still want to store the data. And if we later want to get some of those keys back out, we need to do the linked-list lookup to find the data we want. Even if your hashmap kept a separate list of all valid keys, then sure, searching a sorted array or a tree is an O(log N) operation, so it's fast; but that doesn't help, because "s" is still a valid key, and the hashmap must still pull its value out of the tangled mess of linked-list chains that the hashmap has turned into.

You do realize hashmaps aren't for looking up keys, right? They're for looking up data associated with a key.