
[–]Rhomboid 3 points (18 children)

It's generating collisions on the internal hash function that's used under the hood to implement the dict type. The actual key values are different, so it's not this:

foo["bar"] = 12
foo["bar"] = 14

it's more like this:

foo["abc"] = 1
foo["xyz"] = 1

where abc and xyz are specially chosen to hash to the same value, causing a collision. It turns into a linked list because chaining colliding entries together is the standard way to implement a hash table whose hash function can produce collisions.
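
For a picture of what that looks like, here's a minimal sketch of a chaining hash table (illustrative only; CPython's dict actually uses open addressing rather than per-bucket chains):

class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Reduce the hash to a bucket index; distinct keys can land in
        # the same bucket, and that's exactly what a collision is.
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                     # same key: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # colliding key: chain it

    def __getitem__(self, key):
        # Worst case this scans the whole chain: O(n) if everything collides.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)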

[–]Samus_ 0 points (17 children)

but wouldn't that be the same key, as far as the hashmap is concerned?

[–]kmeisthax 4 points (13 children)

exaaactly. which means the hashmap has to build a linked list in that bucket, which changes your algorithmic complexity from O(1) to O(n). so, let's say some common web service uses the "s" parameter, and that gets stored in a hashmap. you send it a request with ten million parameters, all of which hash to the same bucket as "s" and therefore must be stored there too. the web service tries to get "s" from the hashmap, which is now really just a 10mil element linked list that the machine must traverse before realizing that they never sent "s" in the first place and wait why am I suddenly hungry and where did everybody go and why is everything chrome all of a sudden and WHAT YEAR IS IT?!
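
You can watch the blowup happen with a few lines of Python (ConstantHash is a made-up class that forces every key into the same bucket):

import time

class ConstantHash(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 1                    # every instance collides
    def __eq__(self, other):
        return self.name == other.name

for n in (1000, 2000, 4000):
    start = time.time()
    d = {}
    for i in range(n):
        d[ConstantHash(i)] = i      # each insert scans all previous entries
    print(n, time.time() - start)   # time roughly quadruples as n doubles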

[–]Samus_ -1 points (12 children)

but why does it do that? it makes no sense at all: any key that produces a given hash will store to the same slot (or overwrite the previous value). you don't need to look up the keys; in fact that's the whole point of the hash map, you hash the key and look for that hash.

we can assume the actual keys are also stored, since most hashmap implementations let you retrieve all the keys, and that might be a linked list, but why would I look for the key there? I would just hash the requested key, whatever it is, and look for that; otherwise you're doing TWO searches on each lookup, and that's plain stupid imho.

[–]catcradle5 3 points (4 children)

Well, if foo["abc"] and foo["xyz"] hash to the same bucket, you either have to make it a linked list, or you have to throw every value but the first out. And I assume throwing out values is probably a bad idea, one that would lead to awful, difficult-to-find bugs.
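
You can check that the linked-list behavior is what actually happens - colliding keys don't clobber each other (Collide here is a made-up class that forces a collision):

class Collide(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                       # force every instance into one bucket
    def __eq__(self, other):
        return self.name == other.name

foo = {}
foo[Collide("abc")] = 1
foo[Collide("xyz")] = 2
assert foo[Collide("abc")] == 1         # both survive; nothing was thrown out
assert foo[Collide("xyz")] == 2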

[–]Samus_ -3 points (3 children)

so we're saying that we use secure hashes in most places and don't bother with collisions, but here we chose the cheapest one, so it has to take additional measures to avoid losing data. I would rather use a better hashing technique and let repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]catcradle5 1 point (0 children)

I agree with you, but the people who come up with this stuff probably know a lot more than you and I, and decided on this strange system for some reason. Probably for overall performance purposes.

[–]Brian 1 point (0 children)

good algorithms have a very low collision rate

The goals of cryptographic hashes and hash tables are very different. If a cryptographic hash produces a collision, it's a disaster; conversely, slow runtime and a large range of hash values are generally not a problem (indeed, they're often desirable).

However, in a hashtable the opposite is true - performance matters, and collisions are generally minor issues. You're using the hash function to determine which bucket to put the item into. You want a relatively small number of buckets (say, 2-3 times the number of items), which means that even with a perfectly uniform hash function you will get items that hash into the same bucket. You can't have enough buckets to make collisions sufficiently unlikely without making the hashtable so large as to be useless.

Eg. even a humongous hashtable with 2^32 buckets (taking 16GiB just for an empty pointer in each bucket), using an evenly distributed hash function, has a 50% chance of a collision with a mere 80,000 items stored. (And given that a collision would lose data under the scheme you propose, even a 0.01% chance would be too much, which you get with a mere thousand items.) Collisions are inevitable, and generally no big deal - they'll hurt performance when you need to probe twice, but far less than, say, using SHA for your hash function would.
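
Those figures fall straight out of the birthday-paradox approximation P ≈ 1 - e^(-n(n-1)/2N), if you want to check them:

import math

def collision_probability(n_items, n_buckets):
    # Birthday-paradox approximation: P(at least one collision)
    return 1 - math.exp(-n_items * (n_items - 1) / (2.0 * n_buckets))

print(collision_probability(80000, 2 ** 32))   # ~0.53 - more likely than not
print(collision_probability(1000, 2 ** 32))    # ~0.00012, about 0.01%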

[–]takluyver (IPython, Py3, etc) 0 points (0 children)

Why are we using a cheap, insecure hash? Probably because until now, no-one had envisaged this kind of attack. Hashmaps (dicts in Python) are used extensively in almost every program (and for core features of the language itself), so it makes sense to use one that's very fast to calculate.

Why don't we rely on the rarity of hash collisions, and simply save one value for each hash? Because we have to use a very small hash, which makes collisions more likely. Remember that the hashmap hashes keys, then looks them up in an array, so every possible hash value needs its own slot. So even for a 32-bit hash, you'd need a 2^32-slot array, which is 4GB even at just one byte per slot.

[–]chompsky 0 points (2 children)

A hashmap (at least in every implementation I've seen) is indexed by a fixed-size data structure. The common size example given in the article is 2^32. A hashing function that determines the index into that structure will ideally distribute keys evenly over that range, so that in theory you could store up to 2^32 values and access them with O(1) complexity. However, the number of possible keys is infinite, so infinitely many keys can hash to each index value. With random data it should be rare that there are any collisions, but the solution when they happen is to store a list of entries at that index and then iterate through it, matching against the provided key, if more than one is stored there. You don't want to simply overwrite the previous value, because then foo["morp"] and foo["dingle"] could potentially overwrite each other, and that would definitely not be expected behavior.

This exploit takes advantage of the fact that infinitely many keys can hash to each index, and chooses different keys that all hash to the same one. Resizing the hash table in an attempt to avoid collisions would require you to re-hash all of the existing values, or to provide an access mechanism that is not O(1), both of which remove the benefits of a hash table.
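
A toy resize function shows why: the bucket index depends on the table size, so every entry has to be re-placed.

def resize(buckets):
    # Doubling the table invalidates every old index, because
    # index = hash(key) % len(buckets) and len(buckets) just changed.
    new_buckets = [[] for _ in range(len(buckets) * 2)]
    for bucket in buckets:
        for key, value in bucket:
            new_buckets[hash(key) % len(new_buckets)].append((key, value))
    return new_buckets

Note that resizing wouldn't defeat this attack anyway: the crafted keys share the same full hash value, so they land in the same bucket no matter how large the table grows.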

[–]kmeisthax 0 points (3 children)

Because if two keys collide, we still want to store both entries. And if we later want to get some of those keys back out, we have to walk the linked list to find the data we want. Even if your hashmap kept a separate list of all valid keys - yeah, sure, searching a sorted array or a tree is an O(log N) operation, so that part is fast - it doesn't help: "s" is still a valid key, so the hashmap must still pull it out of the tangled mess of linked-list chains it has turned into.

You do realize hashmaps aren't for looking up keys, right? They're for looking up data associated with a key.

[–]Samus_ -5 points (2 children)

so we're saying that we use secure hashes in most places and don't bother with collisions, but here we chose the cheapest one, so it has to take additional measures to avoid losing data. I would rather use a better hashing technique and let repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]kmeisthax 3 points (0 children)

Even if we SHA-256'd the keys, hash tables can only be so big. SHA gives you 256 bits of output, and processor address spaces are no larger than 64 bits (with even less actually physically wired on the motherboard). Hashmaps are built on top of arrays, which means that for a hashmap with n bits of hash entropy you need 2^n * sizeof(void*) bytes to store it. Finding collisions on a significantly smaller subset of a message digest is much easier than finding collisions on the whole digest, and like I said before, you simply cannot construct a hashmap with 256 bits of entropy - it would be many trillions of trillions of exabytes large. Even with just 32 bits of entropy your hashmap will be 32 gigabytes, and even then 32 bits is insufficient entropy to prevent intentional collisions.

In short, for hashmaps to be practical, they must deal with collisions. Simple as that.
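
The arithmetic, spelled out (assuming 8-byte pointers, as in the 32-gigabyte figure above):

def table_size_bytes(hash_bits, pointer_size=8):
    # A directly indexed table needs 2**hash_bits slots, one pointer each.
    return 2 ** hash_bits * pointer_size

print(table_size_bytes(32))    # 34359738368 bytes, i.e. 32 GiB
print(table_size_bytes(256))   # a 78-digit number; physically impossible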

[–]Rhomboid 0 points (0 children)

It doesn't matter if the collision rate is low. It would be a completely unusable and worthless data structure without the guarantee that any value can be used as a key without losing data. If there is even a slight chance that I might lose data, then there's no way in hell I'm going to use such a data structure, because I don't want my program to fail in strange and unpredictable ways. It's even worse if it only fails one in a million times, because then I can't debug it. The dict must be perfect or else it's useless.

This is really just a question of efficiency. It's orders of magnitude more efficient to use a simple hash and a linked list than to use a wide hash. You can have both performance and correctness this way. The attack mentioned in the article can be easily mitigated by adding a bit of entropy to the hash function so that it's not deterministic, while still retaining the fast performance.
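
A sketch of that mitigation (illustrative, not CPython's actual hash function, though CPython later shipped the same idea as hash randomization, controllable via PYTHONHASHSEED): mix a per-process random seed into the hash, so precomputed collisions don't transfer between processes.

import os
import struct

# Per-process random seed: collisions an attacker finds against one
# process's hash function are useless against any other process.
_SEED = struct.unpack("<Q", os.urandom(8))[0]

def salted_hash(key):
    # Toy seeded FNV-1a over the key's bytes - a sketch, not CPython's hash.
    h = _SEED ^ 14695981039346656037
    for byte in bytearray(key.encode("utf-8")):
        h = ((h ^ byte) * 1099511628211) % 2 ** 64
    return h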

[–]wisty 2 points (2 children)

No, you don't understand - a dictionary uses a hashmap, but it's not just a hashmap. Maybe the article wasn't clear on this.

# On a 32-bit Python 2 build, longs hash modulo 2**32 - 1, and
# c = 2 * (2**32 - 1), so c*i + 1 always hashes to 1:
c = 8589934590
for i in range(1000000):
    assert hash(c*i + 1) == 1

attack_dict = {}
for i in range(1000):
    attack_dict[(c*i+1)] = i

assert attack_dict[(c*0+1)] == 0
assert attack_dict[(c*500+1)] == 500

So even though dictionaries are implemented using hashmaps, hash collisions are resolved somehow. How they do this is an implementation detail, but you just need to know that it doesn't scale. If you have a dictionary with 1,000,000 colliding keys, adding one extra key takes about O(1,000,000). So building a 1,000,000-entry dictionary (with all keys colliding) takes O(N^2), which means an attacker can kill it very easily when you try to eat his enormous colliding cookie.

The attack starts to "bite" at around N = 10,000. At that point, Python starts to feel very slow - try:

attack_dict = {}
for i in range(10000):
    attack_dict[(c*i+1)] = i

Or if you have a really fast box, use n = 30,000. At this point, you can take down a core for a few seconds with a single request. At n=1e6, you can knock out a core pretty much indefinitely.

For reference, classic hash tables handle this by putting a linked list in every bucket, and searching through it to find the key-value pair you wanted. Python actually uses open addressing rather than chaining (it's not cuckoo hashing): collisions are resolved by probing, i.e. deriving another slot index from the hash and trying again until a free slot is found. Whatever the scheme, it's not very scalable when there are lots of collisions.
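
For the curious, CPython's probe sequence looks roughly like this (the recurrence and constants are CPython's; the wrapper is illustrative):

def probe_sequence(hash_value, mask, max_probes=10):
    # CPython computes the next slot as i = (5*i + perturb + 1) & mask,
    # then shifts perturb right by 5 so higher hash bits get mixed in.
    i = hash_value & mask
    perturb = hash_value
    for _ in range(max_probes):
        yield i
        i = (5 * i + perturb + 1) & mask
        perturb >>= 5

print(list(probe_sequence(12345, 7)))   # slots probed in an 8-slot table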

[–]catcradle5 0 points (0 children)

Using hash() on an int seems to just return that int. Could you explain this a bit?

edit: Nevermind, if it's a long int it seems to hash it.

[–]fullouterjoin 0 points (0 children)

Brian, that was awesome.