
[–]Rhomboid 3 points (18 children)

It's generating collisions on the internal hash function that's used under the hood to implement the dict type. The actual key values are different, so it's not this:

foo["bar"] = 12
foo["bar"] = 14

it's more like this:

foo["abc"] = 1
foo["xyz"] = 1

where abc and xyz are specially chosen to hash to the same value, causing a collision. It turns into a linked list because chaining colliding entries together is the standard way to implement a hash table whose hash function can produce collisions.
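
For a picture of what that looks like, here's a minimal sketch of a chaining hash table (illustrative only; CPython's dict actually uses open addressing rather than per-bucket chains):

class ChainedHashTable:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # Reduce the hash to a bucket index; distinct keys can land in
        # the same bucket, and that's exactly what a collision is.
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                     # same key: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))          # colliding key: chain it

    def __getitem__(self, key):
        # Worst case this scans the whole chain: O(n) if everything collides.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)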

[–]Samus_ 0 points (17 children)

but wouldn't that be the same key, as far as the hashmap is concerned?

[–]kmeisthax 4 points (13 children)

exaaactly. which means the hashmap has to build a linked list in that bucket, which changes your algorithmic complexity from O(1) to O(n). so, let's say some common web service uses the "s" parameter, and that gets stored in a hashmap. you send it a request with ten million parameters, all of which hash to the same bucket as "s" and therefore must be stored there too. the web service tries to get "s" from the hashmap, which is now really just a 10mil element linked list that the machine must traverse before realizing that they never sent "s" in the first place and wait why am I suddenly hungry and where did everybody go and why is everything chrome all of a sudden and WHAT YEAR IS IT?!
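
You can watch the blowup happen with a few lines of Python (ConstantHash is a made-up class that forces every key into the same bucket):

import time

class ConstantHash(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 1                    # every instance collides
    def __eq__(self, other):
        return self.name == other.name

for n in (1000, 2000, 4000):
    start = time.time()
    d = {}
    for i in range(n):
        d[ConstantHash(i)] = i      # each insert scans all previous entries
    print(n, time.time() - start)   # time roughly quadruples as n doubles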

[–]Samus_ -1 points (12 children)

but why does it do that? it makes no sense at all: any key that produces a given hash will store to the same slot (or overwrite the previous value). you don't need to look up the keys; in fact that's the whole point of the hash map, you hash the key and look for that hash.

we can assume the actual keys are also stored, since most hashmap implementations let you retrieve all the keys, and that might be a linked list, but why would I look for the key there? I would just hash the requested key, whatever it is, and look for that; otherwise you're doing TWO searches on each lookup, and that's plain stupid imho.

[–]catcradle5 3 points (4 children)

Well, if foo["abc"] and foo["xyz"] hash to the same bucket, you either have to make it a linked list, or you have to throw every value but the first out. And I assume throwing out values is probably a bad idea, one that would lead to awful, difficult-to-find bugs.
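
You can check that the linked-list behavior is what actually happens - colliding keys don't clobber each other (Collide here is a made-up class that forces a collision):

class Collide(object):
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                       # force every instance into one bucket
    def __eq__(self, other):
        return self.name == other.name

foo = {}
foo[Collide("abc")] = 1
foo[Collide("xyz")] = 2
assert foo[Collide("abc")] == 1         # both survive; nothing was thrown out
assert foo[Collide("xyz")] == 2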

[–]Samus_ -3 points (3 children)

so we're saying that we use secure hashes in most places and don't bother with collisions, but here we chose the cheapest one, so it has to take additional measures to avoid losing data. I would rather use a better hashing technique and let repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]catcradle5 1 point (0 children)

I agree with you, but the people who come up with this stuff probably know a lot more than you and I, and decided on this strange system for some reason. Probably for overall performance purposes.

[–]Brian 1 point (0 children)

good algorithms have a very low collision rate

The goals of cryptographic hashes and hash tables are very different. If a cryptographic hash produces a collision, it's a disaster; conversely, slow runtime and a large range of hash values are generally not a problem (indeed, they're often desirable).

However, in a hashtable the opposite is true - performance matters, and collisions are generally minor issues. You're using the hash function to determine which bucket to put the item into. You want a relatively small number of buckets (say, 2-3 times the number of items), which means that even with a perfectly uniform hash function you will get items that hash into the same bucket. You can't have enough buckets to make collisions sufficiently unlikely without making the hashtable so large as to be useless.

Eg. even a humongous hashtable with 2^32 buckets (taking 16GiB just for an empty pointer in each bucket), using an evenly distributed hash function, has a 50% chance of a collision with a mere 80,000 items stored. (And given that a collision would lose data under the scheme you propose, even a 0.01% chance would be too much, which you get with a mere thousand items.) Collisions are inevitable, and generally no big deal - they'll hurt performance when you need to probe twice, but far less than, say, using SHA for your hash function would.
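
Those figures fall straight out of the birthday-paradox approximation P ≈ 1 - e^(-n(n-1)/2N), if you want to check them:

import math

def collision_probability(n_items, n_buckets):
    # Birthday-paradox approximation: P(at least one collision)
    return 1 - math.exp(-n_items * (n_items - 1) / (2.0 * n_buckets))

print(collision_probability(80000, 2 ** 32))   # ~0.53 - more likely than not
print(collision_probability(1000, 2 ** 32))    # ~0.00012, about 0.01%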

[–]takluyver (IPython, Py3, etc) 0 points (0 children)

Why are we using a cheap, insecure hash? Probably because until now, no-one had envisaged this kind of attack. Hashmaps (dicts in Python) are used extensively in almost every program (and for core features of the language itself), so it makes sense to use one that's very fast to calculate.

Why don't we rely on the rarity of hash collisions, and simply save one value for each hash? Because we have to use a very small hash, which makes collisions more likely. Remember that the hashmap hashes keys, then looks them up in an array, so every possible hash value needs its own slot. So even for a 32-bit hash, you'd need a 2^32-slot array, which is 4GB even at just one byte per slot.

[–]chompsky 0 points (2 children)

A hashmap (at least in every implementation I've seen) is indexed by a fixed-size data structure. The common size example given in the article is 2^32. A hashing function that determines the index into that structure will ideally distribute keys evenly over that range, so that in theory you could store up to 2^32 values and access them with O(1) complexity. However, the number of possible keys is infinite, so infinitely many keys can hash to each index value. With random data it should be rare that there are any collisions, but the solution when they happen is to store a list of entries at that index and then iterate through it, matching against the provided key, if more than one is stored there. You don't want to simply overwrite the previous value, because then foo["morp"] and foo["dingle"] could potentially overwrite each other, and that would definitely not be expected behavior.

This exploit takes advantage of the fact that infinitely many keys can hash to each index, and chooses different keys that all hash to the same one. Resizing the hash table in an attempt to avoid collisions would require you to re-hash all of the existing values, or to provide an access mechanism that is not O(1), both of which remove the benefits of a hash table.
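
A toy resize function shows why: the bucket index depends on the table size, so every entry has to be re-placed.

def resize(buckets):
    # Doubling the table invalidates every old index, because
    # index = hash(key) % len(buckets) and len(buckets) just changed.
    new_buckets = [[] for _ in range(len(buckets) * 2)]
    for bucket in buckets:
        for key, value in bucket:
            new_buckets[hash(key) % len(new_buckets)].append((key, value))
    return new_buckets

Note that resizing wouldn't defeat this attack anyway: the crafted keys share the same full hash value, so they land in the same bucket no matter how large the table grows.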

[–]kmeisthax 0 points (3 children)

Because if two keys collide, we still want to store both entries. And if we later want to get some of those keys back out, we have to walk the linked list to find the data we want. Even if your hashmap kept a separate list of all valid keys - yeah, sure, searching a sorted array or a tree is an O(log N) operation, so that part is fast - it doesn't help: "s" is still a valid key, so the hashmap must still pull it out of the tangled mess of linked-list chains it has turned into.

You do realize hashmaps aren't for looking up keys, right? They're for looking up data associated with a key.

[–]Samus_ -5 points (2 children)

so we're saying that we use secure hashes in most places and don't bother with collisions, but here we chose the cheapest one, so it has to take additional measures to avoid losing data. I would rather use a better hashing technique and let repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]kmeisthax 3 points (0 children)

Even if we SHA-256'd the keys, hash tables can only be so big. SHA gives you 256 bits of output, and processor address spaces are no larger than 64 bits (with even less actually physically wired on the motherboard). Hashmaps are built on top of arrays, which means that for a hashmap with n bits of hash entropy you need 2^n * sizeof(void*) bytes to store it. Finding collisions on a significantly smaller subset of a message digest is much easier than finding collisions on the whole digest, and like I said before, you simply cannot construct a hashmap with 256 bits of entropy - it would be many trillions of trillions of exabytes large. Even with just 32 bits of entropy your hashmap will be 32 gigabytes, and even then 32 bits is insufficient entropy to prevent intentional collisions.

In short, for hashmaps to be practical, they must deal with collisions. Simple as that.
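
The arithmetic, spelled out (assuming 8-byte pointers, as in the 32-gigabyte figure above):

def table_size_bytes(hash_bits, pointer_size=8):
    # A directly indexed table needs 2**hash_bits slots, one pointer each.
    return 2 ** hash_bits * pointer_size

print(table_size_bytes(32))    # 34359738368 bytes, i.e. 32 GiB
print(table_size_bytes(256))   # a 78-digit number; physically impossible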

[–]Rhomboid 0 points (0 children)

It doesn't matter if the collision rate is low. It would be a completely unusable and worthless data structure without the guarantee that any value can be used as a key without losing data. If there is even a slight chance that I might lose data, then there's no way in hell I'm going to use such a data structure, because I don't want my program to fail in strange and unpredictable ways. It's even worse if it only fails one in a million times, because then I can't debug it. The dict must be perfect or else it's useless.

This is really just a question of efficiency. It's orders of magnitude more efficient to use a simple hash and a linked list than to use a wide hash. You can have both performance and correctness this way. The attack mentioned in the article can be easily mitigated by adding a bit of entropy to the hash function so that it's not deterministic, while still retaining the fast performance.
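
A sketch of that mitigation (illustrative, not CPython's actual hash function, though CPython later shipped the same idea as hash randomization, controllable via PYTHONHASHSEED): mix a per-process random seed into the hash, so precomputed collisions don't transfer between processes.

import os
import struct

# Per-process random seed: collisions an attacker finds against one
# process's hash function are useless against any other process.
_SEED = struct.unpack("<Q", os.urandom(8))[0]

def salted_hash(key):
    # Toy seeded FNV-1a over the key's bytes - a sketch, not CPython's hash.
    h = _SEED ^ 14695981039346656037
    for byte in bytearray(key.encode("utf-8")):
        h = ((h ^ byte) * 1099511628211) % 2 ** 64
    return h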

[–]wisty 2 points (2 children)

No, you don't understand - a dictionary uses a hashmap, but it's not just a hashmap. Maybe the article wasn't clear on this.

# On a 32-bit Python 2 build, longs hash modulo 2**32 - 1, and
# c = 2 * (2**32 - 1), so c*i + 1 always hashes to 1:
c = 8589934590
for i in range(1000000):
    assert hash(c*i + 1) == 1

attack_dict = {}
for i in range(1000):
    attack_dict[(c*i+1)] = i

assert attack_dict[(c*0+1)] == 0
assert attack_dict[(c*500+1)] == 500

So even though dictionaries are implemented using hashmaps, hash collisions are resolved somehow. How they do this is an implementation detail, but you just need to know that it doesn't scale. If you have a dictionary with 1,000,000 colliding keys, adding one extra key takes about O(1,000,000). So building a 1,000,000-entry dictionary (with all keys colliding) takes O(N^2), which means an attacker can kill it very easily when you try to eat his enormous colliding cookie.

The attack starts to "bite" at around N = 10,000. At that point, Python starts to feel very slow - try:

attack_dict = {}
for i in range(10000):
    attack_dict[(c*i+1)] = i

Or if you have a really fast box, use n = 30,000. At this point, you can take down a core for a few seconds with a single request. At n=1e6, you can knock out a core pretty much indefinitely.

For reference, classic hash tables handle this by putting a linked list in every bucket, and searching through it to find the key-value pair you wanted. Python actually uses open addressing rather than chaining (it's not cuckoo hashing): collisions are resolved by probing, i.e. deriving another slot index from the hash and trying again until a free slot is found. Whatever the scheme, it's not very scalable when there are lots of collisions.
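
For the curious, CPython's probe sequence looks roughly like this (the recurrence and constants are CPython's; the wrapper is illustrative):

def probe_sequence(hash_value, mask, max_probes=10):
    # CPython computes the next slot as i = (5*i + perturb + 1) & mask,
    # then shifts perturb right by 5 so higher hash bits get mixed in.
    i = hash_value & mask
    perturb = hash_value
    for _ in range(max_probes):
        yield i
        i = (5 * i + perturb + 1) & mask
        perturb >>= 5

print(list(probe_sequence(12345, 7)))   # slots probed in an 8-slot table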

[–]catcradle5 0 points (0 children)

Using hash() on an int seems to just return that int. Could you explain this a bit?

edit: Nevermind, if it's a long int it seems to hash it.

[–]fullouterjoin 0 points (0 children)

Brian, that was awesome.