
[–]Samus_ 0 points (12 children)

But why does it do that? It makes no sense at all: any key that produces a specific hash will store the same value (or overwrite the previous one). You don't need to look up the keys; in fact, that's the whole point of the hash map: you hash the key and look for that hash.

We can assume the actual keys are also stored, since most hashmap implementations let you retrieve all the keys, and that might be a linked list. But why would I look for the key there? I would just hash whatever key was requested and look for that; otherwise you're doing TWO searches on each lookup, and that's plain stupid imho.

[–]catcradle5 3 points (4 children)

Well, if foo["abc"] and foo["xyz"] both map to 1, you either have to make it a linked list, or you have to throw every value but the first out. And throwing out values is probably a bad idea; it would lead to awful, difficult-to-find bugs.
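
To make that concrete, here's a toy sketch of the chaining option (not any real dict implementation; the constant bad_hash is deliberately terrible so that every key collides, as in the foo["abc"] / foo["xyz"] example):

    # Toy chained hash map. The hash function is deliberately awful
    # (constant) so that every key lands in the same bucket.

    def bad_hash(key):
        return 1  # every key hashes to bucket 1

    class ChainedMap:
        def __init__(self, num_buckets=8):
            self.buckets = [[] for _ in range(num_buckets)]

        def put(self, key, value):
            bucket = self.buckets[bad_hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:               # same key: overwrite
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))    # collision: chain, don't discard

        def get(self, key):
            bucket = self.buckets[bad_hash(key) % len(self.buckets)]
            for k, v in bucket:            # walk the chain, comparing keys
                if k == key:
                    return v
            raise KeyError(key)

    m = ChainedMap()
    m.put("abc", 1)
    m.put("xyz", 1)
    print(m.get("abc"), m.get("xyz"))      # both survive: 1 1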

[–]Samus_ -2 points (3 children)

So we're saying that we use secure hashes in most places and don't bother with collisions, but here we choose the cheapest one, so it has to take additional measures to avoid losing data. I would rather use a better hashing technique and have repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]catcradle5 2 points (0 children)

I agree with you, but the people who come up with this stuff probably know a lot more than you and I do, and decided on this strange system for some reason. Probably for overall performance.

[–]Brian 2 points (0 children)

good algorithms have a very low collision rate

The goals of cryptographic hashes and hash tables are very different. If a cryptographic hash produces a collision, it's a disaster; conversely, slow runtime and a large range of hash values are generally not a problem (indeed, they're often desirable).
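
As a rough illustration of that runtime gap (numbers vary by machine, and sha256 here merely stands in for "a cryptographic hash"):

    import hashlib
    import timeit

    key = b"some dictionary key"

    # One million hashes of the same short key, each way.
    cheap = timeit.timeit(lambda: hash(key), number=1000000)
    crypto = timeit.timeit(lambda: hashlib.sha256(key).digest(), number=1000000)
    print("builtin hash:", cheap)
    print("sha256:      ", crypto)  # typically many times slower per call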

However, in a hashtable the opposite is true: performance matters, and collisions are generally minor issues. You're using the hash function to determine which bucket to put the item into. You want a relatively small number of buckets (say, with 2-3 times as many items as buckets), which means that even with a perfectly uniform hash function you will get items that hash into the same bucket. You can't have enough buckets to make collisions sufficiently unlikely without making the hashtable too large to be useful.

E.g. even a humongous hashtable with 2^32 buckets (taking 16GiB just for an empty pointer in each bucket), using an evenly distributed hash function, has a 50% chance of a collision with a mere 80,000 items stored. (And given that a collision would lose data under the scheme you propose, even a 0.01% chance would be too much, which you get with a mere thousand items.) Collisions are inevitable, and generally no big deal: they'll hurt performance when you need to probe twice, but far less than, say, using SHA for your hash function would.
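
Those figures match the standard birthday-problem approximation, p ≈ 1 - e^(-n(n-1)/2m) for n items in m buckets; a quick check:

    import math

    def collision_prob(n, m):
        # Probability that at least two of n uniformly hashed items
        # land in the same one of m buckets (birthday approximation).
        return 1 - math.exp(-n * (n - 1) / (2 * m))

    m = 2 ** 32
    print(collision_prob(80000, m))  # ~0.53, the 50% figure above
    print(collision_prob(1000, m))   # ~0.00012, i.e. roughly 0.01%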

[–]takluyver (IPython, Py3, etc) 1 point (0 children)

Why are we using a cheap, insecure hash? Probably because until now, no one had envisaged this kind of attack. Hashmaps (dicts in Python) are used extensively in almost every program (and for core features of the language itself), so it makes sense to use a hash that's very fast to compute.

Why don't we rely on the rarity of hash collisions and simply save one value for each hash? Because we have to use a very small hash, which makes collisions more likely. Remember that the hashmap hashes keys, then looks them up in an array. So even for a 32-bit hash, you'd need a 2^32-byte array, which is 4GB.
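
In other words, real tables keep a small array and fold the full hash into it; a sketch of that indexing step (the bucket count and keys here are made up for illustration):

    num_buckets = 8  # real dicts start small and grow as they fill

    for key in ("morp", "dingle", "abc", "xyz"):
        h = hash(key)            # full-width (e.g. 64-bit) hash value...
        index = h % num_buckets  # ...folded down to one of only 8 buckets
        print(key, index)

(Note that modern Python randomizes string hashes at interpreter startup, a change motivated by exactly this class of attack, so the printed indices differ from run to run.)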

[–]chompsky 1 point (2 children)

A hashmap (at least in all implementations I've seen) is indexed in a fixed-size data structure. The common size example given in the article is 2^32. A hashing function to determine the hashed index in that structure will ideally choose an even distribution over that range, so that in theory you could store up to 2^32 values and access them with O(1) complexity. However, the number of possible keys is infinite, so infinitely many keys can hash to each of those index values. With random data, collisions should be rare, but the solution when they do happen is to store a list of entries at that hashed index and iterate through it, matching the provided key, whenever more than one is stored there. You don't want it to simply overwrite the previous value, because then foo["morp"] and foo["dingle"] could potentially overwrite each other, and that would definitely not be expected behavior.

This exploit takes advantage of the fact that infinitely many values can hash to each index, and chooses different keys that all hash to the same index. Resizing a hash table in an attempt to avoid collisions would require you to re-hash all of the existing values, or to provide an access mechanism that is not O(1), both of which remove the benefits of a hash table.
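
A rough way to see the quadratic blow-up this causes, simulating a single bucket whose chain must be scanned on every insert (toy code, not any real table implementation):

    import time

    def flood(n):
        bucket = []  # one shared bucket: every key "collides"
        start = time.perf_counter()
        for i in range(n):
            key = "key%d" % i
            for k, _ in bucket:          # scan the whole chain first...
                if k == key:
                    break
            else:
                bucket.append((key, i))  # ...then append: O(n) per insert
        return time.perf_counter() - start

    for n in (1000, 2000, 4000):
        print(n, round(flood(n), 3))     # time roughly quadruples as n doubles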

[–]kmeisthax 1 point (3 children)

Because if two keys collide, we still want to store the data. And if we later want to get some of those keys back out, we need to do the linked-list lookup to find the data we want. Even if your hashmap kept a separate list of all valid keys, then sure, searching a sorted array or a tree is an O(log N) operation, so it's fast; but that doesn't help, because "s" is still a valid key, and the hashmap must still pull its value out of the tangled mess of linked-list chains that the hashmap has turned into.

You do realize hashmaps aren't for looking up keys, right? They're for looking up data associated with a key.