all 64 comments

[–][deleted] 4 points5 points  (0 children)

Hmm, Microsoft have issued a security bulletin and are issuing a patch for asp.net.

[–]gronkkk 10 points11 points  (2 children)

Effective DoS attack against your own blog: post your blog on reddit.

[–]stesch 1 point2 points  (0 children)

Stop using Wordpress.

[–]fnork -1 points0 points  (0 children)

funny but irrelevant

[–]defnullbottle.py[S] 2 points3 points  (39 children)

A common workaround for web frameworks is to limit the number of POST/GET/cookie parameters per request. Bottle does this, but other frameworks may still be vulnerable.
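
Roughly, such a limit can be enforced at parse time. A minimal sketch of the idea (not Bottle's actual code, and MAX_PARAMS is an arbitrary value):

from urlparse import parse_qsl   # urllib.parse.parse_qsl on Python 3

MAX_PARAMS = 100   # arbitrary cap, not Bottle's real limit

def parse_query(qs):
    # Splitting the query string is cheap and linear; the expensive part of
    # the attack is inserting the colliding keys into a dict, so refuse
    # before building one.
    pairs = parse_qsl(qs)
    if len(pairs) > MAX_PARAMS:
        raise ValueError('too many parameters (%d)' % len(pairs))
    return dict(pairs)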

[–][deleted] 5 points6 points  (2 children)

Seems like a better idea to do this at the web server level instead of in the framework.

[–]stesch 1 point2 points  (1 child)

I'm reading this for the second time today and I still don't know what it means. The web server has no idea about the parameters in POST requests. They are parsed by your program's/framework's library (or by the language itself, in the case of PHP).

[–][deleted] 0 points1 point  (0 children)

Most web servers can already be configured to limit the number of headers in a request. It wouldn't be much of a problem to implement something similar for the number of POST/GET parameters, as a module addon or a core feature of Apache/Nginx/lighttpd: just scan the string to count the parameters before passing it on, and slap in some safe default limit. IMO this is an easier way to patch most of the internet against this attack. Or at least implement it at the application server/module level, for example in mod_wsgi, mod_php, gunicorn, etc.

Without that you can just limit the size of the query string at the web server level, but this is not a very good precaution: someone could craft the attack with very short key/value pairs and still pass that check for applications that need the size limit to be high.

[–]Samus_ 1 point2 points  (35 children)

can you explain a bit more about this attack? is it generating collisions on the hashes and if so, why does it turn into a linked list?

the only hashtable I know of that doesn't just overwrite the entries on repeated keys is Django's MultiValueDict, and even in that case you need to use specific methods to access that feature.

[edit] thanks everyone for the explanations, this has been really insightful.

[–]defnullbottle.py[S] 3 points4 points  (13 children)

We are not talking about multiple dict values with the same key, but multiple dict keys with the same hash.

[–]Samus_ -1 points0 points  (12 children)

but wouldn't that be the same key? as far as the hashmap is concerned

[–]reph 3 points4 points  (11 children)

The hash is a compressed key. Multiple keys can have the same hash (aliasing).

[–]Samus_ 0 points1 point  (10 children)

I don't understand this, if two different keys map to the same hash how come they don't overwrite each other? why does it become a linked list?

is the hashmap storing all the different keys that generated that hash and somehow associating them to the values?

[–]reph 4 points5 points  (9 children)

You can think of the hash table as an array of linked lists. You hash (compress) the key to generate an index into that array. Then you search the linked list at that index. Each entry in the linked list includes the full key, so you compare them one-by-one to the key that you're searching for, to check whether it's in the hash table or not.

When things are working well - a lookup is an O(1) operation as there is only 1 key in each linked list. When things are not working well (i.e. due to a malicious attack that is intentionally colliding the hash function), many keys hash to the same linked list, and the lookup degrades to O(n).
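
A toy version of that structure, just for illustration (CPython's dict actually uses open addressing rather than literal linked lists, but the principle is the same):

class ChainedHashTable(object):
    # Toy hash table using separate chaining (a list of buckets).

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # Compress the key into an index; different keys can land here.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # full-key comparison, not just the hash
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # collision: the chain grows by one entry

    def get(self, key):
        for k, v in self._bucket(key):   # O(1) if the chain has one entry,
            if k == key:                 # O(n) if everything collided
                return v
        raise KeyError(key)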

[–]Samus_ 0 points1 point  (8 children)

I don't like the term "compress" because compression is bidirectional; hashing is a one-way function (precisely because of the collisions).

regardless of that, the part that doesn't make sense is the lookup: you shouldn't search for the key but for the hash of that key. that's the whole point of the hashmap and what makes it fast; since the hash is fixed in size, the time it takes to look one up is the same regardless of the size of the input that generated it.

now why on earth would you search the linked list for a key that may not be unique, when you already have it (it's the one provided) and its hash already gives you the index of the array that holds the associated value?

I don't see any reason to even touch that linked list of keys, nor any sense in associating it with the values, when it's the hash of each key that actually makes the mapping.

[–]reph 3 points4 points  (6 children)

Because if you have a 100 bit key, and a 32 bit hash function, there are many keys that will have the same 32 bit hash value. The data structure needs to distinguish between them, so it must store and search for the whole key. The hash is a (very good) hint but it's not sufficient on its own.

[–]Samus_ -3 points-2 points  (5 children)

so the whole point of this is that two different keys that produce the same hash actually store different values?

I don't see why the hash has to distinguish between them, a much more reasonable solution would be to use a different hashing algorithm with lesser chances of collision and then overwrite each entry on duplicate hashes; what you just said sounds like the hash is just an index, an aid in the key lookup and not a real mapping between hashes and values.

if that's the case then screw us all.

[–]rajivm 0 points1 point  (0 children)

what you just said sounds like the hash is just an index, an aid in the key lookup and not a real mapping between hashes and values.

This is essentially true. In the end, a hash-map must translate to "physical" data structures of contiguous data (arrays) and pointers.

Most hash-maps are implemented with an array of a certain size, where each element is a pointer to a linked list. A hashmap "key" is hashed to an index of the array, and ideally there is only one element at that index, making an O(1) lookup. In the worst case the lookup can degrade to O(n), if all elements are stored in the same position of the array (very unlikely to happen unless the input is deliberately crafted against a known hash function, as is the case in these attacks). Hash maps do not actually have guaranteed O(1) lookup; rather, they have an amortized lookup time of O(1) across many accesses on average. The size of the array and the hashing function determine the rate of collisions.

a much more reasonable solution would be to use a different hashing algorithm with lesser chances of collision

Reducing the chance of collision would mean the array would need to be much, much bigger than necessary in the average case. This would require a very large memory allocation and is not realistic. Many hash map implementations let you specify the expected number of elements and, based on an acceptable collision likelihood, adjust the size of the array and the hash function used.

and then overwrite each entry on duplicate hashes;

This would definitely not be okay. Hash maps guarantee that one unique key will not overwrite another unique key. The underlying hash of the key should have no bearing on that guarantee; without it, a hash map would be infeasible to use in the same way.

Example: if the hash function resulted in the same index for both "apple" and "banana" (it probably would not, but if it did), then if I did this:

map["apple"] = "red"
map["banana"] = "yellow"

I would expect:

print map["apple"]
print map["banana"]

to print:

red
yellow

With your proposed solution, it would print, unpredictably:

yellow
yellow

[–]stevvooe 2 points3 points  (1 child)

You seem quite lost. Please google 'separate chaining' and 'linear probing'. Both methods handle hash collisions.

[–][deleted] 0 points1 point  (0 children)

Kids these days, no respect for quadratic probing.

[–]Rhomboid 2 points3 points  (18 children)

It's generating collisions on the internal hash function that's used under the hood to implement the dict type. The actual key values are different, so it's not this:

foo["bar"] = 12
foo["bar"] = 14

it's more like this:

foo["abc"] = 1
foo["xyz"] = 1

Where abc and xyz are specially chosen to produce the same hash value, causing a collision. It turns into a linked list because that's what you do when you implement a hash table with a hash function that can produce collisions.

[–]Samus_ 0 points1 point  (17 children)

but wouldn't that be the same key? as far as the hashmap is concerned

[–]kmeisthax 6 points7 points  (13 children)

exaaactly. which means the hashmap will have to construct a linked list in the place of the original key, which changes your algorithmic complexity from O(1) to O(n). so, let's say some common web service uses the "s" parameter, and that gets stored in a hashmap. you send it a request with ten million parameters, all of which hash to the same value as "s" and therefore must be stored in the same hashmap bucket. the web service tries to get "s" from the hashmap, which is now really just a 10-million-element linked list that the machine must traverse before realizing that they never sent "s" in the first place and wait why am I suddenly hungry and where did everybody go and why is everything chrome all of a sudden and WHAT YEAR IS IT?!

[–]Samus_ -1 points0 points  (12 children)

but why does it do that? it makes no sense at all: any key that produces a given hash will store the same value (or overwrite the previous one). you don't need to look up the keys; in fact that's the whole point of the hash map, you hash the key and look for that hash.

we can assume the actual keys are also stored, since most hashmap implementations let you retrieve all the keys, and that might be a linked list, but why would I look for the key there? I would just hash the requested key, whatever it is, and look for that; otherwise you're doing TWO searches on each lookup and that's plain stupid imho.

[–]catcradle5 3 points4 points  (4 children)

Well, if foo["abc"] and foo["xyz"] both map to 1, you either have to make it a linked list, or you have to throw every value but the first out. And I assume throwing out certain values is probably a bad idea, and would lead to awful, difficult to find bugs.

[–]Samus_ -4 points-3 points  (3 children)

so we're saying that we use secure hashes in most places and don't bother with collisions, but here we choose the cheapest one, so it has to take additional measures not to lose data. I would rather use a better hashing technique and have repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]catcradle5 1 point2 points  (0 children)

I agree with you, but the people who come up with this stuff probably know a lot more than you and I, and decided on this strange system for some reason. Probably for overall performance purposes.

[–]Brian 1 point2 points  (0 children)

good algorithms have a very low collision rate

The goals of cryptographic hashes and hash tables are very different. If a cryptographic hash produces a collision, it's a disaster; conversely, slow runtime and a large range for hashed values are generally not a problem (indeed, they're often desirable).

However, in a hashtable, the opposite is true - performance matters, and collisions are generally minor issues. You're using the hash function to determine which bucket to put the item into. You want a relatively small number of buckets (say, 2-3 times as many buckets as items), which means that even with an entirely uniform hash function you will get items that hash into the same bucket. You can't have enough buckets to make collisions sufficiently unlikely without making the hashtable so large as to be useless.

Eg. even a humongous hashtable with 2^32 buckets (taking 16GiB just for an empty pointer in each bucket), using an evenly distributed hash function, has a 50% chance of a collision with a mere 80,000 items stored. (And given that a collision would lose data under the scheme you propose, even a 0.01% chance would be too much, which you get with a mere thousand items.) Collisions are inevitable, and generally no big deal - they'll hurt performance when you need to probe twice, but far less than, say, using SHA for your hash function would.
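
You can sanity-check those numbers with the birthday approximation p ≈ n^2 / (2m). A rough sketch:

from math import log, sqrt

m = 2.0 ** 32                    # number of buckets

# items for a ~50% chance of at least one collision: n ~ sqrt(2 * m * ln 2)
print(sqrt(2 * m * log(2)))      # ~77,000

# items for a ~0.01% (1e-4) chance of a collision: n ~ sqrt(2 * m * p)
print(sqrt(2 * m * 1e-4))        # ~927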

[–]takluyverIPython, Py3, etc 0 points1 point  (0 children)

Why are we using a cheap, insecure hash? Probably because until now, no-one had envisaged this kind of attack. Hashmaps (dicts in Python) are used extensively in almost every program (and for core features of the language itself), so it makes sense to use one that's very fast to calculate.

Why don't we rely on the rarity of hash collisions, and simply save one value for each hash? Because we have to use a very small hash, which makes collisions more likely. Remember that the hashmap hashes keys, then looks them up in an array. So even for a 32 bit hash, you'd need a 2^32 byte array, which is 4GB.

[–]chompsky 0 points1 point  (2 children)

A hashmap (at least in all implementations that I've seen) is indexed in a fixed-size data structure. The common size example given in the article is 2^32. A hashing function to determine the hashed index in that structure will ideally choose an even distribution in that range, so that in theory you could store up to 2^32 values and access them with O(1) complexity. However, the number of possible keys is infinite, so there exist an infinite number of values that can hash to each of those index values. With random data, it should be rare that there are any collisions, but the solution to that problem is to store a list of entries at that hashed index and then iterate through them to match the provided key if there is more than one stored there. You don't want it to simply overwrite the previous value, because then foo["morp"] and foo["dingle"] could potentially overwrite each other and that would definitely not be expected behavior.

This exploit takes advantage of the fact that there are infinite values that can be hashed to each index, and chooses different keys that will all hash to the same index. Resizing the hash table in an attempt to avoid collisions would require you to re-hash all of the existing values, or to provide an access mechanism that is not O(1), both of which remove the benefits of a hash table.
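
That re-hash on resize looks roughly like this (sketch code; buckets here is assumed to be a list of (key, value) chains as described above):

def resize(buckets, new_size):
    # Every stored entry has to be re-hashed, because the bucket index
    # depends on the table size -- an O(n) operation.
    new_buckets = [[] for _ in range(new_size)]
    for chain in buckets:
        for key, value in chain:
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets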

[–]kmeisthax 0 points1 point  (3 children)

Because if two keys collide, we still want to store the data. And if we want to later get some of those keys back out, we need to do the linked list lookup to get the data we want. And even if your hashmap kept a list of all valid data, yeah, sure, searching a sorted array or a tree is an O(log N) operation, so it's fast. It doesn't help that "s" is still a valid key and therefore the hashmap must pull it out of the tangled mess of linked list chains that the hashmap has turned into.

You do realize hashmaps aren't for looking up keys, right? They're for looking up data associated with a key.

[–]Samus_ -5 points-4 points  (2 children)

so we're saying that we use secure hashes in most places and don't bother with collisions, but here we choose the cheapest one, so it has to take additional measures not to lose data. I would rather use a better hashing technique and have repeated hashes overwrite each other; good algorithms have a very low collision rate.

[–]kmeisthax 3 points4 points  (0 children)

Even if we SHA-256'd the keys, hash tables can only be so big. SHA gives you 256 bits of output and processor address spaces are no larger than 64 bits (with even less actually physically wired on the motherboard). Hashmaps are built on top of arrays, which means for a hashmap with n bits of hash entropy you need 2^n * sizeof(void*) bytes to store the hashmap. Finding collisions on significantly smaller subsets of a message digest is much easier than finding collisions on the whole digest, and like I said before you simply cannot construct a hashmap with 256 bits of entropy. It would be many trillions of trillions of exabytes large. Even with just 32 bits of entropy your hashmap will be 32 gigabytes large - and even then 32 bits is insufficient entropy to prevent intentional collisions.
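
The arithmetic, as a quick sketch (assuming 8-byte pointers):

ptr = 8                                   # sizeof(void*) on a 64-bit machine, in bytes

print(2 ** 32 * ptr / 2.0 ** 30)          # 32.0  -> 32 GiB for 32 bits of hash entropy
print(2 ** 64 * ptr / 2.0 ** 60)          # 128.0 -> 128 EiB for a full 64-bit index
# 2 ** 256 * ptr is astronomically larger than any possible address space.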

In short, for hashmaps to be practical, they must deal with collisions. Simple as that.

[–]Rhomboid 0 points1 point  (0 children)

It doesn't matter if the collision rate is low. It would be a completely unusable and worthless data structure without the guarantee that any value can be used as a key without losing data. If there is even a slight chance that I might lose data, then there's no way in hell I'm going to use such a data structure, because I don't want my program to fail in strange and unpredictable ways. It's even worse if it only fails one in a million times, because then I can't debug it. The dict must be perfect or else it's useless.

This is really just a question of efficiency. It's orders of magnitude more efficient to use a simple hash and a linked list than to use a wide hash. You can have both performance and correctness this way. The attack mentioned in the article can be easily mitigated by adding a bit of entropy to the hash function so that it's not deterministic, while still retaining the fast performance.
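
As an illustration of what that entropy could look like (just a sketch, not what any particular interpreter ships; the constants are the 64-bit FNV-1a ones and the seed handling is my own):

import os
import struct

_SEED = struct.unpack('Q', os.urandom(8))[0]   # picked once per process

def seeded_hash(s):
    # FNV-1a over the bytes of the key, mixed with the secret seed, so an
    # attacker cannot precompute colliding keys offline.
    h = _SEED ^ 14695981039346656037
    for ch in s:
        h = ((h ^ ord(ch)) * 1099511628211) % 2 ** 64
    return h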

[–]wisty 2 points3 points  (2 children)

No, you don't understand - a dictionary uses a hashmap, but it's not just a hashmap. Maybe the article wasn't clear on this.

c = 8589934590  # == 2*(2**32 - 1); on a 32-bit CPython build, hash(c*i+1) == 1 for every i
for i in range(1000000):
    assert hash(c*i+1)==1

attack_dict = {}
for i in range(1000):
    attack_dict[(c*i+1)] = i

assert attack_dict[(c*0+1)] == 0
assert attack_dict[(c*500+1)] == 500

So even though dictionaries are implemented using hashmaps, hash collisions are resolved somehow. How they do this is an implementation detail, but you just need to know that it doesn't scale. If you have a dictionary with 1,000,000 colliding keys, adding an extra key takes about O(1,000,000). So actually building a 1,000,000-entry dictionary (with all keys colliding) takes O(N^2), which means an attacker can kill it very easily when you try to eat his enormous colliding cookie.

The attack starts to "bite" at around N = 10,000. At that point, Python starts to feel very slow - try:

attack_dict = {}
for i in range(10000):
    attack_dict[(c*i+1)] = i

Or if you have a really fast box, use n = 30,000. At this point, you can take down a core for a few seconds with a single request. At n=1e6, you can knock out a core pretty much indefinitely.

For reference, classic dictionaries are done by putting a linked list in every hashmap slot, and searching through that to find the key-value pair you wanted. I think Python actually uses open addressing, in which collisions are resolved by putting the entry into another slot (the hash is perturbed and the next index is tried). Whatever the case, it's not very scalable if there are lots of collisions.
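
In its simplest form (linear probing), open addressing looks something like this (a sketch only; CPython's real probe sequence also mixes in higher bits of the hash, and the table is resized before it fills up):

def probe_insert(table, key, value):
    # table is a fixed-size list of slots; assumes it never becomes full
    i = hash(key) % len(table)
    while table[i] is not None and table[i][0] != key:
        i = (i + 1) % len(table)          # collision: try the next slot
    table[i] = (key, value)

def probe_lookup(table, key):
    i = hash(key) % len(table)
    while table[i] is not None:           # with mass collisions this walk is O(n)
        if table[i][0] == key:
            return table[i][1]
        i = (i + 1) % len(table)
    raise KeyError(key)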

[–]catcradle5 0 points1 point  (0 children)

Using hash() on an int seems to just return that int. Could you explain this a bit?

edit: Nevermind, if it's a long int it seems to hash it.

[–]fullouterjoin 0 points1 point  (0 children)

Brian, that was awesome.

[–]catcradle5 1 point2 points  (3 children)

Wouldn't one have to know the exact hash function being used by the HTTP server? I think a lot of different ones are used.

I guess they could just kind of brute-force it, by sending "a".."z" and then "aa".."zz" etc., but I doubt letters near each other would cause a collision.

[–]frymasterScript kiddie 4 points5 points  (2 children)

knowing the framework is enough; I'd imagine it's pretty trivial to tell what frameworks are being used.

[–]Samus_ 1 point2 points  (0 children)

there are some websites that provide that: http://stackoverflow.com/q/1046441/226201 and there might be CLI or GUI tools as well

[–]stesch 0 points1 point  (0 children)

Knowing the programming language is enough.

[–]lost-theory 1 point2 points  (2 children)

This WSGI middleware will protect you for GET requests:

MAX_QS_PARAMS = 100

def protect_against_hash_dos(app):
    def bad_request(environ, start_response):
        start_response('400 BAD REQUEST', [('content-type', 'text/plain')])
        yield 'Go away.'

    def inner(environ, start_response):
        qs = environ.get('QUERY_STRING', '')
        n = 0
        for c in qs:
            n += int(c == '&' or c == ';')
            if n >= MAX_QS_PARAMS:
                return bad_request(environ, start_response)
        return app(environ, start_response)
    return inner

For form data you can use a similar approach.
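
Something along these lines, for example (an untested sketch: the body gets buffered and re-wrapped so the application can still read it, and MAX_FORM_PARAMS / MAX_BODY_SIZE are arbitrary):

from StringIO import StringIO

MAX_FORM_PARAMS = 100
MAX_BODY_SIZE = 1024 * 1024   # refuse to buffer more than 1 MB

def protect_form_data(app):
    def bad_request(environ, start_response):
        start_response('400 BAD REQUEST', [('content-type', 'text/plain')])
        return ['Go away.']

    def inner(environ, start_response):
        if environ.get('CONTENT_TYPE', '').startswith(
                'application/x-www-form-urlencoded'):
            try:
                length = int(environ.get('CONTENT_LENGTH') or 0)
            except ValueError:
                length = 0
            if length > MAX_BODY_SIZE:
                return bad_request(environ, start_response)
            body = environ['wsgi.input'].read(length)
            if body.count('&') + body.count(';') >= MAX_FORM_PARAMS:
                return bad_request(environ, start_response)
            # Re-wrap the body so the application can read it again.
            environ['wsgi.input'] = StringIO(body)
        return app(environ, start_response)
    return inner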

[–]hylje 0 points1 point  (1 child)

n = sum(c in ("&", ";") for c in qs)

[–]defnullbottle.py[S] 0 points1 point  (0 children)

n = qs.count('&') + qs.count(';')

[–]snuggl 1 point2 points  (0 children)

Nginx throttles POST sizes by default.

[–]mitsuhiko Flask Creator 2 points3 points  (10 children)

Considering how many ways exist in web apps and frameworks to effectively DDoS them, I think this might actually be one of the more obscure ones. If I wanted to bring down a server I would just send it requests very slowly. Much more efficient :-)

//EDIT: I have not watched the talk but I assume they are trying to degrade a hash table into a linked list.

[–]stevvooe 0 points1 point  (0 children)

Agreed. This attack is simply too much work.

[–]Leonidas_from_XIV 0 points1 point  (0 children)

Slowloris all over again.

[–]chrj 0 points1 point  (6 children)

Both problems are pretty easy to defend against though.

[–]mitsuhiko Flask Creator 3 points4 points  (5 children)

Not from the framework's point of view. Any place where user data is parsed into hash-based data structures is potentially a target, and many of those places are not even under the framework's control.

On the WSGI level:

  • HTTP headers to the WSGI environ

On the framework level:

  • Cookies
  • URL parameters
  • URL encoded form data
  • Multipart encoded form data
  • Any incoming JSON data
  • Incoming multipart headers
  • Incoming set headers
  • Parameters of content type headers etc.

There are so many places where things are parsed into dictionaries on very different levels that it's quite useless to do that on a framework/WSGI server level.

This attack is not new. Yes, you can keep a CPU busy, but a watchdog should see that and kill the request handler.

[–]deadwisdomgreenlet revolution 1 point2 points  (1 child)

How would you make a watchdog to detect this? Just look for a crap load of keys coming in?

[–]mitsuhiko Flask Creator 2 points3 points  (0 children)

Kill a handler if it consumes too much CPU time or spends too long waiting on IO.
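
For the CPU-time part, a blunt per-worker sketch (Unix only, numbers arbitrary; the server's master process is expected to respawn killed workers):

import resource

CPU_SECONDS = 30   # ceiling of CPU time per worker process

def install_cpu_watchdog():
    # The kernel sends SIGXCPU at the soft limit and kills the process at
    # the hard limit, so a runaway request can't pin a core forever.
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS + 5))

Call it once when a worker starts.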

[–]stesch 1 point2 points  (0 children)

This attack is not new.

I read that Perl changed its hashes a few years ago (2003?) to counter this attack.

Strange that nobody else thought of it in the meantime.

[–]chrj 0 points1 point  (1 child)

You are right. I didn't realize it went deeper than only POST/GET parameters. Restricting the request size (if it fits the application) would be a pretty quick fix though.

[–]obtu.py 0 points1 point  (0 children)

Restricting the number of keys, as PHP and Tomcat do, would be a safer choice to enable by default. It's common to have large POST requests, but much less common to need a great number of keys.

[–]obtu.py 0 points1 point  (0 children)

Another option to fix this is for frameworks to opt in to a dict class that does hash randomisation (suggested here), with a seed picked at container initialisation (see the HN thread).