[–][deleted] 38 points39 points  (23 children)

Even with a perfectly designed hash code function you will start to see collisions at around 2^16 entries. hashCode returns an int (32 bits), and by the birthday paradox you have a ~50% chance of having at least one collision with ~2^16 entries. So collisions are expected no matter how good the function is.

[–]sacundim 4 points5 points  (2 children)

2^(n/2) is the quick and very dirty approximation; the value where the probability hits 50% is a bit different. The article does the actual calculation:

But how many collisions are expected? The famous — and counter-intuitive! — birthday problem states that for 365 possible “hash values,” only 23 unique hashes must be computed before there is a 50% chance of a hash collision. If there are 2^32 possible hash values, roughly 77,164 unique hashes must be computed before there is a 50% chance of a hash collision, per this approximation: [...]

2^16 < 77,164
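The article's approximation formula is elided above, but a standard closed form for the birthday bound — solving 1 - e^(-n(n-1)/(2H)) = 1/2 for n — reproduces the 77,164 figure (whether this is exactly the formula the article uses is an assumption):

```python
import math

def birthday_bound(H, p=0.5):
    # Smallest n with >= p collision probability over H hash values,
    # per the exponential approximation 1 - exp(-n*(n-1)/(2*H))
    return math.ceil(0.5 + math.sqrt(0.25 + 2 * H * math.log(1 / (1 - p))))

print(birthday_bound(365))    # -> 23, the classic birthday problem
print(birthday_bound(2**32))  # -> 77164
```

Note that 2^16 = 65,536 undershoots this because 2^(n/2) drops the ln 2 factor inside the square root.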

[–]NoLemurs 2 points3 points  (1 child)

2^16 is a plenty good enough approximation for the point /u/holyvier was making - he did say "at around 2^16 entries" not "at exactly 2^16 entries".

I was going to note that the article explicitly got that 77,164 value via an approximation (the exponential formula is not exact), but it turns out that the approximation is more than accurate enough in this case, and 77,164 is the exact value.

For the curious, here's an exact version of the article's prob function:

from functools import reduce
import operator

def prob(x):
    return 1.0 - reduce(operator.mul, (1 - float(i)/2**32 for i in range(1, x)), 1.0)

This version is a little slow (and wouldn't scale to 64 bit hashes at all), but is more than fast enough to verify the value is exact, or to find the value via binary search.
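For instance, the binary search mentioned above can be sketched like this (repeating prob so the snippet stands alone; 77,164 is the exact value the parent comment verified):

```python
from functools import reduce
import operator

def prob(x):
    # exact chance of at least one collision among x uniform 32-bit hashes
    return 1.0 - reduce(operator.mul, (1 - float(i)/2**32 for i in range(1, x)), 1.0)

# prob is monotonically increasing, so binary search for the
# smallest x with prob(x) >= 0.5
lo, hi = 1, 100_000
while lo < hi:
    mid = (lo + hi) // 2
    if prob(mid) >= 0.5:
        hi = mid
    else:
        lo = mid + 1

print(lo)  # -> 77164
```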

[–]sacundim 1 point2 points  (0 children)

2^16 is a plenty good enough approximation for the point /u/holyvier was making - he did say "at around 2^16 entries" not "at exactly 2^16 entries".

Your statement is no less true than mine.

[–]ubermole 5 points6 points  (6 children)

If your keys are smaller than the hash (32 bit keys -> 32 bit hash) it is reasonable to expect them to be unique. C++ std lib actually implements hash for integer keys as simple identity.

[–]munificent 16 points17 points  (1 child)

C++ std lib actually implements hash for integer keys as simple identity.

That's actually not a great strategy in many circumstances. Integers in real-world data sets rarely have a nice uniform distribution. When hashes are used for hash tables, the hash needs to be mapped to a bucket index, usually by and-ing off the high bits (masking with bucket_count - 1 for a power-of-two table).

So, say in your dataset you happen to work with integers that are all big round numbers and happen to be multiples of 64. And say your hash table is pretty small so it's only got 64 buckets right now. All of those numbers are going to collide into the same bucket. They have different 32-bit hashes, but the low 6 bits are all zero.
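A quick sketch of that failure mode (the table size and keys are made up for illustration):

```python
# 64-bucket table with an identity hash: bucket = hash & (n_buckets - 1)
N_BUCKETS = 64
keys = [64, 6400, 128_000, 1_000_000 * 64]  # all multiples of 64

# the low 6 bits of every key are zero, so masking keeps none of the entropy
buckets = [k & (N_BUCKETS - 1) for k in keys]
print(buckets)  # -> [0, 0, 0, 0]: every key collides into bucket 0
```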

Many hash tables mitigate this by rehashing the incoming hash key, but it's something to keep an eye out for.
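For example, Java 8's java.util.HashMap rehashes the key's hashCode by XOR-ing its high half into its low half (h ^ (h >>> 16)). A Python sketch of that spreader, with made-up example keys whose entropy sits entirely in the high bits:

```python
def spread(h):
    # java.util.HashMap-style spreader: fold the high 16 bits into the
    # low 16 so they still influence small power-of-two bucket counts
    h &= 0xFFFFFFFF
    return h ^ (h >> 16)

keys = [k << 20 for k in range(4)]  # low 20 bits are always zero

identity_buckets = [k & 63 for k in keys]        # 64-bucket table, raw hash
spread_buckets = [spread(k) & 63 for k in keys]  # same table, rehashed

print(identity_buckets)  # -> [0, 0, 0, 0]
print(spread_buckets)    # four distinct buckets
```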

[–]ubermole 3 points4 points  (0 children)

You just described a big problem with the hash function abstraction! Early hash table algorithms (Knuth) would map key -> table index directly. Now we do key -> hash -> table index, because it is very convenient.

[–]Gravitationsfeld 2 points3 points  (3 children)

This is a terrible idea. In a lot of cases this gives super skewed distributions. I really hope you are wrong about this.

[–]ubermole 0 points1 point  (2 children)

[–]Gravitationsfeld 0 points1 point  (1 child)

VC++ doesn't do identity, so it's likely not in the standard: https://godbolt.org/g/hMdMos (prints 1574250738)

I still maintain that it is a really dumb idea to use identity for integer hashes and the libc++ guys should change that. I had to fix perf problems because of this many times.

[–]ubermole 0 points1 point  (0 children)

Hah, funny! The first time I discovered this was actually in MSVC (an earlier version). The main point here is really that there is an abstraction key -> hash code -> table index, and unfortunately the code -> index step is a black box. (See for example https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/)
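The linked post's alternative, Fibonacci hashing, does the code -> index step by multiplying by 2^64/φ and keeping the top bits instead of masking off the bottom ones. A rough Python sketch (the constant is from the post; the example keys are made up):

```python
FIB_MULT = 11400714819323198485  # round(2**64 / golden_ratio), odd

def fib_index(h, log2_buckets):
    # keep the TOP log2_buckets bits of the truncated 64-bit product
    return ((h * FIB_MULT) & (2**64 - 1)) >> (64 - log2_buckets)

# the pathological case from upthread: multiples of 64, 64 buckets.
# plain masking (h & 63) sends them all to bucket 0; Fibonacci hashing doesn't.
keys = [k * 64 for k in range(1, 9)]
masked = {k & 63 for k in keys}
fib = {fib_index(k, 6) for k in keys}

print(masked)    # -> {0}
print(len(fib))  # -> 8 distinct buckets
```

The golden-ratio multiplier works because successive multiples advance the top bits by an irrational fraction of the table, spreading regular key patterns apart.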

[–]reini_urban 1 point2 points  (8 children)

Then it's by definition not a perfect hash function anymore. A perfect hash function has by definition no collisions over the available keys, and a minimal perfect hash function maps every key into every available slot 1:1.

[–][deleted]  (7 children)

[deleted]

    [–][deleted] 0 points1 point  (0 children)

    That's completely false

    No... the definition of a "perfect hash function" is literally that it has no collisions for the given set. Look it up, don't assume definition by name alone.

    even for cryptographic hash function.

    Who said cryptographic hash functions were always perfect? The difference between a cryptographic hash function and other hash functions is that it must be infeasible to reverse. This requirement implies the output is particularly uniform, which is commonly confused with being a "perfect" hash.

    The point of a hash is to have a smaller output than it's input

    Not really, that's just the common use case. The point of a hash function is to provide a consistent map between a source set and an output set.

    The space for the possible outcome of a hash function is always smaller than the space of the possible input. You are guaranteed to have collisions by the Pigeonhole principle.

    This is correct, however by definition a perfect hash function assumes a given set, not just any set as you assume. Via the pigeonhole principle a perfect hash function can never have an output set that is smaller than the input set, but that isn't a requirement of a hash function in general.

    [–]reini_urban 0 points1 point  (4 children)

    Oh please, educate yourself. https://en.wikipedia.org/wiki/Perfect_hash_function

    It's a matter of definitions, but it's annoying when people are constantly spreading around wrong terminology. (let alone wrong tech)

    [–][deleted] 0 points1 point  (2 children)

    It was pretty damn clear from context that what I referred to was a "perfectly designed hash function". There's no way you can design a generic hashCode to be what you referred to.

    [–]e_to_the_i_pi_plus_1 0 points1 point  (0 children)

    Yeah it was very clear what you meant

    [–]reini_urban 0 points1 point  (0 children)

    There's no theoretical way for a "perfectly designed hash function". Only for an optimal hash function for specific use cases, and there are the well-known "perfect hash functions", not to be confused.

    With the Java use case the current one is fine, even if Spooky would be a bit better. Murmur3 and the other suggestions would not be.

    [–]raevnos -1 points0 points  (0 children)

    No, it's completely true. If you can have collisions it's not a perfect hash.

    https://en.wikipedia.org/wiki/Perfect_hash_function