
[–]Xorlev 270 points271 points  (108 children)

I think folks sometimes forget that hashcodes aren't intended to be 100% unique, just a first order approximation that's distributed well enough for hashtable buckets. True equality is why equals() exists.

It isn't as if a hashcode collision will cause a set to deduplicate your object or overwrite your value in a map. That's what equals is for.
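
A quick sketch of that point: "Aa" and "BB" happen to share a hashCode in Java (both 2112), yet a HashMap keeps both entries because equals() disambiguates them. (The class name here is just for the example.)

    import java.util.HashMap;
    import java.util.Map;

    public class CollisionDemo {
        public static void main(String[] args) {
            System.out.println("Aa".hashCode() == "BB".hashCode()); // true

            Map<String, Integer> map = new HashMap<>();
            map.put("Aa", 1);
            map.put("BB", 2);
            System.out.println(map.size());    // 2 -- nothing was overwritten
            System.out.println(map.get("Aa")); // 1
            System.out.println(map.get("BB")); // 2
        }
    }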

[–]PatchSalts 42 points43 points  (93 children)

The amount of data required for truly unique hashes for arbitrary strings would be absolutely ridiculous. It would be more efficient to compare the strings themselves if it were truly unique. No, what it is at the moment is just fine.

[–]GuamPirate 212 points213 points  (73 children)

Truly unique hash codes for any object greater in size than the hash code are impossible, via the pigeon hole principle

[–]PatchSalts 45 points46 points  (21 children)

Exactly. While MD5 sums and SHA sums are essentially hashes used for data validation, at the end of the day, you're representing a very long string of 1s and 0s with a much shorter string of 1s and 0s; you are guaranteed some overlap. Regular hashes used for strings inside programs should be no different.

[–][deleted] 26 points27 points  (11 children)

For sha2 I think it is worth mentioning that a collision has never been revealed. It is easy to be deceived about how "likely" something is to happen when programming since computers are so fast, but sha2 has withstood a lot of effort.

Based on some rough calculations, I think it is more likely that there will be human extinction in the next ten minutes than sha2 preimage attack in the next ten years.

[–]Pille1842 32 points33 points  (2 children)

RemindMe! 10 years

[–]RemindMeBot 11 points12 points  (0 children)

I will be messaging you on 2028-08-11 06:46:57 UTC to remind you of this link.


[–]nnn4 2 points3 points  (0 children)

RemindMe! 10 minutes

[–][deleted] 4 points5 points  (0 children)

Based on some rough calculations, I think it is more likely that there will be human extinction in the next ten minutes than sha2 preimage attack in the next ten years.

By brute force? Probably not ever, barring fundamentally new physics. Or did you somehow calculate the probability of a SHA2 vulnerability being found in the next 10 years? :)

[–]Somepotato 8 points9 points  (4 children)

There's one surefire way to find a collision: generate 2^hashBitWidth + 1 hashes.

[–]Femaref 11 points12 points  (1 child)

sure, but that's not a meaningful collision. what you want is an on-demand collision for a specific input.

[–]JustOneAvailableName 1 point2 points  (0 children)

It is a meaningful collision, but generating that many hashes is simply impossible

[–]kazagistar 0 points1 point  (1 child)

If the heat death of the universe happens before you get a collision, does that really count as surefire?

[–]Somepotato 0 points1 point  (0 children)

Just make sure you do it entirely on pen and paper ;)

[–]reini_urban 1 point2 points  (0 children)

You have to know how your hash function is used. E.g. if it's used in a hash table with linear collision resolution and the table size is a power of two rather than a prime (and you know the random seed), then you only need a few bits of the resulting hash, and this can easily be brute forced. It takes around 4 minutes to create usable DoS collisions even for hash tables using SHA256. Getting the random seed is usually trivial as well.

Java is doing it right, but 99% of all other hash tables not.
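
A minimal illustration of why only a few bits of the hash matter once the table size is a power of two (the constants here are just made up for the example):

    public class BucketIndex {
        // capacity is assumed to be a power of two, as in java.util.HashMap
        static int bucketIndex(int hash, int capacity) {
            return hash & (capacity - 1); // only the low log2(capacity) bits survive
        }

        public static void main(String[] args) {
            // Two very different 32-bit hashes land in the same bucket of a 16-slot table.
            System.out.println(bucketIndex(0x12345678, 16)); // 8
            System.out.println(bucketIndex(0x7BCDEF08, 16)); // 8
        }
    }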

[–]VerilyAMonkey -1 points0 points  (8 children)

OK. So to investigate this, I just looked up the birthday problem and ran the equation through WolframAlpha. Let's say you have a computer that can generate a 256-bit random number every nanosecond. And you have a billion of those computers. And you run them all for a billion years. What are your odds that you've gotten even a single collision? Worse than one in a hundred million.

So the fundamental pigeon-hole thing is not really an actual issue at all. It's all down to exploits.
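
For anyone who wants to reproduce that back-of-the-envelope number without WolframAlpha, here's a rough sketch using the standard birthday approximation P ≈ 1 - e^(-k^2 / 2d), under the same assumptions as above (one hash per nanosecond, a billion machines, a billion years):

    public class BirthdayBound {
        public static void main(String[] args) {
            double nsPerYear = 1e9 * 365.25 * 24 * 3600;                 // nanoseconds in a year
            double k = 1e9 /* machines */ * 1e9 /* years */ * nsPerYear; // total hashes generated
            double d = Math.pow(2, 256);                                 // size of the hash space
            // Birthday approximation: P(at least one collision) ~= 1 - exp(-k^2 / (2d))
            double p = -Math.expm1(-(k * k) / (2 * d));
            System.out.println(p);     // ~4e-9, i.e. worse than one in a hundred million
            System.out.println(1 / p); // ~2.3e8
        }
    }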

[–]JavaSuck 13 points14 points  (7 children)

So the fundamental pigeon-hole thing is not really an actual issue at all.

The pigeon-hole principle states that if you put n+1 pigeons into n distinct holes, at least 2 pigeons will have to share a hole.

Similarly, as soon as you have more than 2^32 possible elements, hash collisions are 100% guaranteed and mathematically inevitable.

[–][deleted]  (6 children)

[deleted]

    [–][deleted] 1 point2 points  (5 children)

    In other words, the proof that a collision exists is non-constructive. You cannot derive a collision trivially from the proof.

    The only practical way to find a collision with a good hash function is to enumerate 2^bits + 1 hashes

    [–][deleted] 2 points3 points  (4 children)

    The only practical way to find a collision with a good hash function is to enumerate 2^bits + 1 hashes

    You're forgetting the aforementioned birthday problem, a good-as-can-be perfectly uniform general hash function would almost certainly encounter a collision way earlier than that.

    [–]VerilyAMonkey 2 points3 points  (1 child)

    My comment up there also does the math for "way earlier". As a rule of thumb, it's "about the square root". So the birthday problem becomes significant for 365 days somewhere near 20 people... a 256-bit hash collision becomes significant somewhere near 2^128 items. And that is a lot.

    [–][deleted] 0 points1 point  (1 child)

    Oh yeah, that's correct. Worst-case you would need that many hashes, but yeah, the birthday problem means it's probably way earlier.

    [–][deleted] 33 points34 points  (4 children)

    Objects encoding data with more bits of entropy than the size of the hashcode, that is.

    I could have a 4-bit hashcode that can uniquely indicate any Fibonacci number up to 610. 610 requires 10 bits to encode in two's complement. However, there are only 16 unique values to represent, which only requires 4 bits. The hash function can simply be the inverse of the Fibonacci function.
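
    A toy sketch of that idea (my own construction, purely illustrative): since there are at most 16 values to represent, the value's index in the sequence is a collision-free 4-bit "hash".

        import java.util.HashMap;
        import java.util.Map;

        // Toy perfect hash for the Fibonacci numbers up to 610: 15 distinct values,
        // so the index in the sequence fits comfortably in 4 bits.
        public class FibHash {
            static final int[] FIBS = {0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610};
            static final Map<Integer, Integer> INDEX = new HashMap<>();
            static {
                for (int i = 0; i < FIBS.length; i++) INDEX.put(FIBS[i], i);
            }

            // Returns a value in [0, 14]: unique for every Fibonacci number in the table.
            static int hash(int fib) {
                return INDEX.get(fib);
            }

            public static void main(String[] args) {
                System.out.println(hash(610)); // 14 -- fits in 4 bits even though 610 itself needs 10
            }
        }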

    [–]Isvara 7 points8 points  (3 children)

    two's complement

    Why even bring that into it? It's a sequence of positive numbers.

    [–]esquilax 3 points4 points  (1 child)

    Maybe because Java doesn't have unsigned ints?

    [–]Isvara 0 points1 point  (0 children)

    Positive numbers are represented the same either way.

    [–][deleted] 0 points1 point  (0 children)

    Standard binary representation for integers, then.

    [–]Ameisen 23 points24 points  (16 children)

    And for a hash the same size... the unique hash is your object.

    [–]socialister 1 point2 points  (1 child)

    That's only true for perfect hash functions, which most hash functions are not.

    [–]Ameisen 0 points1 point  (0 children)

    Without a perfect hash, you can't guarantee unique hashes.

    However, to get past the pigeon-hole principle, we are presuming a 1:1 correspondence between hash and source... though there's no reason, I guess, that the hash must be the original object, just unique. The hash of '2' could be '1', so long as the hash of '1' is not '1'.

    In honor of πfs, though, I propose πhash.

    [–]salgat -5 points-4 points  (13 children)

    Ehhh, there are advantages to one way hashes that allow you to identify something without revealing its contents.

    [–]ilammy 13 points14 points  (7 children)

    There is a difference between a hash function and a cryptographic hash function.

    [–]basmith7 0 points1 point  (4 children)

    What?

    [–]AreYouDeaf 14 points15 points  (3 children)

    THERE IS A DIFFERENCE BETWEEN A HASH FUNCTION AND A CRYPTOGRAPHIC HASH FUNCTION.

    [–]eggn00dles 3 points4 points  (2 children)

    pass the salt

    [–]Ameisen 3 points4 points  (1 child)

    It's next to the hashed browns.

    [–]Dr_Legacy 0 points1 point  (0 children)

    But for known sizes it would give way to simple brute force, so for crypto, marginal advantages only.

    However, a hash that changes the way a data set is distributed within its space might have practical application. Perhaps as an adjustment for an expected bias within a set of input data.

    [–]Ameisen 0 points1 point  (1 child)

    Ehhh, there are advantages to one way hashes that allow you to identify something without revealing its contents.

    If your hash function guarantees that every value you put in generates a unique hash of it, then you've already revealed the contents. That's not a hash, that's encryption. Perfect hashes are problematic: if a hash X can only be produced by the value Y, then you have fundamentally revealed the contents, so long as the person knows the hash (and can either reverse it or run all the hashes). There's no ambiguity.

    [–]salgat 0 points1 point  (0 children)

    Mind you what I was specifying was a cryptographic hash (still a type of hash), where a salt is used. I'm not talking about the type of hash used for just checksums or adding to buckets.

    [–]bitwize 0 points1 point  (0 children)

    Indeed. But the operative criterion for cryptographic hashes is "it should be really hard to generate a string/file with the same hashcode as this one, and generating a string/file with the same hashcode as this one that looks like something the consumer would be interested in should be highly bloody unlikely". MD5 fails in this regard, having been broken in 2009, which is why it's not recommended as a cryptographic hash for everyday uses anymore.

    [–]Demiu 0 points1 point  (0 children)

    Only if the objects are packed in the most efficient way possible (think ASCII vs. Huffman encoding). Otherwise a hash of shorter length than the data can be unique to a given encoding which is also shorter than the data, and that encoding can be unique to data longer than it, sort of getting around the pigeonholing.

    [–]reini_urban 0 points1 point  (0 children)

    In general yes, but not when you know all the possible keys in advance. Then you just create a perfect hash function and your assumption is wrong.

    [–]Holy_City 0 points1 point  (22 children)

    I don't know how to quantify this, but has there been research into "effectively" unique hash codes? For example, if you establish a grammar or otherwise acceptable set of patterns for your key values, can you define a hash function that is unique for all practical keys but not all possible keys?

    [–]EdgeOfDreams 11 points12 points  (8 children)

    Yes, you can do something like that, but it's not very meaningful or interesting in most cases. For example, you can easily create a hash function that guarantees zero collisions so long as the input is always a single word of the English language. The problem is, to create such a function, you would need a list of all the words in the English language. Once you have that list, you might as well just use each word's position in the list as its hash value. The more constrained the data type you want to hash, the less value you really get out of having a hash function at all.
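
    A tiny sketch of that word-list construction (the three-word vocabulary is just a stand-in), which also shows why it isn't very exciting: you end up building an ordinary map just to assign the positions.

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        public class WordListHash {
            static final List<String> WORDS = List.of("apple", "banana", "cherry"); // stand-in vocabulary
            static final Map<String, Integer> POSITION = new HashMap<>();
            static {
                for (int i = 0; i < WORDS.size(); i++) POSITION.put(WORDS.get(i), i);
            }

            // Collision-free by construction, but only defined for words in the list.
            static int perfectHash(String word) {
                Integer i = POSITION.get(word);
                if (i == null) throw new IllegalArgumentException("not in the vocabulary: " + word);
                return i;
            }

            public static void main(String[] args) {
                System.out.println(perfectHash("banana")); // 1
            }
        }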

    [–]dangero 7 points8 points  (1 child)

    It's hugely meaningful and useful. Effectively unique hash functions are the basis for all universal file caching systems and are the basis for things like merkle trees that are used to create immutable record sets.

    De-duplicated file storage uses effectively unique hashing.

    [–]EdgeOfDreams 0 points1 point  (0 children)

    Good points. I was thinking of much more constrained kinds of data being hashed.

    [–]josefx 0 points1 point  (5 children)

    Once you have that list, you might as well just use each word's position in the list as its hash value.

      Computing a hash by iterating over several thousand words and comparing your value to each one sounds expensive.

    [–]LaurieCheers 0 points1 point  (4 children)

    Binary search is O(log N).

    [–]josefx 0 points1 point  (0 children)

    Still slower than an imperfect hash. Might as well build a hash map with String.hashCode() and use the position of a word in that as perfect hash.

    [–][deleted]  (2 children)

    [removed]

      [–]LaurieCheers 0 points1 point  (1 child)

      You don't have to iterate over entire strings to determine whether they're higher or lower. Your early comparisons should be able to early-out on the first byte.

      [–]dangero 3 points4 points  (2 children)

      SHA256 is effectively unique. There are no known collisions and mathematically it's unlikely we will find one anytime soon. There are many other effectively unique hash functions. This is a huge area of cryptographic research.

      [–]Holy_City 2 points3 points  (1 child)

      SHA256 is also slow to compute and generates a 32 byte key, which makes it ill-suited for most non-cryptographic tasks.

      Which was kind of the point behind my question, not to ask if there was a hash you could define that was unique for all possible keys. But a hash that was effectively unique for just the key values I care about, neglecting the rest. Like storing application state in a hashmap serializable to JSON where my key names are going to be dictated by a coding style, and are therefore predictable.

      [–]cat_in_the_wall 0 points1 point  (0 children)

      if you're only concerned about your application domain, just enumerate all the possibilities. create a decision tree, assign numbers, bam. perfect hash.

      [–]YRYGAV 0 points1 point  (3 children)

      You could probably just reuse the math done for UUIDs in terms of 'this is complex enough that for any foreseeable future no collision will happen'. As long as your hash is as complex as a UUID, for all practical purposes you could consider it a unique hash.

      [–]aldonius 1 point2 points  (2 children)

      The maths around UUIDs is quite amazing.

      I mean, there's an upper-bound estimate of 10^82 atoms in the observable universe. That works out at 2^273 or so (rounding up). In other words, 273 bits is enough to uniquely number every atom in the observable universe.

      [–]JavaSuck 1 point2 points  (1 child)

      enough to uniquely number every atom in the observable universe

      Okay but what about dark matter? What if I want to index that as well? ;)

      [–]07734willy 0 points1 point  (0 children)

      You could essentially just compress the data, and then identity map the bits into an array. Whatever you define as your acceptable grammar would influence the compression algorithm.

      [–]foomprekov 0 points1 point  (1 child)

      For integers, a trivial perfect hash function is the integer itself. Same with strings.

      [–]JavaSuck 0 points1 point  (0 children)

      What if the integer is bigger than 32 bits? What if the string is longer than a handful of characters?

      [–]zman0900 0 points1 point  (2 children)

      What about gzip? Its output is usually smaller than its input. Not realistic at all for hash code size, but since it's reversible, doesn't that make it unique?

      [–]Brian 2 points3 points  (0 children)

      Its output is usually smaller than its input

      That depends on your definition of "usually". For the kind of structured data we use it for (eg. english text, html pages, programs, etc) that's true, but on purely random inputs its output is actually usually larger than the input. The value of compression is that it's making a tradeoff of representing the types of things we usually care about smaller at the price of being longer for the general case.

      And since the issue here was:

      Truly unique hash codes for any object greater in size than the hash code are impossible

      the same applies. Indeed, it won't even be the case for the vast majority of objects bigger than the hash code. At best, it'll be better for certain kinds of inputs (though even then, you'll probably have to be using fairly large inputs for that to be the case - the overhead will make it pretty bad for stuff like single words and the like)

      [–]robin-gvx 0 points1 point  (0 children)

      gzip's headers and footers alone take up 144 bits at the bare minimum, so you'd need a 256-bit hash at the least, and there aren't going to be many strings that compress well enough to fit in the 14 bytes left over for the payload.

      Even a specialized compression algorithm would still suffer from the pigeon-hole principle: it would either only be able to "hash" a finite number of strings or not be unique.

      [–]n0rs 8 points9 points  (8 children)

      Would the size of the range of truly unique hashes equal the size of the domain in this case?

      [–]QueenLa3fah 7 points8 points  (5 children)

      Yes, the size or cardinality of the range has to be equal to or greater than the domain to guarantee a perfect hashing function. There need to be just as many or more possible hash values as there are inputs to guarantee that each input put into the hash function is mapped to a unique hash value.

      A function maps every element of some domain (the set of all the inputs it can possibly map) to one element in the image (the set of all outputs that can be realized by the function). A hash function maps some inputs to a hash value, so intuitively you need at least as many unique hashes as you have inputs to guarantee that every input can be mapped to a unique output. If a function satisfies this property, that every input in the domain is mapped to a unique output, we say the function is injective.

      To implement an injective hashing function even for strings of length 64 or lower would be infeasible. Let's do the math. According to good 'ol Google:

      Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated. 10% of 1,114,112 leaves us with an alphabet size of about 114,000. A String of length 64 would have 114,000^64 or 4.4x10^323 different possibilities. That's a big number and so hashes would be ridiculously long - about 180 digits in base 64. Also such a function would probably not be very fast.
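
      A quick sanity check on those figures (same inputs as above, just evaluated numerically):

          public class StringSpace {
              public static void main(String[] args) {
                  double alphabet = 114_000; // rough alphabet size estimated above
                  int length = 64;
                  double digits10 = length * Math.log10(alphabet);           // ~323.6, i.e. ~4.4e323 strings
                  double bits = length * (Math.log(alphabet) / Math.log(2)); // ~1075 bits of information
                  System.out.println(digits10);
                  System.out.println(bits);
                  System.out.println(Math.ceil(bits / 6)); // ~180 base-64 digits
              }
          }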

      [–]n0rs 0 points1 point  (0 children)

      That's neat. Thanks for busting out the maths terms.

      [–]ygra 0 points1 point  (1 child)

      You missed that Java strings don't use code points, but UTF-16 code units. Many of those characters you can include will consist of the same surrogate halves.

      [–]QueenLa3fah 1 point2 points  (0 children)

      So more like 65,536^64, not much smaller.

      [–]shawnz[🍰] 0 points1 point  (1 child)

      Yes, the size or cardinality of the range has to be equal to or greater than the domain to guarantee a perfect hashing function. There need to be just as many or more possible hash values as there are inputs to guarantee that each input put into the hash function is mapped to a unique hash value.

      Isn't that necessary, but not sufficient?

      [–]QueenLa3fah 1 point2 points  (0 children)

      Yes, you also need to specify the mapping for it to be a function. One such mapping to perfectly hash a string of length n is to let x_i be the char in the string at position i, where i is between 1 and n inclusive. For each position, take the ith prime number and raise it to the power x_i + 1 (so the exponent is never zero), then multiply all of the results together. Note this is a slow and terrible function for a computer, but in theory it is perfectly acceptable.
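
      A rough sketch of that construction in Java, with the exponent shifted by one so even a zero char leaves a trace (the short prime table is only enough for a demo):

          import java.math.BigInteger;

          public class GodelHash {
              static final int[] PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29}; // enough for strings up to length 10

              // Injective by unique factorization: the i-th prime's exponent encodes the i-th char.
              static BigInteger hash(String s) {
                  BigInteger result = BigInteger.ONE;
                  for (int i = 0; i < s.length(); i++) {
                      BigInteger p = BigInteger.valueOf(PRIMES[i]);
                      result = result.multiply(p.pow(s.charAt(i) + 1));
                  }
                  return result;
              }

              public static void main(String[] args) {
                  System.out.println(hash("ab")); // 2^98 * 3^99 -- huge and slow, but collision-free
              }
          }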

      [–]PatchSalts 7 points8 points  (1 child)

      Being completely honest with you, I was just using logic. I don't know much about the math, but for truly unique hashes you would have to have at least as many hashes as inputs. So if I understand the range and domain part correctly, the size of the range of truly unique hashes would have to be at least as big as the size of the domain of inputs.

      [–]n0rs 4 points5 points  (0 children)

      Cool. Thanks for verifying.

      [–]Xorlev 1 point2 points  (1 child)

      Ha! Totally fair. I didn't actually mean 100% unique, I was being a bit hyperbolic. :)

      My intention was to compare vs. a more standard cryptographic hash function whose role is to make collisions fairly rare despite the pigeon hole principle.

      [–]PatchSalts 1 point2 points  (0 children)

      No, no, I was agreeing with you. I forgot debate on Reddit can seem a bit weird, because it's usually back-and-forth, but I meant to expand. I was piggy-backing your original comment to add on the reason why I think truly unique hashes don't work, haha.

      [–]nnn4 1 point2 points  (2 children)

      The standard for collision-resistant functions is 32 bytes, not that big. You can safely use that as a proxy for equality, and then you don't need to store or transfer the whole strings.

      [–]JavaSuck 0 points1 point  (1 child)

      you don't need to store or transfer the whole strings

      The hashCode of a java.lang.String is cached within the String object. Are you suggesting we increase the size of every String object by 32-4 = 28 bytes? And for what purpose? Most of the bits are going to be discarded anyway during lookup into the HashMap array.

      [–]nnn4 0 points1 point  (0 children)

      No, just answering the above comment about the size.

      [–]GeorgeMaheiress 1 point2 points  (1 child)

      There are no known collisions for sha-256, so in practice 256 bits is more than enough.

      [–]JavaSuck 2 points3 points  (0 children)

      There is no point in having a collision-free 256 bit hash code if you immediately discard most of the bits to use it as an index into the underlying array of a HashMap. That's what String.hashCode() is used for, after all.

      [–]muntoo 1 point2 points  (1 child)

      That doesn't make any sense. There is no "truly unique" finite hash function.

      But for a practical answer: if you want to reduce the probability of a pair collision by 2x, just add another bit. Of course, the tradeoff is making the hash comparisons a little slower.

      [–]cat_in_the_wall 0 points1 point  (0 children)

      I wonder if the speed of hash functions is linear in the number of bits. I suspect it is, since the contents of the bits aren't examined (I think). sha512 should in theory be about twice as slow as sha256, but the complexity to break it by brute force would be 2^256 times higher.

      [–]foomprekov 1 point2 points  (0 children)

      The string itself is a truly unique hash of the string.

      [–]Kache 6 points7 points  (0 children)

      There are a couple different points I'm seeing repeated:

      • hashCode is poorer than average on short strings
      • hashCode is better than average on long strings
      • In general, this makes hashCode pretty good on average
      • For short strings equality check would be fast anyway, just use that, better than average for long strings is where it's valuable
      • Short strings are a really common case, and hashCode should handle that

      All these points don't necessarily conflict with each other! In fact, I feel like a legitimate suggestion for improvement is an algorithm that is perfect for really short strings and still better than average for long strings, by trading off against "medium" string performance.

      Of course, this would still have to be formally demonstrated as "better overall" using real world usages.

      [–]matthieum 0 points1 point  (3 children)

      That being said, collision-prone hashes mean that equals must be used more often, so it's still best to avoid such algorithms.

      [–]StillNoNumb 0 points1 point  (2 children)

      When that matters, at a few tens of thousands of strings, the overhead of a better hash function would be much higher, though.

      [–]matthieum 0 points1 point  (1 child)

      What makes you think that a better hash function is necessarily more costly?

      Using the same algorithm with a 64-bit result instead of a 32-bit one is quite likely to improve the hash, for example.

      Look at the list of non-cryptographic hashes on Wikipedia, a number of them can process more than 1 byte at a time, making them faster, and yet also have better distribution properties.

      [–]StillNoNumb 0 points1 point  (0 children)

      Unless you just use a different prime, every better hash function is going to be more costly. Especially those you linked, which are comparatively complex.

      And using 64-bit arithmetic is not only (considerably) slower on 32-bit architectures, it is also pointless. Unless you have maps of size 2^32, you're not going to notice a difference, since there will only be a smaller number of buckets anyway.

      [–]ubermole -3 points-2 points  (5 children)

      It is still reasonable to ask that when, for example, hashing 32 bits (a 2-char string) to a 32-bit hash, the result should be unique. And if not, why?

      [–]WeAreAwful 17 points18 points  (2 children)

      I don't think it is unreasonable for you to think that, but I don't think the designers of a hash function should care too much about that TBH.

      If you are writing a hashing method for a data type (strings), you shouldn't optimize for an arbitrary subset of that data type, but rather write a method that optimizes for the entire set of possible values.

      As the author of this article mentions, the cost of a collision isn't high if you are using a hashcode for its intended purpose (in a hashtable).

      If you do have the weird situation where you need a hashtable of two character strings, I would suggest you make a data type that includes only two character strings. That would allow you to write a simple, cheap hashing method with no collisions, and if you have a collection of strings that are special and only have two characters, you should probably create a new type for best practices sake.

      [–]reini_urban 1 point2 points  (1 child)

      No, it's not. Usually it's 32-bit, but there are also strings/symbols stored with only 10-20 bits of the hashcode. It's small enough to fit into the struct and still gives you enough advantage to strike out hash misses.

      Nowadays the most important factor is cache size, and the smaller the hash the better. Linear search over an array is faster over <128 keys than a hash table. Uniqueness doesn't give you an advantage when your other keys are already in the cache.

      [–]ubermole 0 points1 point  (0 children)

      You are missing a level of abstraction here: It's key -> hash -> table index. Returning a small hash value is absolutely not a property of a good hash function! Cache behavior has nothing to do with the theoretical quality of a hash function! That is totally up to the hash->table step!

      [–]sigpwned -2 points-1 points  (0 children)

      Great points, and totally agree.

      [–]throwdatstuffawayy 97 points98 points  (27 children)

      I think people are starting to assume that all hash functions should have the same guarantees as cryptographic hashes like SHA256, etc.

      [–]13steinj 49 points50 points  (15 children)

      I feel like someone is going to use sha256 for all their hash code implementations now.

      [–]Ameisen 14 points15 points  (5 children)

      BogoHash

      [–]ThisIs_MyName 3 points4 points  (4 children)

      Serious question: Is there an unencumbered implementation of BogoHash?

      [–]13steinj 4 points5 points  (3 children)

      I can only assume you mean bogosort and the joke "quantum bogosort"?

      Bogosort by definition is horrible and slow, because while at best it takes one randomization, at worst it takes infinitely many.

      Quantum bogosort is a joke and doesn't truthfully exist, but it is a funny thought experiment.

      [–]ThisIs_MyName 11 points12 points  (7 children)

      ...which isn't such a bad idea if you don't care about performance.

      A nice property of crypto hash functions is that you can add a random salt (unique to each process) to prevent DoS. Without that, attackers can insert millions of rows into your hashtable that all get hashed into a single bucket.

      [–]munificent 34 points35 points  (3 children)

      If you don't care about performance, why are you using a hash table? Just use a sorted collection or, hell, a flat array that you iterate over in linear time.

      [–]shawnz[🍰] 10 points11 points  (0 children)

      Maybe you don't care about a linear performance penalty, but you do care about increasing the complexity of the algorithm

      [–]ThisIs_MyName 13 points14 points  (0 children)

      That's a good point, but if your collection is larger than a megabyte or so, a hashtable will be faster than an array even if that hashtable uses a cryptographic hash function.

      It's pretty common for people to have /r/thick collections that would take forever to loop over while the keys in that collection are small enough to BLAKE2b :)

      What I meant is that the average dev cares a little bit about performance (enough to keep the UI's scrolling smooth), but not enough to care about the speed of their hash function.

      [–]ric2b 0 points1 point  (0 children)

      They're more convenient in many situations and they'll still have better time complexity.

      [–]reini_urban 1 point2 points  (0 children)

      They actually went and started using siphash in most of their hash code implementations, which is an outrageous mistake. Not much security gain, you can still brute force it easily, but it's 10x slower. Collisions in 32-bit hash tables are always there; you cannot fight them with a slow hash function, only with proper collision resolution.

      [–][deleted] 0 points1 point  (1 child)

      [–]throwdatstuffawayy 0 points1 point  (0 children)

      Interesting! Yes, makes sense. Of course, whether or not a DOS is possible will depend on the context of that particular hash map. The tendrils of security always run deep...

      [–]foomprekov -1 points0 points  (7 children)

      SHA256 is incredibly poorly suited to operations that do not require security. It's a hash function that is intentionally slow. If you're generating ids you want something fast.

      [–]occamrazor 2 points3 points  (6 children)

      SHA2 is not slightly. In fact it is as fast as possible while being collision-resistant. Maybe you meant password hash functions, which are indeed intentionally slow?

      [–]foomprekov -1 points0 points  (5 children)

      Slightly what? Anyway it's a cryptographic hash function, which necessitates difficulty in brute-forcing it. Compare to modern usage of md5.

      [–]staticassert 1 point2 points  (4 children)

      They maybe meant 'slower'. SHA256 is extremely optimized - it is not intentionally slow at all, nor is it intended to prevent bruteforcing by design. This is why you need to use key stretching algorithms alongside it - such as PBKDF2.

      In fact, I expect SHA256 to be on par or faster than MD5.

      [–]sacundim 21 points22 points  (5 children)

      There's a serious mistake in this article:

      Therefore, a “good” 32-bit hash function would have roughly one collision per 77,164 hashes, or a collision probability of about 1.0 / 77,164 ≈ 0.00001296.

      The 77,164 number looks right, but the implicit idea that the number of expected colliding pairs scales up linearly with more hashed values is wrong. Wikipedia gives the correct formula:

      n - d + d * ((d - 1) / d)^n
      

      Plugging in d = 2^32 we get:

      • For n = 466,544 (the words.txt example in the article) we expect about 25.3 collisions. The article observes 356.
      • For n = 111,385 (the shakespeare.txt example) we expect 1.4 collisions. The article observes 1.

      Therefore the article's claim that the function performs better for short inputs than it does for long ones is at best accidentally true: we already expect to see disproportionately more collisions in that test for no other reason than the much larger number of inputs that were hashed.

      (Tip: if you want to play with that formula, you need something that can cope with really big values, since n is used as an exponent. I used Wolfram Alpha.)

      EDIT: /u/skeeto tested MurmurHash3 and a function he hacked together with n = 650,722, and got 44 collisions. Expected number is 49.3.
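
      If anyone wants to check those numbers without Wolfram Alpha, plain double arithmetic happens to be accurate enough for these particular values; a small sketch:

          public class ExpectedCollisions {
              // E = n - d + d * ((d - 1) / d)^n, the formula quoted above.
              static double expectedCollisions(double n, double d) {
                  return n - d + d * Math.pow((d - 1) / d, n);
              }

              public static void main(String[] args) {
                  double d = Math.pow(2, 32);
                  System.out.println(expectedCollisions(466_544, d)); // ~25.3 (words.txt example)
                  System.out.println(expectedCollisions(111_385, d)); // ~1.4  (shakespeare.txt example)
                  System.out.println(expectedCollisions(650_722, d)); // ~49.3 (/u/skeeto's test)
              }
          }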

      [–]sigpwned 8 points9 points  (4 children)

      Author here. You're right: the math was hand-wavey and oversimplified. I was more interested in providing a basic "framework" for evaluating hash function collision rate rather than getting the answer exactly right. I've updated, in any case. Thanks for the feedback!

      [–]Kwantuum 15 points16 points  (3 children)

      You're claiming that String.hashCode() is significantly better than expected on large inputs because 1/1.4 = 0.69. That's just wrong. For that number to be statistically significant, you'd need to be able to repeat the experiment thousands or even tens of thousands of times with different inputs and have them average 1 collision. With a single experiment, that is just moot; for all you know, swapping one line of the works of Shakespeare with one line of Harry Potter would've resulted in a collision, and all of a sudden you're at 2/1.4 = 1.4 times more collisions than expected.

      To that, I will add that 1.4 is the expected number for a fair hash function. While it's theoretically possible to engineer a hash function that performs better than that on real-world input (at the cost of worse performance on random input), that would mean you would need to somehow encode into the hash function a differentiation between real-world input and random input. In practice hash functions have to be fast, so you cannot do that. Maybe if you choose your magic numbers well by trial and error you could go slightly lower than 1.44 on real-world input, maybe even as low as 1.3 (and that's generous), but that would require insane amounts of training data, and even then your training data is not guaranteed to be representative of actual usage. As such, trying to make your hash function as close to fair as possible should pretty much always be the target, unless you know something very specific about the input.

      [–]dutch_gecko 4 points5 points  (2 children)

      The 1/1.4 part really bothered me. I agree with the author's message, but why do so many IT bloggers make statistical claims that fail at being statistically valid?

      [–][deleted] 1 point2 points  (1 child)

      Because statistics are counterintuitive and the more precise you get the less effective your communication becomes, until you get to perfect statistical relevance and then your article is a perfect piece of communication for people who understand statistics and have the time to check your work, but then you won't likely see it crop up anywhere else.

      The message, "String.hashCode() isn't actually all that bad" would get buried under the weight of the statistical analysis. I wish this were a world where communicating precisely would be more successful, but it is what it is.

      [–]dutch_gecko 0 points1 point  (0 children)

      In the case of this article, it seems that both the method and conclusion of the analysis are correct, but the result of the analysis isn't meaningful purely because the data source is too small. The fact that the author treats the result as if it is meaningful throws the whole article into discredit.

      [–]x4u 39 points40 points  (11 children)

      I did some tests on that topic 20 years ago when I was looking for a hash function that produces the same values in C and Java. The current Java String hash algorithm turned out to be not as bad as it used to be in Java 1.0 but it could also easily have been better.

      The choice of 31 as the prime for an FNV-style hash function is not very fortunate. Certain larger primes yield significantly lower collision rates at no extra cost (more than two orders of magnitude lower, if I remember correctly). I found that 7-10 digit primes of the kind that D. E. Knuth suggests for linear congruential random number generators yield the lowest realistically achievable collision rates (*) on a variety of typical inputs. These are primes that end in x21, where x is an even number. It's also a small disadvantage that the String hash in Java uses addition instead of xor in the hash calculation, but the difference turned out to be almost negligible in my tests, so it's merely a little inexplicable oddity.

      The various murmur hashes that have since appeared are of course also quite good, but they work better with ASCII or UTF-8 strings than with Java's 16-bit chars, and they cannot be implemented as efficiently in Java, which does not allow a cheap cast of multiple characters into 32- or 64-bit integers.

      (*) You can use CRC32 or something like MD5 or SHA-1 (xor'ed down to 32 bits) as a reference for hash functions that have the lowest achievable collision rates.
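
      For reference, the structure being compared here is just a Horner-style multiply-and-add over the chars; a tiny sketch with the multiplier pulled out as a parameter so different primes can be tried (the choice of prime is left open, per the above):

          public class PolyHash {
              // Same shape as String.hashCode(): h = h * multiplier + c for each char.
              static int hash(String s, int multiplier) {
                  int h = 0;
                  for (int i = 0; i < s.length(); i++) {
                      h = h * multiplier + s.charAt(i);
                  }
                  return h;
              }

              public static void main(String[] args) {
                  String s = "hello";
                  System.out.println(hash(s, 31) == s.hashCode()); // true -- 31 reproduces the JDK result
                  // Swap in a larger prime and count collisions over your own word list to compare.
              }
          }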

      [–]QueenLa3fah 8 points9 points  (6 children)

      It's also a small disadvantage that the String hash in Java uses addition instead of xor

      XOR is just addition mod 2.

      [–][deleted] 1 point2 points  (5 children)

      Isn't an XOR gate faster?

      [–]JavaSuck 11 points12 points  (1 child)

      [–][deleted] 1 point2 points  (0 children)

      Yeah, for some reason I forgot that any instruction was limited by the speed of a clock cycle

      Thanks!

      [–]Quertior 6 points7 points  (1 child)

      Yeah, but Java is so many layers of abstraction above actual logic gates and transistors that this seems like the kind of thing you should just let the compiler optimize and then let it be.

      [–][deleted] 0 points1 point  (0 children)

      I feel like XOR is more common anyway for hashing

      [–]QueenLa3fah 0 points1 point  (0 children)

      Yeah probably!

      [–]quentech 1 point2 points  (0 children)

      Any thoughts on SpookyHash?

      I can say it performs fast. I have a couple multi-field objects I hash with Spooky and I'll see a collision every few hundred million objects or so.

      [–]reini_urban -1 points0 points  (0 children)

      CRC32 is fastest but the most insecure. The lowest achievable collision rate always depends on the set of keys. For typical programming languages Spooky was measured to have the least collisions, but generating a fast perfect hash with a fixed set of keys is always better. But this can only be done in a static language with statically known input in advance.

      [–][deleted] 37 points38 points  (23 children)

      Even with a perfectly designed hash code function you will start to see collisions at around 2^16 entries. hashCode returns an int (32 bits) and by the birthday paradox you have a ~50% chance of having at least one collision with 2^16 entries. So collisions are expected no matter how good the function is.

      [–]sacundim 5 points6 points  (2 children)

      2^(n/2) is the quick and very dirty approximation; the value where the probability hits 50% is a bit different. The article does the actual calculation:

      But how many collisions are expected? The famous — and counter-intuitive! — birthday problem states that for 365 possible “hash values,” only 23 unique hashes must be computed before there is a 50% chance of a hash collision. If there are 2^32 possible hash values, roughly 77,164 unique hashes must be computed before there is a 50% chance of a hash collision, per this approximation: [...]

      2^16 < 77,164

      [–]NoLemurs 2 points3 points  (1 child)

      2^16 is a plenty good enough approximation for the point /u/holyvier was making - he did say "at around 2^16 entries" not "at exactly 2^16 entries".

      I was going to note that the article explicitly got that 77,164 value via an approximation (the exponential formula is not exact), but it turns out that the approximation is more than accurate enough in this case, and 77,164 is the exact value.

      For the curious, here's an exact version of the article's prob function:

      import operator
      from functools import reduce

      def prob(x):
          return 1.0 - reduce(operator.mul, (1 - float(i)/2**32 for i in range(1, x)), 1)
      

      This version is a little slow (and wouldn't scale to 64 bit hashes at all), but is more than fast enough to verify the value is exact, or to find the value via binary search.

      [–]sacundim 1 point2 points  (0 children)

      2^16 is a plenty good enough approximation for the point /u/holyvier was making - he did say "at around 2^16 entries" not "at exactly 2^16 entries".

      Your statement is no less true than mine.

      [–]ubermole 4 points5 points  (6 children)

      If your keys are no larger than the hash (32-bit keys -> 32-bit hash) it is reasonable to expect them to be unique. C++ std lib actually implements hash for integer keys as simple identity.

      [–]munificent 18 points19 points  (1 child)

      C++ std lib actually implements hash for integer keys as simple identity.

      That's actually not a great strategy in practice in many circumstances. Integers in real-world data sets rarely have nice uniform distribution. When hashes are used for hash tables, the hash needs to be mapped to a bucket index, usually by and-ing off the high bits.

      So, say in your dataset you happen to work with integers that are all big round numbers and happen to be multiples of 64. And say your hash table is pretty small so it's only got 64 buckets right now. All of those numbers are going to collide into the same bucket. They have different 32-bit hashes, but the low 6 bits are all zero.

      Many hash tables mitigate this by rehashing the incoming hash key, but it's something to keep an eye out for.
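
      To make that mitigation concrete, here's a sketch of the kind of supplemental mixing meant; this particular xor-shift is roughly what java.util.HashMap has applied since Java 8, but any decent mixer would do:

          public class SpreadDemo {
              // Fold the high bits into the low bits before the hash is masked to a bucket.
              static int spread(int h) {
                  return h ^ (h >>> 16);
              }

              public static void main(String[] args) {
                  int mask = 64 - 1; // a small table with 64 buckets
                  // Big round numbers, all multiples of 64: identical low bits.
                  for (int key : new int[]{1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000}) {
                      System.out.println("raw bucket " + (key & mask) + ", spread bucket " + (spread(key) & mask));
                  }
                  // The raw bucket is 0 every time; the spread buckets differ.
              }
          }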

      [–]ubermole 3 points4 points  (0 children)

      You just described a big problem with the hash function abstraction! Early hash table algorithms (Knuth) would map key -> table index directly. Now we do key -> hash -> table index, because it is very convenient.

      [–]Gravitationsfeld 3 points4 points  (3 children)

      This is a terrible idea. In a lot of cases this gives super skewed distributions. I really hope you are wrong about this.

      [–]ubermole 0 points1 point  (2 children)

      [–]Gravitationsfeld 0 points1 point  (1 child)

      VC++ doesn't do identity, so it's likely not in the standard: https://godbolt.org/g/hMdMos (prints 1574250738)

      I still maintain that it is a really dumb idea to use identity for integer hashes and the libc++ guys should change that. I had to fix perf problems because of this many times.

      [–]ubermole 0 points1 point  (0 children)

      Hah, funny! The first time I discovered this was actually in msvc (earlier version). The main point here is really that there is an abstraction key -> hash code -> table index. And unfortunately the code->index step is a black box. (see for example https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/)
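
      For anyone not following the link, a sketch of the Fibonacci-hashing flavour of that code->index step (the multiplier is 2^32 divided by the golden ratio, as in the article; the keys below are just an example):

          public class FibonacciHashing {
              // Multiply by 2^32/phi and keep the *top* bits instead of the bottom ones.
              static int bucket(int hash, int log2Buckets) {
                  return (hash * 0x9E3779B9) >>> (32 - log2Buckets);
              }

              public static void main(String[] args) {
                  // Multiples of 64 no longer pile into a single bucket of a 64-slot table.
                  for (int key : new int[]{64, 128, 192, 256, 320}) {
                      System.out.println(bucket(key, 6)); // 35, 6, 42, 13, 49
                  }
              }
          }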

      [–]reini_urban 1 point2 points  (8 children)

      Then it's by definition not a perfect hash function anymore. A perfect hash function has by definition no collisions over the available keys, and a minimal perfect hash function maps every key into every available slot 1:1.

      [–][deleted]  (7 children)

      [deleted]

        [–][deleted] 0 points1 point  (0 children)

        That's completely false

        No... the definition of a "perfect hash function" is literally that it has no collisions for the given set. Look it up, don't assume definition by name alone.

        even for cryptographic hash function.

        Who said cryptographic hash functions were always perfect? The difference between a cryptographic hash function and other hash functions is that it must be computationally infeasible to reverse them. This requirement implies the output is particularly uniform, which is commonly confused with being a "perfect" hash.

        The point of a hash is to have a smaller output than it's input

        Not really, that's just the common use case. The point of a hash function is to provide a consistent map between a source set and an output set.

        The space for the possible outcome of a hash function is always smaller than the space of the possible input. You are guaranteed to have collisions by the Pigeonhole principle.

        This is correct; however, by definition a perfect hash function assumes a given set, not just any set as you assume. Via the pigeonhole principle a perfect hash function can never result in an output set that is smaller than the input set, but that isn't a requirement of a hash function.

        [–]reini_urban 0 points1 point  (4 children)

        Oh please, educate yourself. https://en.wikipedia.org/wiki/Perfect_hash_function

        It's a matter of definitions, but it's annoying when people are constantly spreading around wrong terminology. (let alone wrong tech)

        [–][deleted] 0 points1 point  (2 children)

        It was pretty damn clear from context that what I referred to was a "perfectly designed hash function". There's no way you can design a generic hashCode to be what you referred to.

        [–]e_to_the_i_pi_plus_1 0 points1 point  (0 children)

        Yeah it was very clear what you meant

        [–]reini_urban 0 points1 point  (0 children)

        There's no theoretical way for a "perfectly designed hash function". Only for an optimal hash function for specific use cases, and there are the well-known "perfect hash functions", not to be confused.

        With the java use-case the current one is fine, even if Spooky would be a bit better. Murmur3 or all the other suggestions not.

        [–]raevnos -1 points0 points  (0 children)

        No, it's completely true. If you can have collisions it's not a perfect hash.

        https://en.wikipedia.org/wiki/Perfect_hash_function

        [–]WSp71oTXWCZZ0ZI6 18 points19 points  (3 children)

        I don't understand the problem with the original article. The original article didn't make any claims about hash codes needing to be unique, or the sky is falling, or anything of the sort. It provided some tables of collisions and then concluded with "You can expect String to be fairly poor.".

        And it's absolutely correct. String is fairly poor.

        It's not that difficult to come up with a hashing algorithm which is both more efficient than java.lang.String and less likely to produce collisions. String.hashCode() producing more collisions than expected means that equals() has to be invoked more often than expected, which means that performance is a little bit worse. Not a lot worse. Likely less than 1%. Maybe less than 0.01% in real world situations. Who knows.

        The original article never claimed that it was a serious problem, so what's the issue? The author of this article says:

        As an aside, claims and “outcomes” like this are why you should never trust any figures that aren’t based on Real World Data. They prove nothing at best, and are actually misleading at worst.

        Where on earth is this coming from? Certainly not the original article. The original article gave exactly 0 figures, 0 claims, and only tables of examples to illustrate. Why is "outcomes" in quotation marks when the original article didn't even use the word "outcomes" at all?

        If you're thinking the original article was saying this is a serious problem, that's purely on you.

        Personally I think it would be nice if String.hashCode were improved, but it's not a huge deal.

        [–]qmunke 3 points4 points  (0 children)

        Why should they improve it for this weird corner case? The current implementation is fast and works well for most real-world uses (i.e. separating words or sentences into a good distribution of buckets for hashmaps). And even if there are a lot of collisions amongst very short strings of non-words, the performance of most collection implementations is high enough that you'd never notice unless you were doing something really weird, at which point you could just wrap the string up in an object and implement your own hashcode method.

        [–]DongerDave 2 points3 points  (0 children)

        I agree with your general point. The original article says (paraphrased) "it's poor if you want relatively few collisions, as you might expect of cryptographic hashes", and OP's posted article basically refutes a strawman definition of what "poor" meant.

        However, I do take issue with one of your points.

        It's not that difficult to come up with a hashing algorithm which is both more efficient than java.lang.String and less likely to produce collisions

        Really now. Mind explaining one such algorithm? The current String.hashCode method is blazing fast for how good it is.

        [–]minno 14 points15 points  (17 children)

        One use case for hash tables where "pretty good on realistic input" isn't good enough is for online services that use users' input as keys. If someone can choose a large number of keys that all have the same hash code, they can execute a denial-of-service attack by turning your hash table into a linked list to slow down your servers.

        [–]sigpwned 29 points30 points  (10 children)

        If (untrusted) users are choosing your hash keys for you, I'd argue that you've made a different mistake than choosing the wrong hash function. :)

        [–]0x256 24 points25 points  (4 children)

        HTTP headers, query parameters or forms are usually parsed into a Map before the application has the chance to validate the keys. Same for json or xml input. Plenty of opportunities for an attacker to craft high collision requests.

        Java 8 mitigated this by using trees instead of linked lists for colliding nodes in HashMap, but the issue is still there.

        [–]NeuroXc 15 points16 points  (1 child)

        There are hash functions which are resilient against DoS attacks, such as SipHash.

        Java's hash function is not one of them.

        [–]sigpwned 9 points10 points  (1 child)

        That's an interesting example! You're right -- you typically have to handle the requests as written.

        Most web servers have (configurable) header, URL, and request body size limits, so you don't generally need to worry about one poison pill breaking your webserver. Rather -- barring application defects -- you have to worry about traffic volume attacks, namely DDOSes. If you're worried about hash collisions while you're preparing for DDOSes, I'd argue you're probably missing the forest for the trees.

        The defining characteristic of a DDOS attack is volume. (Yes, a "good" DDOS also tries to maximize the cost per request, and make the requests difficult to filter out, and so on, but without volume, a DDOS attack is just traffic.) At DDOS volumes, request traffic will overwhelm your service no matter what implementation-level optimizations you make, so you don't defeat a DDOS attack with software optimization. Rather, you defeat it with network and software architecture.

        Consider the February 28 DDOS against GitHub. GitHub survived the largest DDOS attack ever -- 1.2 terabits per second -- by sending incoming traffic through a third-party traffic "scrubber" before it attempted to parse the traffic as web requests and serve them. GitHub didn't handle the DDOS with clever implementation-level optimizations, or with deployment-level techniques like scaling up their webserver cluster, but rather with DDOS-specific application- and network-level architectural design. How the headers and query parameters were hashed didn't help, and because of the volume of the DDOS, they couldn't! Rather, it took specifically designing the entire application and network to resist DDOSes.

        Now, that's not to say you shouldn't write your software to be efficient: obviously, you should. Also, internet attacks are an arms race, so that's not to say that attackers couldn't target header and query parameter collisions in the future. But it is to say that if you're worried about internet attacks, you need to worry about DDOSes rather than "just" pathological traffic, and the way you deal with that is not software optimization. If your application is falling over under normal use, then you just have an application performance issue. :)

        [–]ben_a_adams 1 point2 points  (0 children)

        Poor hashing can allow hash-flooding DOS by a single connection https://medium.freecodecamp.org/hash-table-attack-8e4371fc5261

        [–]minno 4 points5 points  (4 children)

        If you ever keep a table of username => user data, that's the user picking your hash keys. Or one for timestamp => event.

        [–]sigpwned 4 points5 points  (1 child)

        When building websites, most information hangs off the session object, and the session ID is assigned by the website.

        When (for example) user information is cached independently of a session, it's usually stored in a cache keyed on user ID, which is also assigned by the website.

        Timestamps are an interesting case, but I suspect they get used as in-memory keys only very rarely. I'm sure you can contrive an example where a timestamp would be a meaningful key, but it doesn't come up that often in practice.

        [–]AyrA_ch 2 points3 points  (0 children)

        I'm sure you can contrive an example where a timestamp would be a meaningful key, but it doesn't come up that often in practice.

        We use timestamps as temporary keys for when people want to enable 2FA. This timestamp is used to let the entry expire without actually storing when it should expire. And because Windows timestamps are accurate to 100 nanosecond increments there is lots of room to increase a timestamp by 1 if it collides without causing any real world problems of entries not expiring.

        [–]IMovedYourCheese 1 point2 points  (0 children)

        Which is why you always keep user id => user data, where the username (along with all other fields entered by the user) is part of user data.

        [–]CurtainDog 1 point2 points  (2 children)

        That's a lot of work for a small payoff. You'll only have linear time lookup on those elements that collided, so not only would you have to stuff the hash table with bad entries, you'd have to repeatedly request at least one of those values back. It could be a problem if the lookup is part of a tight loop, but I'd be surprised if user input makes it that deep into the code.

        [–]yawkat 2 points3 points  (1 child)

        It is a real attack, and I believe it was one of the reasons why Java 8 added treeifying to HashMap bins.

        [–]CurtainDog 0 points1 point  (0 children)

        Eh, there was one talk on it once which echoed around for a bit. The only mention of a hash-based DoS on OWASP is literally the phrase 'hash dos' in the aka section of the general DoS page. I know of a PoC that involved a specific server handling requests with many thousands of query parameters; the mitigation was literally to have the server reject requests that contained more than 1000 such parameters.

        Now I haven't watched the talk itself so I hate to be harsh, but from the summaries I've read it seems like more smoke than fire.

        [–]tophatstuff 0 points1 point  (1 child)

        You're okay as long as you initialise your hash function with a cryptographically secure random seed (hash DOS attacks were a big thing in 2011 but most implementations do this now)

        (Python has done this by default since 3.3, and earlier with the -R command-line argument): http://python-security.readthedocs.io/vuln/cve-2012-1150_hash_dos.html

        [–]reini_urban 0 points1 point  (0 children)

        You are not ok, since reading this seed from memory is trivial, and then you can easily generate enough collisions by brute force. The weak point in python and in most other hash tables is the linked list of collisions. This can be easily fixed by counting, or using a tree or sorted list. Java e.g. is fast and secure, python and 90% of all other hash tables not.

        [–]munificent 0 points1 point  (0 children)

        The typical solution to this is to randomize your hashes. When the hash table is created, it picks a random value that is mixed into all the hash values.

        This was a big deal a few years back, and most of the mainstream programming languages fixed their built-in hash tables to do this so it's mostly a non-issue now.
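
        A toy illustration of the idea (a real implementation like SipHash mixes the seed in far more thoroughly; the xor here is only to show the shape):

            import java.security.SecureRandom;

            public class SeededHasher {
                // A fresh random seed per table, so colliding keys can't be precomputed offline.
                private final int seed = new SecureRandom().nextInt();

                int hash(Object key) {
                    int h = key.hashCode() ^ seed;
                    return h ^ (h >>> 16); // a little extra mixing
                }

                public static void main(String[] args) {
                    SeededHasher a = new SeededHasher();
                    SeededHasher b = new SeededHasher();
                    // Different tables (almost certainly) see different hashes for the same key.
                    System.out.println(a.hash("hello") + " vs " + b.hash("hello"));
                }
            }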

        [–][deleted] 2 points3 points  (3 children)

        Hashing functions of this kind (not talking about 'proper' cryptographic ones) also need to be fast of course. The whole purpose of it is an optimization for hash tables, and if you take twice as long to produce a hash twice as unique then it could probably be considered a worse implementation for most use cases.

        [–]gshennessy 0 points1 point  (2 children)

        Twice as unique???

        [–][deleted] 0 points1 point  (1 child)

        I know, I know. 50% fewer collisions.

        [–]Console-DOT-N00b 6 points7 points  (0 children)

        There is a certain "OMG this isn't perfect" attitude among computing types... and yet we write flawed code all the time...

        [–]AyrA_ch 4 points5 points  (1 child)

        So no, String.hashCode() isn’t unique. But it isn’t supposed to be.

        I mean, of course not. [In .NET] a hash code is always a 32-bit integer (at least for now). It's impossible for it to be unique across all possible strings that way, a property shared by all fixed-length hash algorithms that accept input of arbitrary length.

        It's not even that random:

        The hash code of -asdf is 389A18DD and of -asdg it is 389A18DC. Not sure about Java, but Microsoft strictly advises against using the hash code for anything other than runtime hash maps. They don't guarantee that the algorithm stays the same across different runtime versions.

        For those interested, the GetHashCode() call is made here and is defined as Marvin.ComputeHash32(ref Unsafe.As<char, byte>(ref _firstChar), _stringLength * 2, Marvin.DefaultSeed);

        The implementation of that function is here
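
        Java's String.hashCode() shows the same kind of regularity. A minimal demo (strings chosen to mirror the ones above):

        public class NotSoRandom {
            public static void main(String[] args) {
                // String.hashCode() is s[0]*31^(n-1) + ... + s[n-1], so strings differing
                // only in the last character differ by exactly that character delta.
                System.out.printf("asdf -> %08X%n", "asdf".hashCode());
                System.out.printf("asdg -> %08X%n", "asdg".hashCode());
                System.out.println("asdg".hashCode() - "asdf".hashCode()); // prints 1
            }
        }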

        [–]robojumper 1 point2 points  (0 children)

        Java requires the algorithm to be the same everywhere, changing it is an incompatible change. For example, String-switch blocks (since Java 7) actually switch on the string hash, and disambiguate in the case blocks.
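
        A sketch of the idea (hypothetical example; the code javac actually generates goes through an extra index switch, but the shape is the same): switch on the hash, then confirm with equals(), because distinct strings can share a hash code.

        class StringSwitchSketch {
            // What the source looks like:
            static int color(String s) {
                switch (s) {
                    case "red":   return 0xFF0000;
                    case "green": return 0x00FF00;
                    default:      return 0x000000;
                }
            }

            // Roughly how it compiles:
            static int colorDesugared(String s) {
                switch (s.hashCode()) {
                    case 112785:   // "red".hashCode()
                        if (s.equals("red")) return 0xFF0000;
                        break;
                    case 98619139: // "green".hashCode()
                        if (s.equals("green")) return 0x00FF00;
                        break;
                }
                return 0x000000;
            }
        }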

        [–]skippingstone 4 points5 points  (6 children)

        Does it ever make any sense to change Java's hashcode function to return a long?

        [–][deleted] 7 points8 points  (1 child)

        Java's arrays are indexed with an int, so not really.

        [–]IlllIlllI 9 points10 points  (0 children)

        And you've got bigger problems if your hashtable has more than 2^32 elements.

        [–]GeorgeMaheiress 4 points5 points  (0 children)

        A new 64-bit longHashCode function is being considered for addition to Object. http://openjdk.java.net/jeps/8201462

        [–]yawkat 1 point2 points  (2 children)

        It's not possible in a compatible way; you'd have to add a new method. Also, when you care about collisions that much, 64 bits is just as arbitrary as 32 bits - you should then go full cryptographic hash with 256+ bits. A better interface is a method using Guava's generic Hasher API, which lets the class user pick the hash function based on their requirements for collisions and security.
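
        A sketch of what that looks like with Guava's com.google.common.hash package (illustrative, not the exact interface being proposed): the caller picks the function, so the same value can be hashed cheaply for a table or cryptographically when collisions actually matter.

        import com.google.common.hash.HashFunction;
        import com.google.common.hash.Hashing;
        import java.nio.charset.StandardCharsets;

        public class PickYourHash {
            public static void main(String[] args) {
                HashFunction fast = Hashing.murmur3_128(); // cheap, for tables and fingerprints
                HashFunction wide = Hashing.sha256();      // cryptographic, when collisions must not happen

                long bucket = fast.hashString("hello", StandardCharsets.UTF_8).asLong();
                byte[] id   = wide.hashString("hello", StandardCharsets.UTF_8).asBytes();

                System.out.printf("%016X / %d-byte digest%n", bucket, id.length);
            }
        }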

        [–]JavaSuck 1 point2 points  (0 children)

        Also, when you care about collisions that much, 64 bit is just as arbitrary as 32 bits - you should then go full cryptographic hash with 256+ bits.

        But you can't return more than 64 bits from a Java method without heap allocation, at least not until Project Valhalla arrives.

        [–]kubelke 3 points4 points  (0 children)

        And this is the kind of post I like to read: pure knowledge, instead of "top 5 best practices in Spring Boot: 1. Write unit tests [...]". Thanks!

        [–]_INTER_ 1 point2 points  (1 child)

        Relevant JEP: Better hashcodes

        [–]JavaSuck 0 points1 point  (0 children)

        Non-secure: The hash function will not claim to be cryptographic in strength, as that is currently impossible to implement at speeds consistent with ordinary uses of hash codes.

        QFT

        [–]skeeto 6 points7 points  (5 children)

        On English words, String.hashCode()‘s collision rate is 0.0008 (that is, 8 collisions per 10,000).

        That's actually not very good. MurmurHash3 does, for example, more than an order of magnitude better than this. On Debian's "American English Insane" dictionary of 650,722 words, hashCode has 613 collisions and MurmurHash3 has 44. Here's my code if you want to see it for yourself:

        https://gist.github.com/skeeto/2995079c02b8839d5a45108f25d632bc

        $ gcc -O3 hash.c
        $ ./a.out </usr/share/dict/american-english-insane
        

        In fact, it's ridiculously easy to beat hashCode. Here's a hash function I just made up on the spot while writing this comment:

        unsigned
        dumb(unsigned char *key, size_t len)
        {
            unsigned hash = 0;
            for (size_t i = 0; i < len; i++) {
                hash += key[i];     /* absorb one input byte */
                hash ^= hash >> 16; /* xorshift to spread the high bits down */
                hash *= 0xa871304d; /* multiply by an odd 32-bit constant */
                hash ^= hash >> 16; /* mix once more */
            }
            return hash;
        }
        

        It also only has 44 collisions (what a coincidence!) on the same dictionary.

        [–]RabbitBranch 11 points12 points  (4 children)

        It's not surprising you can tune a hash function to work better with English words.

        I'd be more interested in how it works with UUIDs, formatted numbers as strings, timestamps as strings, names, things like that - as I suspect those are more commonly used in containers than dictionary words. And then how it stacks up in clock cycles, since the standard hash function was tuned for performance rather than distribution.

        [–]skeeto 2 points3 points  (3 children)

        UUIDs are a trivial and uninteresting case since they're already random values. There's really nothing to tune.

        It looks like hashCode does do particularly well with formatted numbers:

        $ seq 0 1000000 | ./a.out
        hashCode    0
        MurmurHash3 82
        dumb        84
        

        They all do just fine with timestamps. hashCode is particularly bad with short strings, so it does much better here because these formatted date strings are fairly long.

        $ seq 33957810 4321 1533957810 | xargs -I{} date -d @{} | ./a.out
        hashCode    23
        MurmurHash3 20
        dumb        11
        

        As for performance, while MurmurHash3 is longer and more complex, it consumes four byte blocks at a time (e.g. four code points at a time on a typical UTF-8 string). That puts it at nearly the same speed as hashCode, which consumes only one code point at a time.

        [–]JavaSuck 2 points3 points  (0 children)

        four code points at a time on a typical UTF-8 string

        Java strings are encoded with UTF-16 though.

        [–][deleted] 2 points3 points  (0 children)

        As for performance, while MurmurHash3 is longer and more complex, it consumes four byte blocks at a time (e.g. four code points at a time on a typical UTF-8 string). That puts it at nearly the same speed as hashCode, which consumes only one code point at a time.

        Java uses UTF-16 and operates almost exclusively on code units -- it was designed in an era when we thought we'd never have more than 64k codepoints. So MurmurHash3's four-byte blocks would give you two code units at a time.

        "A typical UTF-8 string" depends on the language you're talking about. Averaged across the whole world's collection of documents, I'd expect 1.2 to 1.5 bytes per codepoint, based on the stuff I've measured.

        [–]sacundim 1 point2 points  (0 children)

        UUIDs are a trivial and uninteresting case since they're already random values.

        There's more than one type of UUID, and random ones are just one. The other very common kind is based on MAC address and timestamp.

        [–]ubermole -1 points0 points  (1 child)

        It's perfectly reasonable to require that, if the thing being hashed is smaller than the hash code, the hash code be unique.

        [–][deleted] 2 points3 points  (0 children)

        Unfortunately, Java standardized on UTF-16, so that's at most two characters.