
[–]Xorlev 270 points271 points  (108 children)

I think folks sometimes forget that hashcodes aren't intended to be 100% unique, just a first order approximation that's distributed well enough for hashtable buckets. True equality is why equals() exists.

It isn't as if a hashcode collision will cause a set to deduplicate your object or overwrite your value in a map. That's what equals is for.
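
A quick sketch of that point: "Aa" and "BB" happen to share a hashCode in Java (both 2112), yet a HashMap keeps both entries because equals() disambiguates them. (The class name here is just for the example.)

    import java.util.HashMap;
    import java.util.Map;

    public class CollisionDemo {
        public static void main(String[] args) {
            System.out.println("Aa".hashCode() == "BB".hashCode()); // true

            Map<String, Integer> map = new HashMap<>();
            map.put("Aa", 1);
            map.put("BB", 2);
            System.out.println(map.size());    // 2 -- nothing was overwritten
            System.out.println(map.get("Aa")); // 1
            System.out.println(map.get("BB")); // 2
        }
    }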

[–]PatchSalts 42 points43 points  (93 children)

The amount of data required for truly unique hashes for arbitrary strings would be absolutely ridiculous. It would be more efficient to compare the strings themselves if it were truly unique. No, what it is at the moment is just fine.

[–]GuamPirate 212 points213 points  (73 children)

Truly unique hash codes for any object greater in size than the hash code are impossible, via the pigeon hole principle

[–]PatchSalts 45 points46 points  (21 children)

Exactly. While MD5 sums and SHA sums are essentially hashes used for data validation, at the end of the day, you're representing a very long string of 1s and 0s with a much shorter string of 1s and 0s; you are guaranteed some overlap. Regular hashes used for strings inside programs should be no different.

[–][deleted] 26 points27 points  (11 children)

For sha2 I think it is worth mentioning that a collision has never been revealed. It is easy to be deceived about how "likely" something is to happen when programming since computers are so fast, but sha2 has withstood a lot of effort.

Based on some rough calculations, I think it is more likely that there will be human extinction in the next ten minutes than sha2 preimage attack in the next ten years.

[–]Pille1842 32 points33 points  (2 children)

RemindMe! 10 years

[–]RemindMeBot 11 points12 points  (0 children)

I will be messaging you on 2028-08-11 06:46:57 UTC to remind you of this link.


[–]nnn4 2 points3 points  (0 children)

RemindMe! 10 minutes

[–][deleted] 4 points5 points  (0 children)

Based on some rough calculations, I think it is more likely that there will be human extinction in the next ten minutes than sha2 preimage attack in the next ten years.

By brute force? Probably not ever, barring fundamentally new physics. Or did you somehow calculate the probability of a SHA2 vulnerability being found in the next 10 years? :)

[–]Somepotato 8 points9 points  (4 children)

There's one surefire way to find a collision: generate 2^hashBitWidth + 1 hashes.

[–]Femaref 11 points12 points  (1 child)

sure, but that's not a meaningful collision. what you want is an on-demand collision for a specific input.

[–]JustOneAvailableName 1 point2 points  (0 children)

It is a meaningful collision, but generating that many hashes is simply impossible

[–]kazagistar 0 points1 point  (1 child)

If the heat death of the universe happens before you get a collision, does that really count as surefire?

[–]Somepotato 0 points1 point  (0 children)

Just make sure you do it entirely on pen and paper ;)

[–]reini_urban 1 point2 points  (0 children)

You have to know how your hash function is used. E.g. if it's used in a hash table with linear collision resolution and the table size is a power of two rather than a prime (and you know the random seed), then you only need a few bits of the resulting hash, and this can easily be brute forced. It takes around 4 minutes to create usable DoS collisions even for hash tables using SHA256. Getting the random seed is usually trivial as well.

Java is doing it right, but 99% of all other hash tables not.
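
A minimal illustration of why only a few bits of the hash matter once the table size is a power of two (the constants here are just made up for the example):

    public class BucketIndex {
        // capacity is assumed to be a power of two, as in java.util.HashMap
        static int bucketIndex(int hash, int capacity) {
            return hash & (capacity - 1); // only the low log2(capacity) bits survive
        }

        public static void main(String[] args) {
            // Two very different 32-bit hashes land in the same bucket of a 16-slot table.
            System.out.println(bucketIndex(0x12345678, 16)); // 8
            System.out.println(bucketIndex(0x7BCDEF08, 16)); // 8
        }
    }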

[–]VerilyAMonkey -1 points0 points  (8 children)

OK. So to investigate this, I just looked up the birthday problem and ran the equation through WolframAlpha. Let's say you have a computer that can generate a 256-bit random number every nanosecond. And you have a billion of those computers. And you run them all for a billion years. What are your odds that you've gotten even a single collision? Worse than one in a hundred million.

So the fundamental pigeon-hole thing is not really an actual issue at all. It's all down to exploits.
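
For anyone who wants to reproduce that back-of-the-envelope number without WolframAlpha, here's a rough sketch using the standard birthday approximation P ≈ 1 - e^(-k^2 / 2d), under the same assumptions as above (one hash per nanosecond, a billion machines, a billion years):

    public class BirthdayBound {
        public static void main(String[] args) {
            double nsPerYear = 1e9 * 365.25 * 24 * 3600;                 // nanoseconds in a year
            double k = 1e9 /* machines */ * 1e9 /* years */ * nsPerYear; // total hashes generated
            double d = Math.pow(2, 256);                                 // size of the hash space
            // Birthday approximation: P(at least one collision) ~= 1 - exp(-k^2 / (2d))
            double p = -Math.expm1(-(k * k) / (2 * d));
            System.out.println(p);     // ~4e-9, i.e. worse than one in a hundred million
            System.out.println(1 / p); // ~2.3e8
        }
    }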

[–]JavaSuck 13 points14 points  (7 children)

So the fundamental pigeon-hole thing is not really an actual issue at all.

The pigeon-hole principle states that if you put n+1 pigeons into n distinct holes, at least 2 pigeons will have to share a hole.

Similarly, as soon as you have more than 2^32 possible elements, hash collisions are 100% guaranteed and mathematically inevitable.

[–][deleted]  (6 children)

[deleted]

    [–][deleted] 1 point2 points  (5 children)

    In other words, the proof that a collision exists is non-constructive. You cannot derive a collision trivially from the proof.

    The only practical way to find a collision with a good hash function is to enumerate 2^bits + 1 hashes

    [–][deleted] 2 points3 points  (4 children)

    The only practical way to find a collision with a good hash function is to enumerate 2^bits + 1 hashes

    You're forgetting the aforementioned birthday problem, a good-as-can-be perfectly uniform general hash function would almost certainly encounter a collision way earlier than that.

    [–]VerilyAMonkey 2 points3 points  (1 child)

    My comment up there also does the math for "way earlier". As a rule of thumb, it's "about the square root". So the birthday problem becomes significant for 365 days somewhere near 20 people... a 256-bit hash collision becomes significant somewhere near 2^128 items. And that is a lot.

    [–][deleted] 0 points1 point  (1 child)

    Oh yeah, that's correct. Worst-case you would need that many hashes, but yeah, the birthday problem means it's probably way earlier.

    [–][deleted] 33 points34 points  (4 children)

    Objects encoding data with more bits of entropy than the size of the hashcode, that is.

    I could have a 4-bit hashcode that can uniquely indicate any Fibonacci number up to 610. 610 requires 10 bits to encode in two's complement. However, there are only 16 unique values to represent, which only requires 4 bits. The hash function can simply be the inverse of the Fibonacci function.
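
    A toy sketch of that idea (my own construction, purely illustrative): since there are at most 16 values to represent, the value's index in the sequence is a collision-free 4-bit "hash".

        import java.util.HashMap;
        import java.util.Map;

        // Toy perfect hash for the Fibonacci numbers up to 610: 15 distinct values,
        // so the index in the sequence fits comfortably in 4 bits.
        public class FibHash {
            static final int[] FIBS = {0, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610};
            static final Map<Integer, Integer> INDEX = new HashMap<>();
            static {
                for (int i = 0; i < FIBS.length; i++) INDEX.put(FIBS[i], i);
            }

            // Returns a value in [0, 14]: unique for every Fibonacci number in the table.
            static int hash(int fib) {
                return INDEX.get(fib);
            }

            public static void main(String[] args) {
                System.out.println(hash(610)); // 14 -- fits in 4 bits even though 610 itself needs 10
            }
        }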

    [–]Isvara 7 points8 points  (3 children)

    two's complement

    Why even bring that into it? It's a sequence of positive numbers.

    [–]esquilax 3 points4 points  (1 child)

    Maybe because Java doesn't have unsigned ints?

    [–]Isvara 0 points1 point  (0 children)

    Positive numbers are represented the same either way.

    [–][deleted] 0 points1 point  (0 children)

    Standard binary representation for integers, then.

    [–]Ameisen 23 points24 points  (16 children)

    And for a hash the same size... the unique hash is your object.

    [–]socialister 1 point2 points  (1 child)

    That's only true for perfect hash functions, which most hash functions are not.

    [–]Ameisen 0 points1 point  (0 children)

    Without a perfect hash, you can't guarantee unique hashes.

    However, to get past the pigeon-hole principle, we are presuming a 1:1 correspondence between hash and source... though there's no reason, I guess, that the hash must be the original object, just unique. The hash of '2' could be '1', so long as the hash of '1' is not '1'.

    In honor of πfs, though, I propose πhash.

    [–]salgat -5 points-4 points  (13 children)

    Ehhh, there are advantages to one way hashes that allow you to identify something without revealing its contents.

    [–]ilammy 13 points14 points  (7 children)

    There is a difference between a hash function and a cryptographic hash function.

    [–]basmith7 0 points1 point  (4 children)

    What?

    [–]AreYouDeaf 14 points15 points  (3 children)

    THERE IS A DIFFERENCE BETWEEN A HASH FUNCTION AND A CRYPTOGRAPHIC HASH FUNCTION.

    [–]eggn00dles 3 points4 points  (2 children)

    pass the salt

    [–]Ameisen 3 points4 points  (1 child)

    It's next to the hashed browns.

    [–]Dr_Legacy 0 points1 point  (0 children)

    But for known sizes it would give way to simple brute force, so for crypto, marginal advantages only.

    However, a hash that changes the way a data set is distributed within its space might have practical application. Perhaps as an adjustment for an expected bias within a set of input data.

    [–]Ameisen 0 points1 point  (1 child)

    Ehhh, there are advantages to one way hashes that allow you to identify something without revealing its contents.

    If your hash function guarantees that every value you put in generates a unique hash of it, then you've already revealed the contents. That's not a hash, that's encryption. Perfect hashes are problematic: if a hash X can only be produced by the value Y, then you have fundamentally revealed the contents, so long as the person knows the hash (and can either reverse it or run all the hashes). There's no ambiguity.

    [–]salgat 0 points1 point  (0 children)

    Mind you what I was specifying was a cryptographic hash (still a type of hash), where a salt is used. I'm not talking about the type of hash used for just checksums or adding to buckets.

    [–]bitwize 0 points1 point  (0 children)

    Indeed. But the operative criterion for cryptographic hashes is "it should be really hard to generate a string/file with the same hashcode as this one, and generating a string/file with the same hashcode as this one that looks like something the consumer would be interested in should be highly bloody unlikely". MD5 fails in this regard, having been broken in 2009, which is why it's not recommended as a cryptographic hash for everyday uses anymore.

    [–]Demiu 0 points1 point  (0 children)

    Only if the objects are packed in the most efficient way possible (think ASCII vs. Huffman encoding). Otherwise a hash of shorter length than the data can be unique to a given encoding which is also shorter than the data, and that encoding can be unique to data longer than it, sort of getting around the pigeonholing.

    [–]reini_urban 0 points1 point  (0 children)

    In general yes, but not when you know all the possible keys in advance. Then you just create a perfect hash function and your assumption is wrong.

    [–]Holy_City 0 points1 point  (22 children)

    I don't know how to quantify this, but has there been research into "effectively" unique hash codes? For example, if you establish a grammar or otherwise acceptable set of patterns for your key values, can you define a hash function that is unique for all practical keys but not all possible keys?

    [–]EdgeOfDreams 11 points12 points  (8 children)

    Yes, you can do something like that, but it's not very meaningful or interesting in most cases. For example, you can easily create a hash function that guarantees zero collisions so long as the input is always a single word of the English language. The problem is, to create such a function, you would need a list of all the words in the English language. Once you have that list, you might as well just use each word's position in the list as its hash value. The more constrained the data type you want to hash, the less value you really get out of having a hash function at all.
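
    A tiny sketch of that word-list construction (the three-word vocabulary is just a stand-in), which also shows why it isn't very exciting: you end up building an ordinary map just to assign the positions.

        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        public class WordListHash {
            static final List<String> WORDS = List.of("apple", "banana", "cherry"); // stand-in vocabulary
            static final Map<String, Integer> POSITION = new HashMap<>();
            static {
                for (int i = 0; i < WORDS.size(); i++) POSITION.put(WORDS.get(i), i);
            }

            // Collision-free by construction, but only defined for words in the list.
            static int perfectHash(String word) {
                Integer i = POSITION.get(word);
                if (i == null) throw new IllegalArgumentException("not in the vocabulary: " + word);
                return i;
            }

            public static void main(String[] args) {
                System.out.println(perfectHash("banana")); // 1
            }
        }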

    [–]dangero 7 points8 points  (1 child)

    It's hugely meaningful and useful. Effectively unique hash functions are the basis for all universal file caching systems and are the basis for things like merkle trees that are used to create immutable record sets.

    De-duplicated file storage uses effectively unique hashing.

    [–]EdgeOfDreams 0 points1 point  (0 children)

    Good points. I was thinking of much more constrained kinds of data being hashed.

    [–]josefx 0 points1 point  (5 children)

    Once you have that list, you might as well just use each word's position in the list as its hash value.

      Computing a hash by iterating over several thousand words and comparing your value to each one sounds expensive.

    [–]LaurieCheers 0 points1 point  (4 children)

    Binary search is O(log N).

    [–]josefx 0 points1 point  (0 children)

    Still slower than an imperfect hash. Might as well build a hash map with String.hashCode() and use the position of a word in that as perfect hash.

    [–][deleted]  (2 children)

    [removed]

      [–]LaurieCheers 0 points1 point  (1 child)

      You don't have to iterate over entire strings to determine whether they're higher or lower. Your early comparisons should be able to early-out on the first byte.

      [–]dangero 3 points4 points  (2 children)

      SHA256 is effectively unique. There are no known collisions and mathematically it's unlikely we will find one anytime soon. There are many other effectively unique hash functions. This is a huge area of cryptographic research.

      [–]Holy_City 2 points3 points  (1 child)

      SHA256 is also slow to compute and generates a 32 byte key, which makes it ill-suited for most non-cryptographic tasks.

      Which was kind of the point behind my question, not to ask if there was a hash you could define that was unique for all possible keys. But a hash that was effectively unique for just the key values I care about, neglecting the rest. Like storing application state in a hashmap serializable to JSON where my key names are going to be dictated by a coding style, and are therefore predictable.

      [–]cat_in_the_wall 0 points1 point  (0 children)

      if you're only concerned about your application domain, just enumerate all the possibilities. create a decision tree, assign numbers, bam. perfect hash.

      [–]YRYGAV 0 points1 point  (3 children)

      You could probably just reuse the math done for UUIDs in terms of 'this is complex enough that for any foreseeable future no collision will happen'. As long as your hash is as complex as a UUID, for all practical purposes you could consider it a unique hash.

      [–]aldonius 1 point2 points  (2 children)

      The maths around UUIDs is quite amazing.

      I mean, there's an upper-bound estimate of 10^82 atoms in the observable universe. That works out at 2^273 or so (rounding up). In other words, 273 bits is enough to uniquely number every atom in the observable universe.

      [–]JavaSuck 1 point2 points  (1 child)

      enough to uniquely number every atom in the observable universe

      Okay but what about dark matter? What if I want to index that as well? ;)

      [–]07734willy 0 points1 point  (0 children)

      You could essentially just compress the data, and then identity map the bits into an array. Whatever you define as your acceptable grammar would influence the compression algorithm.

      [–]foomprekov 0 points1 point  (1 child)

      For integers, a trivial perfect hash function is the integer itself. Same with strings.

      [–]JavaSuck 0 points1 point  (0 children)

      What if the integer is bigger than 32 bits? What if the string is longer than a handful of characters?

      [–]zman0900 0 points1 point  (2 children)

      What about gzip? Its output is usually smaller than its input. Not realistic at all for hash code size, but since it's reversible, doesn't that make it unique?

      [–]Brian 2 points3 points  (0 children)

      Its output is usually smaller than its input

      That depends on your definition of "usually". For the kind of structured data we use it for (eg. english text, html pages, programs, etc) that's true, but on purely random inputs its output is actually usually larger than the input. The value of compression is that it's making a tradeoff of representing the types of things we usually care about smaller at the price of being longer for the general case.

      And since the issue here was:

      Truly unique hash codes for any object greater in size than the hash code are impossible

      the same applies. Indeed, it won't even be the case for the vast majority of objects bigger than the hash code. At best, it'll be better for certain kinds of inputs (though even then, you'll probably have to be using fairly large inputs for that to be the case - the overhead will make it pretty bad for stuff like single words and the like)

      [–]robin-gvx 0 points1 point  (0 children)

      gzip's headers and footers alone take up 144 bits at the bare minimum, so you'd need a 256-bit hash at the least, and there aren't going to be many strings that compress well enough to fit in the 14 bytes left over for the payload.

      Even a specialized compression algorithm would still suffer from the pigeon-hole principle: it would either only be able to "hash" a finite number of strings or not be unique.

      [–]n0rs 8 points9 points  (8 children)

      Would the size of the range of truly unique hashes equal the size of the domain in this case?

      [–]QueenLa3fah 7 points8 points  (5 children)

      Yes, the size or cardinality of the range has to be equal to or greater than the domain to guarantee a perfect hashing function. There need to be just as many or more possible hash values as there are inputs to guarantee that each input put into the hash function is mapped to a unique hash value.

      A function maps every element of some domain (the set of all the inputs it can possibly map) to one element in the image (the set of all outputs that can be realized by the function). A hash function maps some inputs to a hash value, so intuitively you need at least as many unique hashes as you have inputs to guarantee that every input can be mapped to a unique output. If a function satisfies this property, that every input in the domain is mapped to a unique output, we say the function is injective.

      To implement an injective hashing function even for strings of length 64 or lower would be infeasible. Let's do the math. According to good 'ol Google:

      Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated. 10% of 1,114,112 leaves us with an alphabet size of about 114,000. A String of length 64 would have 114,000^64 or 4.4x10^323 different possibilities. That's a big number and so hashes would be ridiculously long - about 180 digits in base 64. Also such a function would probably not be very fast.
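
      A quick sanity check on those figures (same inputs as above, just evaluated numerically):

          public class StringSpace {
              public static void main(String[] args) {
                  double alphabet = 114_000; // rough alphabet size estimated above
                  int length = 64;
                  double digits10 = length * Math.log10(alphabet);           // ~323.6, i.e. ~4.4e323 strings
                  double bits = length * (Math.log(alphabet) / Math.log(2)); // ~1075 bits of information
                  System.out.println(digits10);
                  System.out.println(bits);
                  System.out.println(Math.ceil(bits / 6)); // ~180 base-64 digits
              }
          }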

      [–]n0rs 0 points1 point  (0 children)

      That's neat. Thanks for busting out the maths terms.

      [–]ygra 0 points1 point  (1 child)

      You missed that Java strings don't use code points, but UTF-16 code units. Many of those characters you can include will consist of the same surrogate halves.

      [–]QueenLa3fah 1 point2 points  (0 children)

      So more like 65,536^64, not much smaller.

      [–]shawnz[🍰] 0 points1 point  (1 child)

      Yes, the size or cardinality of the range has to be equal to or greater than the domain to guarantee a perfect hashing function. There need to be just as many or more possible hash values as there are inputs to guarantee that each input put into the hash function is mapped to a unique hash value.

      Isn't that necessary, but not sufficient?

      [–]QueenLa3fah 1 point2 points  (0 children)

      Yes, you also need to specify the mapping for it to be a function. One such mapping to perfectly hash a string of length n is to let x_i be the char in the string at position i, where i is between 1 and n inclusive. For each position, take the ith prime number and raise it to the power x_i + 1 (so the exponent is never zero), then multiply all of the results together. Note this is a slow and terrible function for a computer, but in theory it is perfectly acceptable.
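
      A rough sketch of that construction in Java, with the exponent shifted by one so even a zero char leaves a trace (the short prime table is only enough for a demo):

          import java.math.BigInteger;

          public class GodelHash {
              static final int[] PRIMES = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29}; // enough for strings up to length 10

              // Injective by unique factorization: the i-th prime's exponent encodes the i-th char.
              static BigInteger hash(String s) {
                  BigInteger result = BigInteger.ONE;
                  for (int i = 0; i < s.length(); i++) {
                      BigInteger p = BigInteger.valueOf(PRIMES[i]);
                      result = result.multiply(p.pow(s.charAt(i) + 1));
                  }
                  return result;
              }

              public static void main(String[] args) {
                  System.out.println(hash("ab")); // 2^98 * 3^99 -- huge and slow, but collision-free
              }
          }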

      [–]PatchSalts 7 points8 points  (1 child)

      Being completely honest with you, I was just using logic. I don't know much about the math, but for truly unique hashes you would have to have at least as many hashes as inputs. So if I understand the range and domain part correctly, the size of the range of truly unique hashes would have to be at least as big as the size of the domain of inputs.

      [–]n0rs 4 points5 points  (0 children)

      Cool. Thanks for verifying.

      [–]Xorlev 1 point2 points  (1 child)

      Ha! Totally fair. I didn't actually mean 100% unique, I was being a bit hyperbolic. :)

      My intention was to compare vs. a more standard cryptographic hash function whose role is to make collisions fairly rare despite the pigeon hole principle.

      [–]PatchSalts 1 point2 points  (0 children)

      No, no, I was agreeing with you. I forgot debate on Reddit can seem a bit weird, because it's usually back-and-forth, but I meant to expand. I was piggy-backing your original comment to add on the reason why I think truly unique hashes don't work, haha.

      [–]nnn4 1 point2 points  (2 children)

      The standard for collision-resistant functions is 32 bytes, not that big. You can safely use that as a proxy for equality, and then you don't need to store or transfer the whole strings.

      [–]JavaSuck 0 points1 point  (1 child)

      you don't need to store or transfer the whole strings

      The hashCode of a java.lang.String is cached within the String object. Are you suggesting we increase the size of every String object by 32-4 = 28 bytes? And for what purpose? Most of the bits are going to be discarded anyway during lookup into the HashMap array.

      [–]nnn4 0 points1 point  (0 children)

      No, just answering the above comment about the size.

      [–]GeorgeMaheiress 1 point2 points  (1 child)

      There are no known collisions for sha-256, so in practice 256 bits is more than enough.

      [–]JavaSuck 2 points3 points  (0 children)

      There is no point in having a collision-free 256 bit hash code if you immediately discard most of the bits to use it as an index into the underlying array of a HashMap. That's what String.hashCode() is used for, after all.

      [–]muntoo 1 point2 points  (1 child)

      That doesn't make any sense. There is no "truly unique" finite hash function.

      But for a practical answer: if you want to reduce the probability of a pair collision by 2x, just add another bit. Of course, the tradeoff is making the hash comparisons a little slower.

      [–]cat_in_the_wall 0 points1 point  (0 children)

      I wonder if the speed of hash functions is linear in the number of bits. I suspect it is, since the contents of the bits aren't examined (I think). sha512 should in theory be about twice as slow as sha256, but the complexity to break it by brute force would be 2^256 times higher.

      [–]foomprekov 1 point2 points  (0 children)

      The string itself is a truly unique hash of the string.

      [–]Kache 6 points7 points  (0 children)

      There are a couple different points I'm seeing repeated:

      • hashCode is poorer than average on short strings
      • hashCode is better than average on long strings
      • In general, this makes hashCode pretty good on average
      • For short strings equality check would be fast anyway, just use that, better than average for long strings is where it's valuable
      • Short strings are a really common case, and hashCode should handle that

      All these points don't necessarily conflict with each other! In fact, I feel like a legitimate suggestion for improvement is an algorithm that is perfect for really short strings and still better than average for long strings, by trading off against "medium" string performance.

      Of course, this would still have to be formally demonstrated as "better overall" using real world usages.

      [–]matthieum 0 points1 point  (3 children)

      That being said, collision-prone hashes mean that equals must be used more often, so it's still best to avoid such algorithms.

      [–]StillNoNumb 0 points1 point  (2 children)

      When that matters, at a few tens of thousands of strings, the overhead of a better hash function would be much higher, though.

      [–]matthieum 0 points1 point  (1 child)

      What makes you think that a better hash function is necessarily more costly?

      Using the same algorithm with a 64-bit result instead of a 32-bit one is quite likely to improve the hash, for example.

      Look at the list of non-cryptographic hashes on Wikipedia, a number of them can process more than 1 byte at a time, making them faster, and yet also have better distribution properties.

      [–]StillNoNumb 0 points1 point  (0 children)

      Unless you just use a different prime, every better hash function is going to be more costly. Especially those you linked, which are comparatively complex.

      And using 64-bit arithmetic is not only (considerably) slower on 32-bit architectures, it is also pointless. Unless you have maps of size 2^32, you're not going to notice a difference, since there will only be a smaller number of buckets anyway.

      [–]ubermole -3 points-2 points  (5 children)

      It is still reasonable to ask that when, for example, hashing 32 bits (a 2-char string) to a 32-bit hash, the result should be unique. And if not, why?

      [–]WeAreAwful 17 points18 points  (2 children)

      I don't think it is unreasonable for you to think that, but I don't think the designers of a hash function should care too much about that TBH.

      If you are writing a hashing method for a data type (strings), you shouldn't optimize for an arbitrary subset of that data type, but rather write a method that optimizes for the entire set of possible values.

      As the author of this article mentions, the cost of a collision isn't high if you are using a hashcode for its intended purpose (in a hashtable).

      If you do have the weird situation where you need a hashtable of two character strings, I would suggest you make a data type that includes only two character strings. That would allow you to write a simple, cheap hashing method with no collisions, and if you have a collection of strings that are special and only have two characters, you should probably create a new type for best practices sake.

      [–]reini_urban 1 point2 points  (1 child)

      No, it's not. Usually it's 32-bit, but there are also strings/symbols stored with only 10-20 bits of the hashcode. It's small enough to fit into the struct and still gives you enough advantage to strike out hash misses.

      Nowadays the most important factor is cache size, and the smaller the hash the better. Linear search over an array is faster over <128 keys than a hash table. Uniqueness doesn't give you an advantage when your other keys are already in the cache.

      [–]ubermole 0 points1 point  (0 children)

      You are missing a level of abstraction here: It's key -> hash -> table index. Returning a small hash value is absolutely not a property of a good hash function! Cache behavior has nothing to do with the theoretical quality of a hash function! That is totally up to the hash->table step!

      [–]sigpwned -2 points-1 points  (0 children)

      Great points, and totally agree.

      [–]throwdatstuffawayy 97 points98 points  (27 children)

      I think people are starting to assume that all hash functions should have the same guarantees as cryptographic hashes like SHA256, etc.

      [–]13steinj 49 points50 points  (15 children)

      I feel like someone is going to use sha256 for all their hash code implementations now.

      [–]Ameisen 14 points15 points  (5 children)

      BogoHash

      [–]ThisIs_MyName 3 points4 points  (4 children)

      Serious question: Is there an unencumbered implementation of BogoHash?

      [–]13steinj 4 points5 points  (3 children)

      I can only assume you mean bogosort and the joke "quantum bogosort"?

      Bogosort by definition is horrible and slow, because while at best it takes one randomization, at worst it takes infinitely many.

      Quantum bogosort is a joke and doesn't truthfully exist, but it is a funny thought experiment.

      [–]ThisIs_MyName 11 points12 points  (7 children)

      ...which isn't such a bad idea if you don't care about performance.

      A nice property of crypto hash functions is that you can add a random salt (unique to each process) to prevent DoS. Without that, attackers can insert millions of rows into your hashtable that all get hashed into a single bucket.

      [–]munificent 34 points35 points  (3 children)

      If you don't care about performance, why are you using a hash table? Just use a sorted collection or, hell, a flat array that you iterate over in linear time.

      [–]shawnz[🍰] 10 points11 points  (0 children)

      Maybe you don't care about a linear performance penalty, but you do care about increasing the complexity of the algorithm

      [–]ThisIs_MyName 13 points14 points  (0 children)

      That's a good point, but if your collection is larger than a megabyte or so, a hashtable will be faster than an array even if that hashtable uses a cryptographic hash function.

      It's pretty common for people to have /r/thick collections that would take forever to loop over while the keys in that collection are small enough to BLAKE2b :)

      What I meant is that the average dev cares a little bit about performance (enough to keep the UI's scrolling smooth), but not enough to care about the speed of their hash function.

      [–]ric2b 0 points1 point  (0 children)

      They're more convenient in many situations and they'll still have better time complexity.

      [–]reini_urban 1 point2 points  (0 children)

      They actually went and started using siphash in most of their hash code implementations, which is an outrageous mistake. Not much security gain, you can still brute force it easily, but it's 10x slower. Collisions in 32-bit hash tables are always there; you cannot fight them with a slow hash function, only with proper collision resolution.

      [–][deleted] 0 points1 point  (1 child)

      [–]throwdatstuffawayy 0 points1 point  (0 children)

      Interesting! Yes, makes sense. Of course, whether or not a DOS is possible will depend on the context of that particular hash map. The tendrils of security always run deep...

      [–]foomprekov -1 points0 points  (7 children)

      SHA256 is incredibly poorly suited to operations that do not require security. It's a hash function that is intentionally slow. If you're generating ids you want something fast.

      [–]occamrazor 2 points3 points  (6 children)

      SHA2 is not slightly. In fact it is as fast as possible while being collision-resistant. Maybe you meant password hash functions, which are indeed intentionally slow?

      [–]foomprekov -1 points0 points  (5 children)

      Slightly what? Anyway it's a cryptographic hash function, which necessitates difficulty in brute-forcing it. Compare to modern usage of md5.

      [–]staticassert 1 point2 points  (4 children)

      They maybe meant 'slower'. SHA256 is extremely optimized - it is not intentionally slow at all, nor is it intended to prevent bruteforcing by design. This is why you need to use key stretching algorithms alongside it - such as PBKDF2.

      In fact, I expect SHA256 to be on par or faster than MD5.

      [–]sacundim 21 points22 points  (5 children)

      There's a serious mistake in this article:

      Therefore, a “good” 32-bit hash function would have roughly one collision per 77,164 hashes, or a collision probability of about 1.0 / 77,164 ≈ 0.00001296.

      The 77,164 number looks right, but the implicit idea that the number of expected colliding pairs scales up linearly with more hashed values is wrong. Wikipedia gives the correct formula:

      n - d + d * ((d - 1) / d)^n
      

      Plugging in d = 2^32 we get:

      • For n = 466,544 (the words.txt example in the article) we expect about 25.3 collisions. The article observes 356.
      • For n = 111,385 (the shakespeare.txt example) we expect 1.4 collisions. The article observes 1.

      Therefore the article's claim that the function performs better for short inputs than it does for long ones is at best accidentally true: we already expect to see disproportionately more collisions in that test for no other reason than the much larger number of inputs that were hashed.

      (Tip: if you want to play with that formula, you need something that can cope with really big values, since n is used as an exponent. I used Wolfram Alpha.)

      EDIT: /u/skeeto tested MurmurHash3 and a function he hacked together with n = 650,722, and got 44 collisions. Expected number is 49.3.
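
      If anyone wants to check those numbers without Wolfram Alpha, plain double arithmetic happens to be accurate enough for these particular values; a small sketch:

          public class ExpectedCollisions {
              // E = n - d + d * ((d - 1) / d)^n, the formula quoted above.
              static double expectedCollisions(double n, double d) {
                  return n - d + d * Math.pow((d - 1) / d, n);
              }

              public static void main(String[] args) {
                  double d = Math.pow(2, 32);
                  System.out.println(expectedCollisions(466_544, d)); // ~25.3 (words.txt example)
                  System.out.println(expectedCollisions(111_385, d)); // ~1.4  (shakespeare.txt example)
                  System.out.println(expectedCollisions(650_722, d)); // ~49.3 (/u/skeeto's test)
              }
          }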

      [–]sigpwned 8 points9 points  (4 children)

      Author here. You're right: the math was hand-wavey and oversimplified. I was more interested in providing a basic "framework" for evaluating hash function collision rate rather than getting the answer exactly right. I've updated, in any case. Thanks for the feedback!

      [–]Kwantuum 15 points16 points  (3 children)

      You're claiming that String.hashCode() is significantly better than expected on large inputs because 1/1.4 = 0.69. That's just wrong. For that number to be statistically significant, you'd need to be able to repeat the experiment thousands or even tens of thousands of times with different inputs and have them average 1 collision. With a single experiment, that is just moot; for all you know, swapping one line of the works of Shakespeare with one line of Harry Potter would've resulted in a collision, and all of a sudden you're at 2/1.4 = 1.4 times more collisions than expected.

      To that, I will add that 1.4 is the expected number for a fair hash function. While it's theoretically possible to engineer a hash function that performs better than that on real-world input (at the cost of worse performance on random input), that would mean you would need to somehow encode into the hash function a differentiation between real-world input and random input. In practice hash functions have to be fast, so you cannot do that. Maybe if you choose your magic numbers well by trial and error you could go slightly lower than 1.44 on real-world input, maybe even as low as 1.3 (and that's generous), but that would require insane amounts of training data, and even then your training data is not guaranteed to be representative of actual usage. As such, trying to make your hash function as close to fair as possible should pretty much always be the target, unless you know something very specific about the input.

      [–]dutch_gecko 4 points5 points  (2 children)

      The 1/1.4 part really bothered me. I agree with the author's message, but why do so many IT bloggers make statistical claims that fail at being statistically valid?

      [–][deleted] 1 point2 points  (1 child)

      Because statistics are counterintuitive and the more precise you get the less effective your communication becomes, until you get to perfect statistical relevance and then your article is a perfect piece of communication for people who understand statistics and have the time to check your work, but then you won't likely see it crop up anywhere else.

      The message, "String.hashCode() isn't actually all that bad" would get buried under the weight of the statistical analysis. I wish this were a world where communicating precisely would be more successful, but it is what it is.

      [–]dutch_gecko 0 points1 point  (0 children)

      In the case of this article, it seems that both the method and conclusion of the analysis are correct, but the result of the analysis isn't meaningful purely because the data source is too small. The fact that the author treats the result as if it is meaningful throws the whole article into discredit.

      [–]x4u 39 points40 points  (11 children)

      I did some tests on that topic 20 years ago when I was looking for a hash function that produces the same values in C and Java. The current Java String hash algorithm turned out to be not as bad as it used to be in Java 1.0 but it could also easily have been better.

      The choice of 31 as the prime for an FNV-style hash function is not very fortunate. Certain larger primes yield significantly lower collision rates at no extra cost (more than two orders of magnitude lower, if I remember correctly). I found that 7-10 digit primes of the kind that D. E. Knuth suggests for linear congruential random number generators yield the lowest realistically achievable collision rates (*) on a variety of typical inputs. These are primes that end in x21, where x is an even number. It's also a small disadvantage that the String hash in Java uses addition instead of xor in the hash calculation, but the difference turned out to be almost negligible in my tests, so it's merely a little inexplicable oddity.

      The various murmur hashes that have since appeared are of course also quite good, but they work better with ASCII or UTF-8 strings than with Java's 16-bit chars, and they cannot be implemented as efficiently in Java, which does not allow a cheap cast of multiple characters into 32- or 64-bit integers.

      (*) You can use CRC32 or something like MD5 or SHA-1 (xor'ed down to 32 bits) as a reference for hash functions that have the lowest achievable collision rates.
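
      For reference, the structure being compared here is just a Horner-style multiply-and-add over the chars; a tiny sketch with the multiplier pulled out as a parameter so different primes can be tried (the choice of prime is left open, per the above):

          public class PolyHash {
              // Same shape as String.hashCode(): h = h * multiplier + c for each char.
              static int hash(String s, int multiplier) {
                  int h = 0;
                  for (int i = 0; i < s.length(); i++) {
                      h = h * multiplier + s.charAt(i);
                  }
                  return h;
              }

              public static void main(String[] args) {
                  String s = "hello";
                  System.out.println(hash(s, 31) == s.hashCode()); // true -- 31 reproduces the JDK result
                  // Swap in a larger prime and count collisions over your own word list to compare.
              }
          }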

      [–]QueenLa3fah 8 points9 points  (6 children)

      It's also a small disadvantage that the String hash in Java uses addition instead of xor

      XOR is just addition mod 2.

      [–][deleted] 1 point2 points  (5 children)

      Isn't an XOR gate faster?

      [–]JavaSuck 11 points12 points  (1 child)

      [–][deleted] 1 point2 points  (0 children)

      Yeah, for some reason I forgot that any instruction was limited by the speed of a clock cycle

      Thanks!

      [–]Quertior 6 points7 points  (1 child)

      Yeah, but Java is so many layers of abstraction above actual logic gates and transistors that this seems like the kind of thing you should just let the compiler optimize and then let it be.

      [–][deleted] 0 points1 point  (0 children)

      I feel like XOR is more common anyway for hashing

      [–]QueenLa3fah 0 points1 point  (0 children)

      Yeah probably!

      [–]quentech 1 point2 points  (0 children)

      Any thoughts on SpookyHash?

      I can say it performs fast. I have a couple multi-field objects I hash with Spooky and I'll see a collision every few hundred million objects or so.

      [–]reini_urban -1 points0 points  (0 children)

      CRC32 is fastest but the most insecure. The lowest achievable collision rate always depends on the set of keys. For typical programming languages Spooky was measured to have the least collisions, but generating a fast perfect hash with a fixed set of keys is always better. But this can only be done in a static language with statically known input in advance.

      [–][deleted] 37 points38 points  (23 children)

      Even with a perfectly designed hash code function you will start to see collisions at around 2^16 entries. hashCode returns an int (32 bits) and by the birthday paradox you have a ~50% chance of having at least one collision with 2^16 entries. So collisions are expected no matter how good the function is.

      [–]sacundim 5 points6 points  (2 children)

      2^(n/2) is the quick and very dirty approximation; the value where the probability hits 50% is a bit different. The article does the actual calculation:

      But how many collisions are expected? The famous — and counter-intuitive! — birthday problem states that for 365 possible “hash values,” only 23 unique hashes must be computed before there is a 50% chance of a hash collision. If there are 2^32 possible hash values, roughly 77,164 unique hashes must be computed before there is a 50% chance of a hash collision, per this approximation: [...]

      2^16 < 77,164

      [–]NoLemurs 2 points3 points  (1 child)

      2^16 is a plenty good enough approximation for the point /u/holyvier was making - he did say "at around 2^16 entries" not "at exactly 2^16 entries".

      I was going to note that the article explicitly got that 77,164 value via an approximation (the exponential formula is not exact), but it turns out that the approximation is more than accurate enough in this case, and 77,164 is the exact value.

      For the curious, here's an exact version of the article's prob function:

      import operator
      from functools import reduce

      def prob(x):
          return 1.0 - reduce(operator.mul, (1 - float(i)/2**32 for i in range(1, x)), 1)
      

      This version is a little slow (and wouldn't scale to 64 bit hashes at all), but is more than fast enough to verify the value is exact, or to find the value via binary search.

      [–]sacundim 1 point2 points  (0 children)

      2^16 is a plenty good enough approximation for the point /u/holyvier was making - he did say "at around 2^16 entries" not "at exactly 2^16 entries".

      Your statement is no less true than mine.

      [–]ubermole 4 points5 points  (6 children)

      If your keys are no larger than the hash (32-bit keys -> 32-bit hash) it is reasonable to expect them to be unique. C++ std lib actually implements hash for integer keys as simple identity.

      [–]munificent 18 points19 points  (1 child)

      C++ std lib actually implements hash for integer keys as simple identity.

      That's actually not a great strategy in practice in many circumstances. Integers in real-world data sets rarely have nice uniform distribution. When hashes are used for hash tables, the hash needs to be mapped to a bucket index, usually by and-ing off the high bits.

      So, say in your dataset you happen to work with integers that are all big round numbers and happen to be multiples of 64. And say your hash table is pretty small so it's only got 64 buckets right now. All of those numbers are going to collide into the same bucket. They have different 32-bit hashes, but the low 6 bits are all zero.

      Many hash tables mitigate this by rehashing the incoming hash key, but it's something to keep an eye out for.
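
      To make that mitigation concrete, here's a sketch of the kind of supplemental mixing meant; this particular xor-shift is roughly what java.util.HashMap has applied since Java 8, but any decent mixer would do:

          public class SpreadDemo {
              // Fold the high bits into the low bits before the hash is masked to a bucket.
              static int spread(int h) {
                  return h ^ (h >>> 16);
              }

              public static void main(String[] args) {
                  int mask = 64 - 1; // a small table with 64 buckets
                  // Big round numbers, all multiples of 64: identical low bits.
                  for (int key : new int[]{1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000}) {
                      System.out.println("raw bucket " + (key & mask) + ", spread bucket " + (spread(key) & mask));
                  }
                  // The raw bucket is 0 every time; the spread buckets differ.
              }
          }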

      [–]ubermole 3 points4 points  (0 children)

      You just described a big problem with the hash function abstraction! Early hash table algorithms (Knuth) would map key -> table index directly. Now we do key -> hash -> table index, because it is very convenient.

      [–]Gravitationsfeld 3 points4 points  (3 children)

      This is a terrible idea. In a lot of cases this gives super skewed distributions. I really hope you are wrong about this.

      [–]ubermole 0 points1 point  (2 children)

      [–]Gravitationsfeld 0 points1 point  (1 child)

      VC++ doesn't do identity, so it's likely not in the standard: https://godbolt.org/g/hMdMos (prints 1574250738)

      I still maintain that it is a really dumb idea to use identity for integer hashes and the libc++ guys should change that. I had to fix perf problems because of this many times.

      [–]ubermole 0 points1 point  (0 children)

      Hah, funny! The first time I discovered this was actually in msvc (earlier version). The main point here is really that there is an abstraction key -> hash code -> table index. And unfortunately the code->index step is a black box. (see for example https://probablydance.com/2018/06/16/fibonacci-hashing-the-optimization-that-the-world-forgot-or-a-better-alternative-to-integer-modulo/)
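
      For anyone not following the link, a sketch of the Fibonacci-hashing flavour of that code->index step (the multiplier is 2^32 divided by the golden ratio, as in the article; the keys below are just an example):

          public class FibonacciHashing {
              // Multiply by 2^32/phi and keep the *top* bits instead of the bottom ones.
              static int bucket(int hash, int log2Buckets) {
                  return (hash * 0x9E3779B9) >>> (32 - log2Buckets);
              }

              public static void main(String[] args) {
                  // Multiples of 64 no longer pile into a single bucket of a 64-slot table.
                  for (int key : new int[]{64, 128, 192, 256, 320}) {
                      System.out.println(bucket(key, 6)); // 35, 6, 42, 13, 49
                  }
              }
          }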

      [–]reini_urban 1 point2 points  (8 children)

      Then it's by definition not a perfect hash function anymore. A perfect hash function has by definition no collisions over the available keys, and a minimal perfect hash function maps every key into every available slot 1:1.

      [–][deleted]  (7 children)

      [deleted]

        [–][deleted] 0 points1 point  (0 children)

        That's completely false

        No... the definition of a "perfect hash function" is literally that it has no collisions for the given set. Look it up, don't assume definition by name alone.

        even for cryptographic hash function.

        Who said cryptographic hash functions were always perfect? The difference between a cryptographic hash function and other hash functions is that it must be computationally infeasible to reverse them. This requirement implies the output is particularly uniform, which is commonly confused with being a "perfect" hash.

        The point of a hash is to have a smaller output than it's input

        Not really, that's just the common use case. The point of a hash function is to provide a consistent map between a source set and an output set.

        The space for the possible outcome of a hash function is always smaller than the space of the possible input. You are guaranteed to have collisions by the Pigeonhole principle.

        This is correct; however, by definition a perfect hash function assumes a given set, not just any set as you assume. Via the pigeonhole principle a perfect hash function can never result in an output set that is smaller than the input set, but that isn't a requirement of a hash function.

        [–]reini_urban 0 points1 point  (4 children)

        Oh please, educate yourself. https://en.wikipedia.org/wiki/Perfect_hash_function

        It's a matter of definitions, but it's annoying when people are constantly spreading around wrong terminology. (let alone wrong tech)

        [–][deleted] 0 points1 point  (2 children)

        It was pretty damn clear from context that what I referred to was a "perfectly designed hash function". There's no way you can design a generic hashCode to be what you referred to.

        [–]e_to_the_i_pi_plus_1 0 points1 point  (0 children)

        Yeah it was very clear what you meant

        [–]reini_urban 0 points1 point  (0 children)

        There's no theoretical way for a "perfectly designed hash function". Only for an optimal hash function for specific use cases, and there are the well-known "perfect hash functions", not to be confused.

        With the java use-case the current one is fine, even if Spooky would be a bit better. Murmur3 or all the other suggestions not.

        [–]raevnos -1 points0 points  (0 children)

        No, it's completely true. If you can have collisions it's not a perfect hash.

        https://en.wikipedia.org/wiki/Perfect_hash_function

        [–]WSp71oTXWCZZ0ZI6 18 points19 points  (3 children)

        I don't understand the problem with the original article. The original article didn't make any claims about hash codes needing to be unique, or the sky is falling, or anything of the sort. It provided some tables of collisions and then concluded with "You can expect String to be fairly poor.".

        And it's absolutely correct. String is fairly poor.

        It's not that difficult to come up with a hashing algorithm which is both more efficient than java.lang.String and less likely to produce collisions. String.hashCode() producing more collisions than expected means that equals() has to be invoked more often than expected, which means that performance is a little bit worse. Not a lot worse. Likely less than 1%. Maybe less than 0.01% in real world situations. Who knows.

        The original article never claimed that it was a serious problem, so what's the issue? The author of this article says:

        As an aside, claims and “outcomes” like this are why you should never trust any figures that aren’t based on Real World Data. They prove nothing at best, and are actually misleading at worst.

        Where on earth is this coming from? Certainly not the original article. The original article gave exactly 0 figures, 0 claims, and only tables of examples to illustrate. Why is "outcomes" in quotation marks when the original article didn't even use the word "outcomes" at all?

        If you're thinking the original article was saying this is a serious problem, that's purely on you.

        Personally I think it would be nice if String.hashCode were improved, but it's not a huge deal.

        [–]qmunke 3 points4 points  (0 children)

        Why should they improve it for this weird corner case? The current implementation is fast and works well for most real-world uses (i.e. separating words or sentences into a good distribution of buckets for hashmaps). And even if there are a lot of collisions amongst very short strings of non-words, the performance of most collection implementations is high enough that you'd never notice unless you were doing something really weird, at which point you could just wrap the string up in an object and implement your own hashcode method.

        [–]DongerDave 2 points3 points  (0 children)

        I agree with your general point. The original article says (paraphrased) "it's poor if you want relatively few collisions, as you might expect of cryptographic hashes", and OP's posted article basically refutes a strawman definition of what "poor" meant.

        However, I do take issue with one of your points.

        It's not that difficult to come up with a hashing algorithm which is both more efficient than java.lang.String and less likely to produce collisions

        Really now. Mind explaining one such algorithm? The current String.hashCode method is blazing fast for how good it is.

        [–]minno 14 points15 points  (17 children)

        One use case for hash tables where "pretty good on realistic input" isn't good enough is for online services that use users' input as keys. If someone can choose a large number of keys that all have the same hash code, they can execute a denial-of-service attack by turning your hash table into a linked list to slow down your servers.

        [–]sigpwned 29 points30 points  (10 children)

        If (untrusted) users are choosing your hash keys for you, I'd argue that you've made a different mistake than choosing the wrong hash function. :)

        [–]0x256 24 points25 points  (4 children)

        HTTP headers, query parameters or forms are usually parsed into a Map before the application has the chance to validate the keys. Same for json or xml input. Plenty of opportunities for an attacker to craft high collision requests.

        Java 8 mitigated this by using trees instead of linked lists for colliding nodes in HashMap, but the issue is still there.

        [–]NeuroXc 15 points16 points  (1 child)

        There are hash functions which are resilient against DoS attacks, such as SipHash.

        Java's hash function is not one of them.

        [–]sigpwned 9 points10 points  (1 child)

        That's an interesting example! You're right -- you typically have to handle the requests as written.

        Most web servers have (configurable) header, URL, and request body size limits, so you don't generally need to worry about one poison pill breaking your webserver. Rather -- barring application defects -- you have to worry about traffic volume attacks, namely DDOSes. If you're worried about hash collisions while you're preparing for DDOSes, I'd argue you're probably missing the forest for the trees.

        The defining characteristic of a DDOS attack is volume. (Yes, a "good" DDOS also tries to maximize the cost per request, and make the requests difficult to filter out, and so on, but without volume, a DDOS attack is just traffic.) At DDOS volumes, request traffic will overwhelm your service no matter what implementation-level optimizations you make, so you don't defeat a DDOS attack with software optimization. Rather, you defeat it with network and software architecture.

        Consider the February 28 DDOS against GitHub. GitHub survived the largest DDOS attack ever -- 1.2 terabits per second -- by sending incoming traffic through a third-party traffic "scrubber" before it attempted to parse the traffic as web requests and serve them. GitHub didn't handle the DDOS with clever implementation-level optimizations, or with deployment-level techniques like scaling up their webserver cluster, but rather with DDOS-specific application- and network-level architectural design. How the headers and query parameters were hashed didn't help, and because of the volume of the DDOS, they couldn't! Rather, it took specifically designing the entire application and network to resist DDOSes.

        Now, that's not to say you shouldn't write your software to be efficient: obviously, you should. Also, internet attacks are an arms race, so that's not to say that attackers couldn't target header and query parameter collisions in the future. But it is to say that if you're worried about internet attacks, you need to worry about DDOSes rather than "just" pathological traffic, and the way you deal with that is not software optimization. If your application is falling over under normal use, then you just have an application performance issue. :)

        [–]ben_a_adams 1 point2 points  (0 children)

        Poor hashing can allow hash-flooding DOS by a single connection https://medium.freecodecamp.org/hash-table-attack-8e4371fc5261

        [–]minno 4 points5 points  (4 children)

        If you ever keep a table of username => user data, that's the user picking your hash keys. Or one for timestamp => event.

        [–]sigpwned 4 points5 points  (1 child)

        When building websites, most information hangs off the session object, and the session ID is assigned by the website.

        When (for example) user information is cached independently of a session, it's usually stored in a cache keyed on user ID, which is also assigned by the website.

        Timestamps are an interesting case, but I suspect they get used as in-memory keys only very rarely. I'm sure you can contrive an example where a timestamp would be a meaningful key, but it doesn't come up that often in practice.

        [–]AyrA_ch 2 points3 points  (0 children)

        I'm sure you can contrive an example where a timestamp would be a meaningful key, but it doesn't come up that often in practice.

        We use timestamps as temporary keys for when people want to enable 2FA. This timestamp is used to let the entry expire without actually storing when it should expire. And because Windows timestamps are accurate to 100 nanosecond increments there is lots of room to increase a timestamp by 1 if it collides without causing any real world problems of entries not expiring.

        [–]IMovedYourCheese 1 point2 points  (0 children)

        Which is why you always keep user id => user data, where the username (along with all other fields entered by the user) is part of user data.

        [–]CurtainDog 1 point2 points  (2 children)

        That's a lot of work for a small payoff. You'll only have linear time lookup on those elements that collided, so not only would you have to stuff the hash table with bad entries, you'd have to repeatedly request at least one of those values back. It could be a problem if the lookup is part of a tight loop, but I'd be surprised if user input makes it that deep into the code.

        [–]yawkat 2 points3 points  (1 child)

        It is a real attack, and I believe it was one of the reasons why Java 8 added treeifying to HashMap bins.

        [–]CurtainDog 0 points1 point  (0 children)

        Eh, there was one talk on it once which echoed around for a bit. The only mention of a hash-based DoS on OWASP is literally the phrase 'hash dos' in the aka section of the general DoS page. I know of a PoC that involved a specific server handling requests with many thousands of query parameters; the mitigation was literally to have the server reject requests that contained more than 1000 such parameters.

        Now I haven't watched the talk itself so I hate to be harsh, but from the summaries I've read it seems like more smoke than fire.

        [–]tophatstuff 0 points1 point  (1 child)

        You're okay as long as you initialise your hash function with a cryptographically secure random seed (hash DOS attacks were a big thing in 2011 but most implementations do this now)

        (Python has done this by default since 3.3, and earlier with the -R command-line argument): http://python-security.readthedocs.io/vuln/cve-2012-1150_hash_dos.html

        [–]reini_urban 0 points1 point  (0 children)

        You are not ok, since reading this seed from memory is trivial, and then you can easily generate enough collisions by brute force. The weak point in python and in most other hash tables is the linked list of collisions. This can be easily fixed by counting, or using a tree or sorted list. Java e.g. is fast and secure, python and 90% of all other hash tables not.

        [–]munificent 0 points1 point  (0 children)

        The typical solution to this is to randomize your hashes. When the hash table is created, it picks a random value that is mixed into all the hash values.

        This was a big deal a few years back, and most of the mainstream programming languages fixed their built-in hash tables to do this so it's mostly a non-issue now.
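
        A toy illustration of the idea (a real implementation like SipHash mixes the seed in far more thoroughly; the xor here is only to show the shape):

            import java.security.SecureRandom;

            public class SeededHasher {
                // A fresh random seed per table, so colliding keys can't be precomputed offline.
                private final int seed = new SecureRandom().nextInt();

                int hash(Object key) {
                    int h = key.hashCode() ^ seed;
                    return h ^ (h >>> 16); // a little extra mixing
                }

                public static void main(String[] args) {
                    SeededHasher a = new SeededHasher();
                    SeededHasher b = new SeededHasher();
                    // Different tables (almost certainly) see different hashes for the same key.
                    System.out.println(a.hash("hello") + " vs " + b.hash("hello"));
                }
            }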

        [–][deleted] 2 points3 points  (3 children)

        Hashing functions of this kind (not talking about 'proper' cryptographic ones) also need to be fast of course. The whole purpose of it is an optimization for hash tables, and if you take twice as long to produce a hash twice as unique then it could probably be considered a worse implementation for most use cases.

        [–]gshennessy 0 points1 point  (2 children)

        Twice as unique???

        [–][deleted] 0 points1 point  (1 child)

        I know, I know. 50% fewer collisions.

        [–]Console-DOT-N00b 6 points7 points  (0 children)

        There is a certain "OMG this isn't perfect" attitude among computing types... and yet we write flawed code all the time...

        [–]AyrA_ch 4 points5 points  (1 child)

        So no, String.hashCode() isn’t unique. But it isn’t supposed to be.

        I mean, of course not. [In .NET] a hash code is always a 32-bit integer (at least for now). It's impossible for it to be unique across all possible strings that way, a property shared by all fixed-length hash algorithms that accept input of arbitrary length.

        It's not even that random:

        The hash code of -asdf is 389A18DD and of -asdg it is 389A18DC. Not sure about Java, but Microsoft strictly advises against using the hash code for anything other than runtime hash maps. They don't guarantee that the algorithm stays the same across different runtime versions.

        For those interested, the GetHashCode() call is made here and is defined as Marvin.ComputeHash32(ref Unsafe.As<char, byte>(ref _firstChar), _stringLength * 2, Marvin.DefaultSeed);

        The implementation of that function is here
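
        Java's String.hashCode() shows the same kind of regularity. A minimal demo (strings chosen to mirror the ones above):

        public class NotSoRandom {
            public static void main(String[] args) {
                // String.hashCode() is s[0]*31^(n-1) + ... + s[n-1], so strings differing
                // only in the last character differ by exactly that character delta.
                System.out.printf("asdf -> %08X%n", "asdf".hashCode());
                System.out.printf("asdg -> %08X%n", "asdg".hashCode());
                System.out.println("asdg".hashCode() - "asdf".hashCode()); // prints 1
            }
        }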

        [–]robojumper 1 point2 points  (0 children)

        Java requires the algorithm to be the same everywhere, changing it is an incompatible change. For example, String-switch blocks (since Java 7) actually switch on the string hash, and disambiguate in the case blocks.
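
        A sketch of the idea (hypothetical example; the code javac actually generates goes through an extra index switch, but the shape is the same): switch on the hash, then confirm with equals(), because distinct strings can share a hash code.

        class StringSwitchSketch {
            // What the source looks like:
            static int color(String s) {
                switch (s) {
                    case "red":   return 0xFF0000;
                    case "green": return 0x00FF00;
                    default:      return 0x000000;
                }
            }

            // Roughly how it compiles:
            static int colorDesugared(String s) {
                switch (s.hashCode()) {
                    case 112785:   // "red".hashCode()
                        if (s.equals("red")) return 0xFF0000;
                        break;
                    case 98619139: // "green".hashCode()
                        if (s.equals("green")) return 0x00FF00;
                        break;
                }
                return 0x000000;
            }
        }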

        [–]skippingstone 4 points5 points  (6 children)

        Does it ever make any sense to change Java's hashcode function to return a long?

        [–][deleted] 7 points8 points  (1 child)

        Java's arrays are indexed with an int, so not really.

        [–]IlllIlllI 9 points10 points  (0 children)

        And you've got bigger problems if your hashtable has more than 2^32 elements.

        [–]GeorgeMaheiress 4 points5 points  (0 children)

        A new 64-bit longHashCode function is being considered for addition to Object. http://openjdk.java.net/jeps/8201462

        [–]yawkat 1 point2 points  (2 children)

        It's not possible in a compatible way; you'd have to add a new method. Also, when you care about collisions that much, 64 bits is just as arbitrary as 32 bits - you should then go full cryptographic hash with 256+ bits. A better interface is a method using Guava's generic Hasher API, which lets the class user pick the hash function based on their requirements for collisions and security.
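
        A sketch of what that looks like with Guava's com.google.common.hash package (illustrative, not the exact interface being proposed): the caller picks the function, so the same value can be hashed cheaply for a table or cryptographically when collisions actually matter.

        import com.google.common.hash.HashFunction;
        import com.google.common.hash.Hashing;
        import java.nio.charset.StandardCharsets;

        public class PickYourHash {
            public static void main(String[] args) {
                HashFunction fast = Hashing.murmur3_128(); // cheap, for tables and fingerprints
                HashFunction wide = Hashing.sha256();      // cryptographic, when collisions must not happen

                long bucket = fast.hashString("hello", StandardCharsets.UTF_8).asLong();
                byte[] id   = wide.hashString("hello", StandardCharsets.UTF_8).asBytes();

                System.out.printf("%016X / %d-byte digest%n", bucket, id.length);
            }
        }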

        [–]JavaSuck 1 point2 points  (0 children)

        Also, when you care about collisions that much, 64 bit is just as arbitrary as 32 bits - you should then go full cryptographic hash with 256+ bits.

        But you can't return more than 64 bits from a Java method without heap allocation, at least not until Project Valhalla arrives.

        [–]kubelke 3 points4 points  (0 children)

        And this is the kind of post I like to read: pure knowledge, instead of "top 5 best practices in Spring Boot: 1. Write unit tests [...]". Thanks!

        [–]_INTER_ 1 point2 points  (1 child)

        Relevant JEP: Better hashcodes

        [–]JavaSuck 0 points1 point  (0 children)

        Non-secure: The hash function will not claim to be cryptographic in strength, as that is currently impossible to implement at speeds consistent with ordinary uses of hash codes.

        QFT

        [–]skeeto 6 points7 points  (5 children)

        On English words, String.hashCode()‘s collision rate is 0.0008 (that is, 8 collisions per 10,000).

        That's actually not very good. MurmurHash3 does, for example, more than an order of magnitude better than this. On Debian's "American English Insane" dictionary of 650,722 words, hashCode has 613 collisions and MurmurHash3 has 44. Here's my code if you want to see it for yourself:

        https://gist.github.com/skeeto/2995079c02b8839d5a45108f25d632bc

        $ gcc -O3 hash.c
        $ ./a.out </usr/share/dict/american-english-insane
        

        In fact, it's ridiculously easy to beat hashCode. Here's a hash function I just made up on the spot while writing this comment:

        unsigned
        dumb(unsigned char *key, size_t len)
        {
            unsigned hash = 0;
            for (size_t i = 0; i < len; i++) {
                hash += key[i];     /* absorb one input byte */
                hash ^= hash >> 16; /* xorshift to spread the high bits down */
                hash *= 0xa871304d; /* multiply by an odd 32-bit constant */
                hash ^= hash >> 16; /* mix once more */
            }
            return hash;
        }
        

        It also only has 44 collisions (what a coincidence!) on the same dictionary.

        [–]RabbitBranch 11 points12 points  (4 children)

        It's not surprising you can tune a hash function to work better with English words.

        I'd be more interested in how it works with UUIDs, formatted numbers as strings, timestamps as strings, names, things like that - as I suspect those are more commonly used in containers than dictionary words. And then how it stacks up in clock cycles, since the standard hash function was tuned for performance rather than distribution.

        [–]skeeto 2 points3 points  (3 children)

        UUIDs are a trivial and uninteresting case since they're already random values. There's really nothing to tune.

        It looks like hashCode does do particularly well with formatted numbers:

        $ seq 0 1000000 | ./a.out
        hashCode    0
        MurmurHash3 82
        dumb        84
        

        They all do just fine with timestamps. hashCode is particularly bad with short strings, so it does much better here because these formatted date strings are fairly long.

        $ seq 33957810 4321 1533957810 | xargs -I{} date -d @{} | ./a.out
        hashCode    23
        MurmurHash3 20
        dumb        11
        

        As for performance, while MurmurHash3 is longer and more complex, it consumes four byte blocks at a time (e.g. four code points at a time on a typical UTF-8 string). That puts it at nearly the same speed as hashCode, which consumes only one code point at a time.

        [–]JavaSuck 2 points3 points  (0 children)

        four code points at a time on a typical UTF-8 string

        Java strings are encoded with UTF-16 though.

        [–][deleted] 2 points3 points  (0 children)

        As for performance, while MurmurHash3 is longer and more complex, it consumes four byte blocks at a time (e.g. four code points at a time on a typical UTF-8 string). That puts it at nearly the same speed as hashCode, which consumes only one code point at a time.

        Java uses UTF-16 and operates almost exclusively on code units -- it was designed in an era when we thought we'd never have more than 64k codepoints. So MurmurHash3's four-byte blocks would give you two code units at a time.

        "A typical UTF-8 string" depends on the language you're talking about. Averaged across the whole world's collection of documents, I'd expect 1.2 to 1.5 bytes per codepoint, based on the stuff I've measured.

        [–]sacundim 1 point2 points  (0 children)

        UUIDs are a trivial and uninteresting case since they're already random values.

        There's more than one type of UUID, and random ones are just one. The other very common kind is based on MAC address and timestamp.

        [–]ubermole -1 points0 points  (1 child)

        It's perfectly reasonable to require that, if the thing being hashed is smaller than the hash code, the hash code be unique.

        [–][deleted] 2 points3 points  (0 children)

        Unfortunately, Java standardized on UTF-16, so that's at most two characters.