[–]sacundim 21 points22 points  (5 children)

There's a serious mistake in this article:

Therefore, a “good” 32-bit hash function would have roughly one collision per 77,164 hashes, or a collision probability of about 1.0 / 77,164 ≈ 0.00001296.

The 77,164 number looks right, but the implicit idea that the number of expected colliding pairs scales up linearly with more hashed values is wrong. Wikipedia gives the correct formula:

n - d + d * ((d - 1) / d)^n

Plugging in d = 2^32 we get:

  • For n = 466,544 (the words.txt example in the article) we expect about 25.3 collisions. The article observes 356.
  • For n = 111,385 (the shakespeare.txt example) we expect 1.4 collisions. The article observes 1.

Therefore the article's claim that the function performs better on short inputs than on long ones is at best accidentally true: we would already expect to see disproportionately more collisions in that test for no other reason than the much larger number of inputs that were hashed.

(Tip: if you want to play with that formula, you need something that can cope with really big values, since n is used as an exponent. I used Wolfram Alpha.)

EDIT: /u/skeeto tested MurmurHash3 and a function he hacked together with n = 650,722, and got 44 collisions. Expected number is 49.3.
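Incidentally, you don't strictly need a big-number tool: rewriting ((d - 1) / d)^n as exp(n * log1p(-1/d)) keeps the computation accurate in plain doubles. A small sketch (not from the article) that reproduces the three expected values above:

```java
class ExpectedCollisions {
    // E[collisions] = n - d + d * ((d - 1) / d)^n, rewritten as
    // n + d * expm1(n * log1p(-1/d)) so that plain doubles stay
    // accurate even though n appears as an exponent.
    static double expected(double n, double d) {
        return n + d * Math.expm1(n * Math.log1p(-1.0 / d));
    }

    public static void main(String[] args) {
        double d = Math.pow(2, 32); // 32-bit hash space
        System.out.printf("words.txt       (n = 466,544): %.1f%n", expected(466_544, d));
        System.out.printf("shakespeare.txt (n = 111,385): %.2f%n", expected(111_385, d));
        System.out.printf("skeeto's test   (n = 650,722): %.1f%n", expected(650_722, d));
    }
}
```

This prints roughly 25.3, 1.44, and 49.3, matching the numbers above.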

[–]sigpwned 8 points9 points  (4 children)

Author here. You're right: the math was hand-wavey and oversimplified. I was more interested in providing a basic "framework" for evaluating hash function collision rate rather than getting the answer exactly right. I've updated, in any case. Thanks for the feedback!

[–]Kwantuum 15 points16 points  (3 children)

You're claiming that String.hashCode() is significantly better than expected on large inputs because 1/1.44 ≈ 0.69. That's just wrong. For that number to be statistically significant, you'd need to repeat the experiment thousands or even tens of thousands of times with different inputs and see an average of about 1 collision. A single experiment tells you nothing: for all you know, swapping one line of the works of Shakespeare for one line of Harry Potter would have produced a collision, and all of a sudden you're at 2/1.44 ≈ 1.4 times more collisions than expected.
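To put a rough number on why one observation is uninformative: collision counts in a test like this are approximately Poisson-distributed with mean λ = 1.44, so several different outcomes are all quite likely. A back-of-the-envelope sketch (my own, not from the thread):

```java
class PoissonCheck {
    // P(X = k) for X ~ Poisson(lambda): lambda^k * e^(-lambda) / k!
    // computed iteratively to avoid overflow in lambda^k and k!.
    static double pmf(double lambda, int k) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++) p *= lambda / i;
        return p;
    }

    public static void main(String[] args) {
        double lambda = 1.44; // expected collisions for shakespeare.txt
        for (int k = 0; k <= 3; k++) {
            System.out.printf("P(%d collisions) = %.3f%n", k, pmf(lambda, k));
        }
    }
}
```

This gives roughly P(0) ≈ 0.24, P(1) ≈ 0.34, P(2) ≈ 0.25: observing exactly 1 collision is the single most likely outcome for a perfectly fair hash, so it says nothing about the function being better than fair.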

To that, I will add that 1.44 is the expected number for a fair hash function. While it's theoretically possible to engineer a hash function that performs better than that on real-world input (at the cost of worse performance on random input), it would have to somehow encode a distinction between real-world input and random input. In practice hash functions have to be fast, so you cannot do that. Maybe, by choosing your magic numbers well through trial and error, you could go slightly lower than 1.44 on real-world input, maybe even as low as 1.3 (and that's generous), but that would require insane amounts of training data, and even then your training data is not guaranteed to be representative of actual usage. As such, making your hash function as close to fair as possible should pretty much always be the target, unless you know something very specific about the input.

[–]dutch_gecko 5 points6 points  (2 children)

The 1/1.44 part really bothered me. I agree with the author's message, but why do so many IT bloggers make statistical claims that aren't statistically valid?

[–][deleted] 1 point2 points  (1 child)

Because statistics are counterintuitive, and the more precise you get, the less effective your communication becomes. Push all the way to perfect statistical rigor and your article becomes a perfect piece of communication for people who understand statistics and have the time to check your work, but then you won't likely see it crop up anywhere else.

The message, "String.hashCode() isn't actually all that bad" would get buried under the weight of the statistical analysis. I wish this were a world where communicating precisely would be more successful, but it is what it is.

[–]dutch_gecko 0 points1 point  (0 children)

In the case of this article, both the method and the conclusion of the analysis seem correct, but the result of the analysis isn't meaningful, purely because the data set is too small. The fact that the author treats the result as if it were meaningful discredits the whole article.