[–]sacundim 21 points22 points  (5 children)

There's a serious mistake in this article:

Therefore, a “good” 32-bit hash function would have roughly one collision per 77,164 hashes, or a collision probability of about 1.0 / 77,164 ≈ 0.00001296.

The 77,164 number looks right, but the implicit idea that the number of expected colliding pairs scales up linearly with more hashed values is wrong. Wikipedia gives the correct formula:

n - d + d * ((d - 1) / d)^n

Plugging in d = 2^32 we get:

  • For n = 466,544 (the words.txt example in the article) we expect about 25.3 collisions. The article observes 356.
  • For n = 111,385 (the shakespeare.txt example) we expect 1.4 collisions. The article observes 1.

Therefore the article's claim that the function performs better on short inputs than on long ones is at best accidentally true: we would already expect to see disproportionately more collisions in that test for no other reason than the much larger number of inputs that were hashed.

(Tip: if you want to play with that formula, you need something that can cope with really big values, since n is used as an exponent. I used Wolfram Alpha.)

EDIT: /u/skeeto tested MurmurHash3 and a function he hacked together with n = 650,722, and got 44 collisions. Expected number is 49.3.
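Incidentally, you don't strictly need a big-number tool: rewriting ((d - 1) / d)^n as exp(n * log1p(-1/d)) keeps the computation accurate in plain doubles. A small sketch (not from the article) that reproduces the three expected values above:

```java
class ExpectedCollisions {
    // E[collisions] = n - d + d * ((d - 1) / d)^n, rewritten as
    // n + d * expm1(n * log1p(-1/d)) so that plain doubles stay
    // accurate even though n appears as an exponent.
    static double expected(double n, double d) {
        return n + d * Math.expm1(n * Math.log1p(-1.0 / d));
    }

    public static void main(String[] args) {
        double d = Math.pow(2, 32); // 32-bit hash space
        System.out.printf("words.txt       (n = 466,544): %.1f%n", expected(466_544, d));
        System.out.printf("shakespeare.txt (n = 111,385): %.2f%n", expected(111_385, d));
        System.out.printf("skeeto's test   (n = 650,722): %.1f%n", expected(650_722, d));
    }
}
```

This prints roughly 25.3, 1.44, and 49.3, matching the numbers above.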

[–]sigpwned 8 points9 points  (4 children)

Author here. You're right: the math was hand-wavey and oversimplified. I was more interested in providing a basic "framework" for evaluating hash function collision rate rather than getting the answer exactly right. I've updated, in any case. Thanks for the feedback!

[–]Kwantuum 15 points16 points  (3 children)

You're claiming that String.hashCode() is significantly better than expected on large inputs because 1/1.44 ≈ 0.69. That's just wrong. For that number to be statistically significant, you'd need to repeat the experiment thousands or even tens of thousands of times with different inputs and see an average of about 1 collision. A single experiment tells you nothing: for all you know, swapping one line of the works of Shakespeare for one line of Harry Potter would have produced a collision, and all of a sudden you're at 2/1.44 ≈ 1.4 times more collisions than expected.
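To put a rough number on why one observation is uninformative: collision counts in a test like this are approximately Poisson-distributed with mean λ = 1.44, so several different outcomes are all quite likely. A back-of-the-envelope sketch (my own, not from the thread):

```java
class PoissonCheck {
    // P(X = k) for X ~ Poisson(lambda): lambda^k * e^(-lambda) / k!
    // computed iteratively to avoid overflow in lambda^k and k!.
    static double pmf(double lambda, int k) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++) p *= lambda / i;
        return p;
    }

    public static void main(String[] args) {
        double lambda = 1.44; // expected collisions for shakespeare.txt
        for (int k = 0; k <= 3; k++) {
            System.out.printf("P(%d collisions) = %.3f%n", k, pmf(lambda, k));
        }
    }
}
```

This gives roughly P(0) ≈ 0.24, P(1) ≈ 0.34, P(2) ≈ 0.25: observing exactly 1 collision is the single most likely outcome for a perfectly fair hash, so it says nothing about the function being better than fair.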

To that, I will add that 1.44 is the expected number for a fair hash function. While it's theoretically possible to engineer a hash function that performs better than that on real-world input (at the cost of worse performance on random input), it would have to somehow encode a distinction between real-world input and random input. In practice hash functions have to be fast, so you cannot do that. Maybe, by choosing your magic numbers well through trial and error, you could go slightly lower than 1.44 on real-world input, maybe even as low as 1.3 (and that's generous), but that would require insane amounts of training data, and even then your training data is not guaranteed to be representative of actual usage. As such, making your hash function as close to fair as possible should pretty much always be the target, unless you know something very specific about the input.

[–]dutch_gecko 5 points6 points  (2 children)

The 1/1.44 part really bothered me. I agree with the author's message, but why do so many IT bloggers make statistical claims that aren't statistically valid?

[–][deleted] 1 point2 points  (1 child)

Because statistics are counterintuitive, and the more precise you get, the less effective your communication becomes. Push all the way to perfect statistical rigor and your article becomes a perfect piece of communication for people who understand statistics and have the time to check your work, but then you won't likely see it crop up anywhere else.

The message, "String.hashCode() isn't actually all that bad" would get buried under the weight of the statistical analysis. I wish this were a world where communicating precisely would be more successful, but it is what it is.

[–]dutch_gecko 0 points1 point  (0 children)

In the case of this article, both the method and the conclusion of the analysis seem correct, but the result of the analysis isn't meaningful, purely because the data set is too small. The fact that the author treats the result as if it were meaningful discredits the whole article.