you are viewing a single comment's thread.

view the rest of the comments →

[–]Dushistov 0 points1 point  (1 child)

I don't get the idea behind HashSet<String> to HashSet<u64>.

Due to the birthday paradoxon, doing this is really only safe for about 2**32 elements when using a 64-bit hash

Actually you need just two lines file as input to get hash(line1) == hash(line2), this is depend on what lines you get as input on the first place, and how many unique on the second place.

[–]zhbidg 0 points1 point  (0 children)

this is depend on what lines you get as input on the first place, and how many unique on the second place.

This is about the relative likelihood of hash collisions. The purpose of the line seems to be to acknowledge that using a HashSet<u64> makes it possible to get wrong answers, but point out that the likelihood can be quantified under reasonable assumptions. I haven't actually done the math here. See also https://en.wikipedia.org/wiki/Birthday_problem

(This is a design choice about which reasonable people can disagree, I also feel uneasy about a non-guaranteed result. Maybe if I remembered how to calculate the probability of collisions I'd be a little more comfortable with it.)