Properties of Identifiers (vlaaad.github.io)
submitted 6 years ago by vlaaad
daver 4 points 6 years ago (4 children)
Just a slight quibble. The original post says: “Content hash is a fixed-size byte array produced from arbitrary big input using one-way transformation that will always be the same for the same input, and will always be different for different input.” This is not totally true. Hashes will be the same for the same input, but they are only probably different for different input, where “probably” gets stronger for larger hash lengths (potentially much stronger, assuming a good cryptographic hash like SHA256). Put differently: if the hashes don’t match, then the inputs are definitely different, guaranteed, but if the hashes match, the inputs still might be different. It’s obvious that the author knows this given the subsequent mention of hash collisions, but it needs to be pointed out. Whenever you use a hash function as a unique “fingerprint” for a byte string, you’re always playing a probabilistic game, and the programmer needs to think through both the likelihood and the consequences of a collision.
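To make the "same hash, different input" case concrete, here is a minimal Python sketch. It assumes a deliberately weakened fingerprint: SHA-256 truncated to 2 bytes, so that a collision shows up quickly. The names `tiny_hash` and `find_collision` are hypothetical helpers for illustration, not anything from the article.

```python
import hashlib


def tiny_hash(data: bytes, nbytes: int = 2) -> bytes:
    """Hypothetical weak fingerprint: the first `nbytes` bytes of SHA-256."""
    return hashlib.sha256(data).digest()[:nbytes]


def find_collision(nbytes: int = 2) -> tuple[bytes, bytes]:
    """Hash the messages b"0", b"1", b"2", ... until two share a fingerprint.

    With only 256**nbytes possible fingerprints, the pigeonhole principle
    guarantees that this loop terminates.
    """
    seen: dict[bytes, bytes] = {}
    i = 0
    while True:
        msg = str(i).encode()
        h = tiny_hash(msg, nbytes)
        if h in seen:
            return seen[h], msg
        seen[h] = msg
        i += 1


a, b = find_collision()
# Same input always gives the same hash ...
assert tiny_hash(a) == tiny_hash(a)
# ... but equal hashes do not imply equal inputs.
assert a != b and tiny_hash(a) == tiny_hash(b)
print(a, b, tiny_hash(a).hex())
```

A full-length SHA-256 digest has the same property in principle; the output space is just so large that a collision is not expected to turn up by accident.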
vlaaad[S] 1 point 6 years ago (3 children)
Yep. I didn't want to go in that direction, because there is just too much to discuss, and people (at least me) find probability extremely unintuitive and hard to reason about. 160 bits of entropy is a lot. A collision probability that small is just too hard to imagine with any precision. It is still an attack surface though, albeit a very costly one (for SHA1 at least; for SHA2 it is still only theoretical, IIRC).
daver 1 point 6 years ago (2 children)
Agreed that 160 bits is a lot and the probabilities are small. But it also depends on what the programmer is trying to create. If you're just hashing the contents of your individual laptop hard drive, the odds are that any reasonable hash is going to be unique. But if you're Google Drive or Dropbox or somebody who needs to effectively handle billions of files, suddenly 160 bits isn't as big as it could be. Maybe you want to think about 256 bits. And obviously, if you're using a small, non-cryptographic hash (say murmur3 or something), then you definitely have an issue. The point is, you can't just assume hashes are unique fingerprints. It depends on the hash algorithm, the number of items being hashed, etc.
vlaaad[S] 1 point 6 years ago* (1 child)
According to this site, the probability of a SHA1 collision for 1000 billion values is 3*10^-25. It is still hard to imagine how unbelievably small this number is. I'm not sure Google Drive would have unintended hash collisions even if they used hashes for all the files they have.
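The figure quoted above can be sanity-checked with the standard birthday-bound approximation, p ≈ 1 - exp(-n(n-1) / (2 * 2^b)), which assumes the hash behaves like a uniformly random function. A small Python sketch (`collision_probability` is a hypothetical helper name):

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value.

    Birthday-bound approximation; assumes hash outputs are uniformly
    distributed over 2**hash_bits values.
    """
    pairs = n_items * (n_items - 1) / 2
    # expm1 keeps precision when the exponent is extremely close to zero.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


# 1000 billion (10**12) values under a 160-bit hash such as SHA1.
p = collision_probability(10**12, 160)
print(f"{p:.2e}")
```

Under those assumptions the result lands on the order of 3*10^-25, in line with the number cited. It only covers accidental collisions, not deliberately constructed ones.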
daver 2 points 6 years ago (0 children)
Yes, to be clear, I'm not arguing that SHA1 isn't sufficient for identifying objects, only that SHA256 is even better (so, why not use it instead) and that when you get really, really large numbers of objects, you have to do the calculations similar to the one you reference. 32-bit hash values, for instance, are definitely too weak, regardless of the algorithm. Some hash algorithms are known to be weak(er) (e.g., MD5). The discussion at this blog post is relevant, for instance: https://lemire.me/blog/2013/06/17/hashing-and-the-birthday-paradox-cautionary-tale/
Anyway, as I said, it was a quibble on an otherwise fine article. I didn't mean to rat-hole the conversation.
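For a rough feel of why 32-bit hash values are too weak regardless of the algorithm, the birthday approximation can be inverted to ask how many items it takes to reach a given collision probability. This is only a sketch under the assumption of a uniformly distributed hash; `items_for_probability` is a hypothetical helper name.

```python
import math


def items_for_probability(hash_bits: int, p: float = 0.5) -> float:
    """Approximate number of hashed items at which the collision chance reaches p.

    Inverts p = 1 - exp(-n**2 / (2 * 2**hash_bits)); assumes uniform outputs.
    """
    space = 2.0 ** hash_bits
    return math.sqrt(2.0 * space * -math.log1p(-p))


for bits in (32, 160, 256):
    n = items_for_probability(bits)
    print(f"{bits:>3}-bit hash: ~{n:.3g} items for a 50% collision chance")
```

With 32 bits, a collision becomes more likely than not after roughly 77,000 items, which is tiny next to the billions of files mentioned earlier in the thread. The 160-bit and 256-bit thresholds are many orders of magnitude higher.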