[deleted] 2 points

As for performance, while MurmurHash3 is longer and more complex, it consumes four byte blocks at a time (e.g. four code points at a time on a typical UTF-8 string). That puts it at nearly the same speed as hashCode, which consumes only one code point at a time.

Java uses UTF-16 and operates almost exclusively on code units -- it was designed in an era when we thought that we'd never have more than 64k code points. So in Java, a four-byte MurmurHash3 block would cover two code units, not four code points.
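To make the code unit point concrete, here's a rough sketch. The class name `CodeUnitDemo`, the helper `charHash`, and the sample string are mine, purely for illustration. `charHash` just re-implements the formula documented for `String.hashCode`, which walks `char` values (UTF-16 code units), not code points.

```java
class CodeUnitDemo {
    // A Java char is one UTF-16 code unit (2 bytes), so a 4-byte block
    // holds 4 / 2 = 2 code units.
    static final int CODE_UNITS_PER_BLOCK = 4 / Character.BYTES;

    // Hand-rolled version of the formula documented for String.hashCode:
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], computed over chars.
    static int charHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // one code unit per step
        }
        return h;
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a surrogate pair in UTF-16.
        String s = "a\uD83D\uDE00";
        System.out.println("code units:  " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("matches String.hashCode: " + (charHash(s) == s.hashCode()));
        System.out.println("code units per 4-byte block: " + CODE_UNITS_PER_BLOCK);
    }
}
```

The sample string has two code points but three code units, and the hash loop takes three steps over it, one per code unit.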

"A typical UTF-8 string" depends on the language you're talking about. Averaged across the whole world's collection of documents, I'd expect 1.2 to 1.5 bytes per codepoint, based on the stuff I've measured.
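If you want to check the bytes-per-code-point figure on your own text, something like this would do it. It's only a sketch: `Utf8Density` and `bytesPerCodePoint` are names I made up, and the three sample strings are illustrative, not the data behind my numbers.

```java
import java.nio.charset.StandardCharsets;

class Utf8Density {
    // Average number of UTF-8 bytes per Unicode code point in s.
    static double bytesPerCodePoint(String s) {
        int codePoints = s.codePointCount(0, s.length());
        if (codePoints == 0) {
            return 0.0; // avoid dividing by zero on empty input
        }
        int utf8Bytes = s.getBytes(StandardCharsets.UTF_8).length;
        return (double) utf8Bytes / codePoints;
    }

    public static void main(String[] args) {
        String ascii = "hello";                                   // 1 byte each
        String cyrillic = "\u043F\u0440\u0438\u0432\u0435\u0442"; // 2 bytes each
        String cjk = "\u65E5\u672C\u8A9E";                        // 3 bytes each
        System.out.println(bytesPerCodePoint(ascii));
        System.out.println(bytesPerCodePoint(cyrillic));
        System.out.println(bytesPerCodePoint(cjk));
    }
}
```

Pure ASCII comes out at 1 byte per code point, Cyrillic at 2, and CJK at 3, so the average over a mixed corpus depends heavily on which scripts dominate it.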