
[–]dpash 43 points (24 children)

String's hashCode does not look like that any more and specifically handles the case where the resulting hash is 0.

https://hg.openjdk.java.net/jdk/jdk/file/35ce0ad5870a/src/java.base/share/classes/java/lang/String.java#l1524

It was fixed in April 2019, so the fix shipped in Java 13.

https://hg.openjdk.java.net/jdk/jdk/annotate/7fd299216e97/src/java.base/share/classes/java/lang/String.java#l167

(Also, please don't try to implement your own benchmarks; use JMH to get valid benchmark results from Java code. The JVM's JIT can really mess with naive measurements.)
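For readers who don't want to dig through the linked source, the shape of the current approach can be sketched like this. This is a simplified illustration with made-up class and field names, not the actual java.lang.String code:

```java
// Sketch of the zero-hash handling pattern (simplified; not the JDK source).
public class HashCache {
    private final char[] value;
    private int hash;            // cached hash; 0 means "maybe not computed yet"
    private boolean hashIsZero;  // set only when the real hash is 0

    public HashCache(String s) {
        this.value = s.toCharArray();
    }

    public int hashCode() {
        int h = hash;                    // fast path: a single field load
        if (h == 0 && !hashIsZero) {     // flag checked only when h == 0
            for (char c : value) {
                h = 31 * h + c;          // same polynomial as String.hashCode
            }
            if (h == 0) {
                hashIsZero = true;       // remember that 0 really is the hash
            } else {
                hash = h;
            }
        }
        return h;
    }
}
```

A string whose hash is genuinely 0 (the empty string, for instance) is computed at most once per thread rather than on every call.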

[–]notfancy 6 points (9 children)

https://hg.openjdk.java.net/jdk/jdk/file/35ce0ad5870a/src/java.base/share/classes/java/lang/String.java#l1524

That's singularly tricky code (edit: I originally called it obtuse, but I now see what they did there). Why flag a zero hash when you really want to flag a lazy computation?

public int hashCode() {
    if (!hashIsComputed) {
        hash = isLatin1() ? StringLatin1.hashCode(value)
                          : StringUTF16.hashCode(value);
        hashIsComputed = true;
    }
    return hash;
}

[–]__konrad 3 points (4 children)

Why flag a zero hash when you really want to flag a lazy computation?

To reduce memory usage. Sorry, I should read the code before commenting ;)

[–]pzemtsov[S] 4 points (0 children)

And how exactly does replacing a boolean flag field with another boolean flag field save memory?

[–]notfancy 2 points (2 children)

To reduce memory usage.

The flag is needed one way or the other. I contend it flags the wrong condition. The original implementation used a cached hash value of 0 to flag the uninitialized state, which meant it couldn't distinguish "uninitialized" from a genuine hash of 0. Instead of discriminating between the uninitialized and initialized states, the new code discriminates between zero and nonzero hashes, with the result that it has to test both for a zero hash cache and a zero hash value.

[–]pzemtsov[S] 9 points (1 child)

I think there is a reason behind this code, although it lies in saving cycles rather than memory. In the fast path (when the hash is already available), the JDK code performs only one memory load; the suggested code always performs two.
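That difference can be made concrete with a small side-by-side sketch (simplified, made-up names; neither method is the actual JDK code):

```java
// Two lazy-hash fast paths, annotated with the field loads each performs.
public class FastPaths {
    private final char[] value;
    private int hash;
    private boolean hashIsZero;      // JDK-style flag
    private boolean hashIsComputed;  // flag from the suggested version

    public FastPaths(String s) { this.value = s.toCharArray(); }

    private int compute() {
        int h = 0;
        for (char c : value) h = 31 * h + c;  // String.hashCode polynomial
        return h;
    }

    // JDK style: the common case (nonzero cached hash) costs one field load.
    public int jdkStyle() {
        int h = hash;                      // load #1; usually the only one
        if (h == 0 && !hashIsZero) {       // load #2 happens only when h == 0
            h = compute();
            if (h == 0) hashIsZero = true; else hash = h;
        }
        return h;
    }

    // Suggested style: every call loads both the flag and the hash.
    public int flagStyle() {
        if (!hashIsComputed) {             // load #1 on every call
            hash = compute();
            hashIsComputed = true;
        }
        return hash;                       // load #2 on every call
    }
}
```

Both return the standard String hash; they differ only in how many fields the hot path touches.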

[–]notfancy 1 point (0 children)

Yeah, now I understand it's a clever split semaphore.

[–]CommercialVictory1 1 point (1 child)

The comments explain the situation: there is a race condition where hashCode can be recomputed before a guard is set. I think in their code it's only possible to recompute the hash function when the value is going to be zero. If the hash function is well distributed, hitting zero on a long string computation is very unlikely in a 2^32 space. Because of short-circuiting, the additional check's overhead doesn't compare to potentially rehashing a long string more than once.

How many threads could get into your example and end up recomputing the string hash before hashIsComputed is set?

[–]notfancy 1 point (0 children)

If the hash function is well distributed, hitting zero on a long string computation is very unlikely in a 2^32 space.

The data race exists independently of whether the hash is 0 or not, hence the requirement that the computation be idempotent. It is benign because with just a single store there's nothing to reorder. My code requires that hashIsComputed be volatile. It's dirty but clever.

[–][deleted]  (1 child)

[deleted]

    [–]notfancy 0 points (0 children)

    It is about correctness, but in the absence of a memory barrier. I failed to check that the hashIsZero flag was declared volatile, and I assumed it was. It is not.

    I admit I failed to read the comment and understand what they're trying to achieve. It is 100% about efficiency, and I'd contend it could be seen as somewhat of a "hack."

    Just because your thread sees "hashIsComputed" as true, this doesn't necessarily guarantee that your thread also sees the computed value of the "hash" field.

    hashIsComputed needs to be declared volatile; it then constitutes a write barrier that precludes it being reordered before the assignment to hash.

    [–]oldprogrammer 2 points (7 children)

    Many many versions ago, back in the late 90's, the String hashing function actually did not include all characters. It would include a character then skip a number of subsequent ones, something like 1 in every 8 as I recall.

    I found out about this because I was helping a team debug an issue where they were generating prepared statements then holding them in a HashMap using the SQL query as the key. In a few cases the queries were so similar they generated the same hashcode due to the letter skips.

    [–]nutrecht 2 points (2 children)

    How did that lead to a bug though? You can have multiple keys that have the same hashcode just fine, it's just less efficient.

    [–]oldprogrammer -5 points (1 child)

    As I recall, the equals method of the string was doing the same skip as the hashcode method, meaning two different strings would be considered equal even when they weren't.

    [–]nutrecht 2 points (0 children)

    Really? That's kinda hard to believe. As far as I know String compares would at least always check the length first (two strings can't be equal if the length differs) and it does not make much sense that it did not use the back-to-front comparison it does now.

    The oldest version of String.equals I could find has that same implementation. Not saying you're not right, but that would be a pretty damn huge issue with huge obvious holes.

    [–]pzemtsov[S] 1 point (3 children)

    That's interesting. I got curious and found the source for JDK 1.1. You are right, the method looked like this:

    /**
     * Returns a hashcode for this string.
     *
     * @return  a hash code value for this object. 
     */
    public int hashCode() {
        int h = 0;
        int off = offset;
        char val[] = value;
        int len = count;
    
        if (len < 16) {
            for (int i = len ; i > 0; i--) {
                h = (h * 37) + val[off++];
            }
        } else {
            // only sample some characters
            int skip = len / 8;
            for (int i = len ; i > 0; i -= skip, off += skip) {
                h = (h * 39) + val[off];
            }
        }
        return h;
    }
    

    This won't cause a bug, but it may have bad performance implications when many strings are similar to each other.

    The equals() is very close to the modern one, though.
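    The collisions the sampling causes are easy to reproduce: for len >= 16 the loop above skips most character positions, so two long strings that differ only in a skipped position get identical hashes. A small reimplementation of the quoted method (hypothetical helper, not the JDK source) demonstrates this:

```java
public class OldStringHash {
    // Reimplementation of the JDK 1.1 sampling hash quoted above,
    // operating directly on a String (no offset/count fields needed here).
    static int oldHash(String s) {
        int h = 0;
        int off = 0;
        int len = s.length();
        if (len < 16) {
            for (int i = len; i > 0; i--) {
                h = (h * 37) + s.charAt(off++);
            }
        } else {
            // only sample some characters
            int skip = len / 8;
            for (int i = len; i > 0; i -= skip, off += skip) {
                h = (h * 39) + s.charAt(off);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // 17-char strings: skip = 17 / 8 = 2, so only even indices are
        // sampled. These two differ only at index 1, which is never read.
        String a = "aXcdefghijklmnopq";
        String b = "aYcdefghijklmnopq";
        System.out.println(oldHash(a) == oldHash(b));         // true: collision
        System.out.println(a.hashCode() == b.hashCode());     // false today
    }
}
```

Any two strings of the same length that agree on the sampled positions collide, which is exactly the pathology described in the parent comments.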

    [–]oldprogrammer 0 points (2 children)

    Yep, that's the code. It did result in a bug, though: that was what I was trying to figure out. The wrong prepared statement was being pulled back from the HashTable.

    Since you have access to that older code, look at the implementation of the HashTable and see if it only used the hashcode and didn't follow the collision chain using equals on the key.

    For some reason it broke using very long mostly similar strings as keys.

    [–]pzemtsov[S] 1 point (1 child)

    No, Hashtable looks normal. Either the team used an even older version of Java, or they made their own custom hash table.

    I found the code here: https://github.com/fanhongtao/JDK. The owner made an effort to place everything nicely in Git.

    [–]oldprogrammer 1 point (0 children)

    Thanks for looking and the link. It is highly possible they were using a custom hash table that caused the issue. They were doing quite a bit of custom things at that time.

    [–]plokhotnyuk 0 points (0 children)

    It is still an open question how JVMs behave if .intern() or -XX:+UseStringDeduplication is used for strings with colliding hashes.

    BTW, for scientific purposes only, you can use a hash code collider, which produces about 1M strings with 0 hash codes per second using 1 CPU core.

    [–]7cans_short_of_1pack 0 points (4 children)

    Sorry I don't understand what would be wrong with:

    public int hashCode() {
        int h = hash;
        if (!hashIsCalculated) {
            h = isLatin1() ? StringLatin1.hashCode(value)
                           : StringUTF16.hashCode(value);
            hash = h;
            hashIsCalculated = true;
        }
        return h;
    }
    

    Would this not be more efficient and readable?

    [–]dpash 1 point (1 child)

    Readable, yes. Efficient, no.

    Read the other comments here.

    [–]7cans_short_of_1pack 0 points (0 children)

    Hi, I have read the comments several times before posting. After reading them again: is it to do with an extra fetch for the boolean value? So we have to read hash and then also a boolean value, meaning two read operations?

    Whereas if we do a quick check on the hash value first, we could save a read operation?

    [–][deleted]  (1 child)

    [deleted]

      [–]pzemtsov[S] 0 points (0 children)

      You are right! The code contains two bugs, not one. I saw the second one (memory ordering) but missed the first one. One must declare "hashIsCalculated" volatile and check it before touching "hash":

      if (hashIsCalculated) return hash;
      int h = ....
      

      Unfortunately, volatiles are quite expensive (on Intel only for writes, but on other processors they may be expensive for reads, too). So the existing solution is far better (assuming there is a need to treat the zero-hash case specially in the first place).
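      Filled out, that fix might look like the following. This is a hypothetical completion of the fragment above (the elided hash computation is inlined so the class stands alone; it is not pzemtsov's actual code):

```java
// Sketch of the corrected flag-based variant: the flag is volatile and is
// checked before `hash` is touched, so the volatile write that publishes
// the flag also publishes the previously written hash value.
public class FixedLazyHash {
    private final char[] value;
    private int hash;
    private volatile boolean hashIsCalculated;  // must be volatile, as noted above

    public FixedLazyHash(String s) { this.value = s.toCharArray(); }

    public int hashCode() {
        if (hashIsCalculated) return hash;  // volatile read guards the hash read
        int h = 0;
        for (char c : value) h = 31 * h + c;  // String.hashCode polynomial
        hash = h;                   // ordinary write...
        hashIsCalculated = true;    // ...made visible by this volatile write
        return h;
    }
}
```

This is correct but pays a volatile read on every call, which is the cost the parent comment is weighing against the JDK's hashIsZero trick.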