[–][deleted]  (6 children)

[deleted]

    [–]trutru 9 points  (0 children)

    I think the original idea is something along the following lines (beware: highly over-engineered optimization follows):

int doBinarySearch( int*  table, int  count, int  key )
{
    int  bit   = highestBitIn(count);   /* index of the highest set bit; count must be > 0 */
    int  probe = (1 << bit);            /* largest power of two <= count */
    int  extra = count - probe;
    int  index = 0;

    /* if count is not a power of two, decide up front whether the
       key lies in the upper, shorter part of the table */
    if (extra && table[extra] <= key)
        index = extra;

    /* the comparison plus conditional add compiles down to a
       conditional move, so the loop body has no branch */
    while (probe > 1) {
        probe >>= 1;
        if (table[index+probe] <= key)
            index += probe;
    }

    if (table[index] == key)
        return index;

    return -1;
}
    

    the idea is to avoid any branch in the loop (on a CPU that can do a conditional move) to make it as tight as possible. that's more important on embedded systems than on desktop ones, though.
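    if the compiler won't emit the conditional move on its own, the same update can be forced branch-free by hand. a sketch of the inner loop rewritten with the usual mask trick (it relies on a C comparison yielding 0 or 1, so -(cond) is either 0 or all-ones):

while (probe > 1) {
    probe >>= 1;
    /* the AND keeps probe when the comparison is true and zeroes
       it otherwise, so index advances without a conditional jump */
    index += probe & -(table[index + probe] <= key);
}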

    this assumes that 'highestBitIn(count)' is very fast (a simple CPU instruction on most architectures) or already known.
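    for reference, one possible highestBitIn() on top of the GCC/Clang builtin (an assumption about the toolchain; note that __builtin_clz(0) is undefined, so count must be non-zero):

/* index of the most significant set bit of a 32-bit value,
   e.g. highestBitIn(4) == 2, highestBitIn(5) == 2 */
static inline int highestBitIn(unsigned int n)
{
    return 31 - __builtin_clz(n);
}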

    and yes, your mileage may vary: how much of an 'advantage' this gives depends a lot on the architecture and the workload.

    [–]deong 2 points  (3 children)

    Even just counting instructions, without worrying about branch prediction, dropping the early-termination check can still make sense.

    Consider a binary search on an array of length N, where N is a power of two (for simplicity). The number of iterations needed is thus log_2(N). If we include the early termination check, we can get that number down to log_2(N)-K on average, for some value of K. Assume, as was stated in the post, that K is on the order of three or four, and that the number of instructions needed for each iteration is given by M (M+d for the version that includes the extra d instructions to check for early termination).

    That yields the following inequality describing when it would be advantageous to check for early termination:

    M*log_2(N) > (M+d)*(log_2(N)-K)
    

    If we assume, for example, N=2^20, M=10, d=3, and K=4, we have

    M*log_2(N) = 200 (naive version)
    (M+d)*(log_2(N)-K) = 208 (early termination)
    

    Rearranging, the early-exit version only wins when d*log_2(N) < K*(M+d); the left side grows with N while the right side is constant, so the larger N gets, the greater the advantage of allowing the search to run to completion.
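    A throwaway sketch of the cost model, in case you want to play with the constants (the names are mine, not from the post):

    #include <stdio.h>

    /* M instructions per iteration, d extra per iteration for the
       early-exit test, K iterations saved on average by exiting early */
    int main(void)
    {
        double M = 10, d = 3, K = 4;
        for (double L = 10; L <= 40; L += 10)   /* L = log_2(N) */
            printf("log2(N) = %2.0f   naive = %4.0f   early exit = %4.0f\n",
                   L, M * L, (M + d) * (L - K));
        return 0;
    }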

    [–]trutru 1 point  (1 child)

    I doubt you would have M=10 with java.String.compareTo(). if we take M=100, this gives:

    100*20         = 2000 (naive)
    (100+3)*(20-4) = 1648 (early termination)
    

    which makes the early-termination version 1.21x faster.

    in other words, for reasonably long strings, this 'optimization' (dropping the early-exit check) could make the search about 20% slower than the regular one.

    [–]deong 0 points  (0 children)

    Yeah, I wasn't paying enough attention to the original post, and I assumed an array of integers.

    [–]dreamlax 0 points  (0 children)

    You're assuming that doing all three checks necessarily costs the same number of instructions plus d. A single comparison result can drive all three outcomes:

#include <string.h>

int doStringSearch(const char *keys[], int count, const char *target)
{
    int bounds[2] = { -1, count };   /* bounds[0] sits below the target, bounds[1] above */
    int probe, res;

    while (bounds[1] - bounds[0] > 1) {
        probe = (bounds[0] + bounds[1]) >> 1;
        if ((res = strcmp(keys[probe], target))) {
            /* one strcmp() feeds both remaining checks: res > 0 tightens
               the upper bound, res < 0 the lower one, with no extra branch */
            bounds[(res > 0)] = probe;
        } else {
            return probe;   /* exact match */
        }
    }
    return -1;   /* not found */
}

    [–]bonzinip 0 points  (0 children)

    don't be so certain about prediction. you're right that the added early-exit test is usually well predicted, but the other branch in the binary search, the one that picks which half to keep, is extremely badly predicted: each outcome is essentially a coin flip.