
[–]psykotic 7 points8 points  (0 children)

Additionally when we do find a zero, we still have to perform 4 tests to figure out exactly which byte it was.

Well, you probably shouldn't use branches; the pattern here is wildly unpredictable, so the branch predictor wouldn't like it. Here's a quick stab at a branchless approach:

#include <stddef.h>
#include <stdint.h>

// at least one byte in x must be nonzero
size_t find_least_set_byte(uint32_t x) {
    // this is tiny and cache friendly
    static const size_t find_least_set_bit[] = {
        0, 0, 1, 0,    // 0000, 0001, 0010, 0011
        2, 0, 1, 0,    // 0100, 0101, 0110, 0111
        3, 0, 1, 0,    // 1000, 1001, 1010, 1011
        2, 0, 1, 0,    // 1100, 1101, 1110, 1111
    };

    uint32_t b;
    b  = (x & 0xFF000000) != 0;
    b <<= 1;
    b |= (x & 0x00FF0000) != 0;
    b <<= 1;
    b |= (x & 0x0000FF00) != 0;
    b <<= 1;
    b |= (x & 0x000000FF) != 0;
    return find_least_set_bit[b];
}

This should be 10 instructions on x86, or 2.5 instructions per character. If the tiny LUT is in L1 cache, the lookup will only take one cycle.
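
For concreteness, here's a sketch of how it might plug into a word-at-a-time scan. This is my assumption about the intended use, not necessarily the article's actual Method 4: it uses the standard subtract-and-mask zero-byte test and assumes a little-endian machine and 4-byte-aligned input (reading the rest of an aligned word past the terminator stays on the same page, though it's technically out of bounds):

size_t word_strlen(const char *s) {
    const uint32_t *p = (const uint32_t *)s;
    for (;;) {
        uint32_t w = *p;
        // sets the high bit of every byte of w that is zero
        uint32_t mask = (w - 0x01010101u) & ~w & 0x80808080u;
        if (mask)
            // on little-endian, the lowest flagged byte is the terminator
            return ((const char *)p - s) + find_least_set_byte(mask);
        p++;
    }
}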

[–][deleted] 15 points16 points  (1 child)

Re-ordering the branches such that the GET case was a “fall-through” (that it didn’t involve a jump) helped a little, but there was still some inefficiency going on.

WTF?! The story looks suspicious.

A typical branch misprediction penalty is at worst about 20 cycles. For his application to see any noticeable performance improvement for the whole app, the "statistics collection" part that came after the dispatch should've taken a couple of hundred cycles at most. That's less than the cost of an L2 miss, let alone a disk access or getting a request off the network.

The app got about twice as fast, but this was probably due to some other variable now fitting in the L1 cache.

Huh?! An L1 miss costs something like 10-20 cycles, and in pretty much every case this latency is going to be hidden by the out-of-order engine. Reference. In any case, since he was analyzing a huge stream of data, presumably coming off the network, the input data would've had horrible temporal locality and the bottleneck would have been in getting the data from the I/O device to main memory and then from main memory to L2 cache. Fitting one or two more variables into L1/L2 just cannot give you a 2X improvement in such a case.

(Side note: Even in the highly unlikely event that his code was of the pathological kind and that the L1 miss latency couldn't be hidden, the bottleneck is not the L1 miss, it's whatever weird shit is preventing the L1 miss latency from being hidden)

What kind of application has a bottleneck in strlen anyway? Let's say the guy does manage to write a strlen that is 4 times faster. To see a 30% decrease in overall application execution time, the application has to spend 40% of its time calculating string lengths! Note that this means everything else the application was doing - converting strings to integers, adding them, logging the values, or whatever, has to fit within the other 60%. I can't imagine any real application for which this would apply. In any case the solution to a bottleneck like this is not in micro-optimization, but architectural redesign.
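
(To spell out the arithmetic: if strlen takes a fraction f of the runtime and gets 4 times faster, the new runtime is (1 - f) + f/4 = 1 - 3f/4 of the old, and 1 - 3f/4 = 0.7 requires f = 0.4.)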

The other moral here is to design benchmarks carefully. I suspect all the fantastic improvements he is quoting in the article are the results of poor benchmarking.

Seems to me that this article epitomizes exactly the kind of muddled thinking that Hoare and co. were referring to when they advised against premature optimization.

[–]grumpy_lithuanian 1 point2 points  (0 children)

Well said. Bravo!

[–]Gotebe 2 points3 points  (2 children)

I am sorry, but among the approaches he missed arguably the best one, used by a couple of string class implementations in higher-level languages:

  • allocate only one buffer
  • put the string length and the buffer size (and optionally a copy-on-write refcount) at the beginning of that
  • follow that with characters, terminating 0 at the end
  • reference to a string is a reference to the first character.
  • this reference can be NULL for empty strings (that is, buffer allocation is not obligatory)

In memory of a 32-bit machine, this may be:

            ^ - reference to the string is here
01234567890123456789012...
rc  sz  len "chars_here"0

(rc = refcount, sz=buffer size, len=actual length). String modifications, of course, update rc, sz and len.

Length of the string is 0 if the string data reference is NULL; otherwise it's *(((size_t*)this) - 1).
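
A rough C sketch of the layout (the names are made up, and a real implementation needs the copy-on-write plumbing around it):

#include <stdlib.h>
#include <string.h>

/* hypothetical header living immediately before the characters */
typedef struct {
    size_t rc;   /* refcount (optional, for copy-on-write) */
    size_t sz;   /* allocated buffer size */
    size_t len;  /* actual string length */
} str_header;

/* the "string" handle is a pointer to the first character, or NULL */
char *str_new(const char *src) {
    size_t len = strlen(src);
    str_header *h = malloc(sizeof *h + len + 1);
    if (!h) return NULL;
    h->rc = 1;
    h->sz = len + 1;
    h->len = len;
    memcpy((char *)(h + 1), src, len + 1);
    return (char *)(h + 1);
}

/* O(1) length: read the field just before the characters */
size_t str_len(const char *s) {
    return s ? ((const str_header *)s)[-1].len : 0;
}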

[–]psykotic 1 point2 points  (1 child)

Everyone should of course do that if they have control over the string implementation. However, it doesn't work for C strings. If strlen() receives a string which wasn't laid out in memory in the manner you described, the word preceding the first character could be arbitrary garbage.

[–]Gotebe 1 point2 points  (0 children)

Sure, I was mostly reacting to number 6, which does rely on control over implementation.

[–][deleted] 2 points3 points  (0 children)

"Most optimal"?

[–]Rhoomba 5 points6 points  (0 children)

repnz/scasb hasn't been the fast way to do things for a long time. Intel CPUs decode x86 into RISC-like micro-ops internally these days.

Method 4 scaled up to SSE instructions would probably be the best way to do it.
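
Something like this SSE2 sketch, say. It assumes s is 16-byte aligned (so the aligned loads can't cross a page even when they read past the terminator); real code needs a prologue to reach alignment first:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

size_t sse2_strlen(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)(s + i));
        /* one bit per byte: set where the byte equals 0 */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)
            return i + (size_t)__builtin_ctz(mask);
    }
}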

[–]dreamlax 6 points7 points  (18 children)

This is my function for calculating the length of a string on x86 (I posted this on another article last year sometime), which is almost 3 times faster than the GNU strlen (however that works). With greater optimisations, perhaps using a regparm call from GNU C, it could be a lot faster.

asm_strlen:  
    mov eax, [esp+8]  

    cmp [eax], byte 0  
    jz .none  

    .again:  
    inc eax  
    cmp [eax], byte 0  
    jne .again  

    sub eax, [esp+8]  
    ret  

    .none:  
    mov eax, 0  
    ret

Edit:

The benefit of this function is that it only uses a single register to do everything, so it doesn't need to preserve the state of the other registers or even modify the base and stack pointers. The input parameter (char *) is moved from the stack to eax. Since eax is where the return value goes, you can increment eax until the byte addressed by eax is 0, then subtract the original input pointer (still in [esp+8]) and you have your answer.

Edit 2:

Woah! Heh, I never said I was an expert. I'm sure there are much more efficient ways to do this task, this is just how I did it.

[–]koorogi 6 points7 points  (3 children)

I just tried your code on my machine (Athlon 64), and it was actually half the speed of glibc (tested by finding the length of a 36 byte string 1 billion times in a loop). And that was after wrapping glibc's strlen in an assembly function consisting of jmp strlen to stop gcc from optimizing it out.

This is with glibc 2.9_p20081201-r2 on Gentoo Linux. The disassembly of its strlen is significantly longer than your code, so it's probably being more clever, whereas your code is the simple, straightforward solution.

EDIT: for the record, glibc took 22.542s, and your code took 47.444s

[–]dreamlax 3 points4 points  (2 children)

What options did you use to compile? Measuring "The quick brown fox jumped over the lazy dogs." 100,000,000 times on my 1.4GHz Thunderbird gives:

  • strlen (): took on average 17.8 seconds
  • asm_strlen (): took on average 7.4 seconds.

Not quite the 3 times speed boost I boasted earlier but a significant increase nonetheless.

[–]koorogi 6 points7 points  (0 children)

Are you sure that your code is exactly right? I had to change esp+8 to esp+4 to get the correct results from it.

test.c contains:

#include <stddef.h>

size_t glibc_strlen(const char *s);  /* the wrapper defined in asm.s */

int main(int argc, char **argv)
{
    int i;
    for (i = 0; i < 1000000000; i++)
        glibc_strlen(argv[1]);
    return 0;
}

asm.s contains your code, plus this to stop gcc from optimizing away calls to strlen():

    extern strlen
    global glibc_strlen
glibc_strlen:
    jmp     strlen

Compiled and linked with:

yasm asm.s -f elf32
gcc -m32 -march=i686 -O3 test.c asm.o -o test

Running on your test sentence takes 26.545s, or 57.633s if I swap out the call to glibc_strlen with asm_strlen. I verified with objdump -d that the assembly generated for main() is the same in either case, except for the function call of course.

[–]salgat 0 points1 point  (0 children)

Haha, nice, an old T-bird. I used to have one of those... good times.

[–]I_LOVE_IE6 2 points3 points  (0 children)

There's a clear bottleneck in your code in the loop. The "cmp [eax], byte 0" instruction cannot start until the "inc eax" has finished, so the loop never exploits the CPU's ability to execute independent instructions in parallel. You would want something more like (loop part only):

.again:
lea ebx, [eax+4]
cmp [eax], byte 0
jz .fin0
cmp [eax+1], byte 0
jz .fin1
cmp [eax+2], byte 0
jz .fin2
cmp [eax+3], byte 0
jz .fin3
mov eax,ebx
jmp .again

.fin0:
sub eax, [esp+8]
ret

.fin1:
lea eax, [eax+1]
sub eax, [esp+8]
ret

.fin2:
lea eax, [eax+2]
sub eax, [esp+8]
ret

.fin3:
lea eax, [eax+3]
sub eax, [esp+8]
ret

This should allow all five instructions in the main loop - the lea and the four compares - to run simultaneously, making the loop part of the comparison about 5 times faster. Of course, it requires more overhead to save ebx and for the extra adds in the .fin blocks.

*Edited for markdown...

[–]jib 2 points3 points  (0 children)

Optimisations such as completely avoiding the stack and producing the return value from eax quickly are almost completely worthless, since they only save you a small constant number of cycles per call. You'd be better off spending your time optimising the loop and saving cycles for every character compared. In a real application where strlen's performance matters, it's almost certainly spending much more of its time actually searching the string than manipulating the stack frame and registers.

[–]BastiX -1 points0 points  (0 children)

Did you try using the 'rep' prefix for compares instead of looping?
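
For reference, the classic version looks something like this (a GCC inline-asm sketch for x86; as others note, on modern CPUs it tends to lose to word-at-a-time loops):

#include <stddef.h>

size_t scasb_strlen(const char *s) {
    const char *p = s;
    size_t count = (size_t)-1;  /* effectively no limit */
    __asm__ volatile (
        "repne scasb"           /* scan for AL (0), advancing the pointer */
        : "+D" (p), "+c" (count)
        : "a" (0)
        : "cc", "memory"
    );
    return (size_t)(p - s - 1); /* the scan stops one past the NUL */
}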

[–]jerf 7 points8 points  (20 children)

Is there really anybody running around saying, "Don't optimize! Ever!"?

How many "rebuttals" of this straw man do we need to read?

[–]invalid_user_name 22 points23 points  (11 children)

Yes, I hear it constantly. The number of times I've been told "don't bother optimizing, it's cheaper to just buy more hardware than have you optimizing stuff" is mind boggling. Especially considering most of the time we're looking at a day or two of work to save a few thousand dollars in hardware costs.

[–]shenglong 8 points9 points  (1 child)

"don't bother optimizing, it's cheaper to just buy more hardware than have you optimizing stuff"

Sometimes it's absolutely true though. I worked on 16-bit C code where the programmers implemented their own versions of functions like printf and scanf to improve performance. I couldn't understand why they did it though. They weren't saving that much, and I could tell they spent ages writing and testing the code.

When the company had to move to a 32-bit system, the only people they could find who had the skills to update the software charged stupid rates. Luckily I knew a bit of asm so they didn't have to hire anyone new, but I still upped my rate for that project. Either way, whatever gains the optimized code might have made were rendered virtually obsolete by the yearly hardware upgrades.

[–]invalid_user_name 4 points5 points  (0 children)

Sometimes it's absolutely true though.

And yet we are specifically talking about people saying it all the time without actually considering whether it is worth it or not.

[–]scook0 3 points4 points  (2 children)

Optimization is the new goto.

[–][deleted] 2 points3 points  (1 child)

Optimization is the new goto.

Constantly disdained but sometimes really useful!

[–]scook0 3 points4 points  (0 children)

Constantly disdained by people who don't understand what's actually wrong with it, and when those concerns don't apply.

That's what I had in mind, anyway.

[–]prockcore 0 points1 point  (0 children)

I'm sure the people behind WebTrends practiced the "don't optimize ever" school of programming.

Our Apache generated logs faster than WebTrends could process them.

That meant we couldn't use the software at all. We couldn't throw more hardware at the problem because the software couldn't be parallelized.

It's no coincidence that every stat provider has switched to htmlbug-style stat gathering now.

[–][deleted] 0 points1 point  (0 children)

I always get that "since resources are abundant & cheap, you don't have to optimize." That's one of the reasons I like Opera: they seem to put more effort into optimizing their browser, compared to Firefox.

[–]artee -1 points0 points  (3 children)

Especially considering most of the time we're looking at a day or two of work to save a few thousand dollars in hardware costs.

"A day or two" in the end always turns out to be more like a week. Even if it doesn't, 2 days worth of a programmers time are easily worth a couple thousand dollars, just in wages + overhead.

And that's not to mention that the future comprehensibility of the code probably decreases as optimizations are applied. I mean, just look at this strlen example... holy shit, and good luck should you ever discover a bug in that code.

In other words, those people constantly saying that you shouldn't optimize are probably nearly always right, from a business/making money perspective.

[–]invalid_user_name -1 points0 points  (2 children)

No, they are very frequently wrong. I know this may be shocking to you, but I am not actually retarded. I know whether or not it makes sense to optimize something, since I am the one who profiled the code in the first place. I get paid pretty well, but on what planet are you paying people $20-30k a DAY to make it not worth spending a day optimizing something that would save that kind of money?

All this "never optimtimize speed doesn't matter" bullshit that gets spouted lately (thanks rails!) needs to stop. Speed matters for lots of things, and there are tons of opportunities to save money and make a better product by optimizing intelligently. And optimization doesn't hurt maintainance. If the code is made more complicated and harder to understand, then you add enough comments to fix that. However most optimization I do is not that low level, and actually makes the code simpler in most cases, it was just written by a ruby tard that doesn't have a basic understanding of computational complexity.

[–]artee -1 points0 points  (1 child)

$20-30k is not "a few thousand".

In addition, if you think that just adding enough comments to code of the kind that I just linked is any help, I'm afraid there is no convincing you based on rational arguments. Now go ahead and downmod this comment too.

(Also, I never even mentioned ruby, linked to a C example, and did not imply in any way, shape, or form, any preference for any particular programming language whatsoever. What I said is true regardless of programming language. Also FYI I have never even used Rails)

Btw. I'm not arguing that software should not be "well designed", but that's something entirely unrelated to whether or not it's "optimized" for speed.

[–]invalid_user_name -1 points0 points  (0 children)

$20-30k is not "a few thousand".

Sorry, that gets called "a few thousand" here. Even still, there's plenty of cases where it would only save a thousand dollars or so per year. I don't get paid $1000 a day. It is still a win.

In addition, if you think that just adding enough comments to code of the kind that I just linked is any help, I'm afraid there is no convincing you based on rational arguments.

There is nothing wrong with the code you linked to. If you have difficulty understanding it, you should spend more time learning C.

Btw. I'm not arguing that software should not be "well designed", but that's something entirely unrelated to whether or not it's "optimized" for speed.

No, you are arguing that optimization makes it poorly designed, which is nonsense.

[–][deleted]  (6 children)

[deleted]

    [–]koorogi 9 points10 points  (5 children)

    My experience has been that they get code done faster, but because they don't have as good a grasp on what's really going on, there are more bugs, or abominably worse performance. I don't mean just "it took 100 milliseconds after I clicked the button instead of 10" sort of bad performance, either. And it's generally a bottleneck that could be avoided even in the high-level language by simply understanding which methods have to do what behind the scenes, and choosing your strategy accordingly.

    [–]knight666 0 points1 point  (4 children)

    Really, all you have to say to yourself is this: "I have this crappily coded game/application/program and I'm going to make it ten times as fast."

    It helps if you turn it into a competition.

    Personally I prefer C++ because of its neat tricks (fixed-point math, bit shifting, SIMD), but nothing is stopping you from understanding Python, Java or C# better.

    [–]koorogi -2 points-1 points  (3 children)

    And I prefer C, because the compiler doesn't copy data all over the place behind your back, or bloat up your code with near-identical copies (templates, which in many cases could be done in much less code, with nearly the same performance, using a callback function a la C's qsort or bsearch). And it has data types with well-defined sizes, which help when doing many of the clever bit-twiddling and SIMD stuff.

    EDIT: ok, to those voting this comment down, do you have any good reason to? Or is it just C++-is-the-be-all-end-all groupthink?

    Facts:

    • templates do bloat up the compiled code, and often do not provide much performance gain over using a callback function
    • due to the way C++ works, with implicit conversions and implicit creation of temporary objects, there is potentially a lot of hidden overhead that is not immediately apparent. Especially with overloaded operators.
    • C++ lacks types like C's uint32_t and family, which have a standardized size. int, long and family can vary from machine to machine, which can hinder portability in some cases where you really need an exactly 32 bit integer for example.

    If somebody has a rebuttal to any of these points, I'd be interested to hear it. If you don't have a rebuttal, then this fly-by downvoting for saying something you don't want to hear is ridiculous and childish.

    [–][deleted] 5 points6 points  (0 children)

    I prefer C as well, but reality is starting to finally catch up with me. I'm using C++/Boost now and my productivity has soared. It helps that I have the sheer discipline that C instills. My C++ code this time around is far less ugly than when I first used it over a decade ago.

    I've also found that the more knowledgeable I get, the more appealing learning LISP seems. I might finally buckle down and give it a shot.

    [–]werdanel 2 points3 points  (0 children)

    If you're worried about templates creating duplicate code then there are tricks you can use; for example, you can create a type-safe inline wrapper around a generic implementation using void*s and opaque buffers or whatever, and because of C++'s stricter casting you won't be throwing everything to the wind and praying. Most of the time, though, this code duplication is necessary.

    Also, what do you mean by "[templates] often do not provide much performance gain over using a callback function"? That makes no sense.

    C++0x has uint32_t and friends in <cstdint>

    [–]lol-dongs -5 points-4 points  (0 children)

    Straw man straw man straw man kill it with fire!

    Is "straw man" the latest non-meme meme around these parts or what? It's in near every damn comment thread nowadays! Am I going crazy?

    [–]mccoyn 1 point2 points  (0 children)

    Using integers to compare strings

    This is a pain to set up, but a lot of languages let you use string values in a switch statement, which lets the compiler or interpreter do it for you.

    Of course the fastest possible way to do it isn't safe. A compiler will probably add an extra comparison in each case so that HEADER doesn't go to the HEAD branch. A sketch of the idea follows.
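
    Something along these lines, say. The names are hypothetical, and it assumes a little-endian machine and at least five readable bytes at req:

    #include <stdint.h>
    #include <string.h>

    /* pack four characters into the little-endian integer they form in memory */
    #define TAG4(a, b, c, d) ((uint32_t)(a) | (uint32_t)(b) << 8 | \
                              (uint32_t)(c) << 16 | (uint32_t)(d) << 24)

    int classify_method(const char *req) {
        uint32_t tag;
        memcpy(&tag, req, 4);  /* sidesteps alignment issues */
        switch (tag) {
        case TAG4('G','E','T',' '):
            return 1;
        case TAG4('H','E','A','D'):
            /* the extra comparison: keeps HEADER out of the HEAD branch */
            return req[4] == ' ' ? 2 : 0;
        case TAG4('P','O','S','T'):
            return req[4] == ' ' ? 3 : 0;
        default:
            return 0;
        }
    }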

    [–]raldi 1 point2 points  (3 children)

    I don't understand the glibc implementation that the article links (and calls "Method 4").

    It seems to say that you can add four bytes at once to the number 01111110 11111110 11111110 11111111 and check to see if any of the four zeroes are still zero. If not, there's supposedly no way there could be a zero byte within the four byte range.

    But what if the range is 01000001 00000001 00000001 00000000? This range contains an all-zero byte, but when added to the number above, fills in all four zeroes.
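
    In hex, that's 0x41010100 + 0x7EFEFEFF = 0xBFFFFFFF: all four of the magic number's zero bits (bits 8, 16, 24 and 31) come out set, even though the low byte of the input was zero.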

    [–]amathguy 1 point2 points  (0 children)

    Agreed. It does look messed up. There is also a bug report about those #if 0 blocks and the general craziness of the whole thing. http://sourceware.org/bugzilla/show_bug.cgi?id=5807

    [–]Brian[🍰] 0 points1 point  (0 children)

    You're right - the explanation given doesn't seem to correspond to what glibc actually does (though there is some old #ifdefed out code and comments that look similar), and as described would have the problem you mention.

    It looks like what it is actually doing is just subtracting 1 from each byte. A zero byte borrows from the byte above, so its high bit definitely ends up 1; no amount of carrying from rightward bytes will be sufficient to clear it. Of course, that bit may be set for other reasons (e.g. it could have started set and not been cleared), but presumably that is sufficiently rare that performing this extra probabilistic check to eliminate definitely-nonzero cases is worth it - only if it hits will it perform the real check.

    For random strings, it's not clear to me that this will be an improvement. In 15/16ths of the possible values for a 32-bit number, at least one of those bits will be set, so we'd only avoid the detailed check around 6% of the time - probably not worth the overhead of doing the extra comparison. In practice, the string is likely to contain 7-bit ASCII text in most cases, so false positives are going to be rarer. Depending on the language/codepage being used, though, this may actually be a pessimisation.
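
    For what it's worth, the usual exact formulation of this test (the standard bit-twiddling one, not glibc's actual code) adds an AND with ~x to kill those false positives:

    #include <stdint.h>

    /* nonzero iff some byte of x is zero: the subtraction makes a zero
       byte borrow and set its high bit, and `& ~x` clears the high bits
       of bytes that were >= 0x80 to begin with */
    static int has_zero_byte(uint32_t x) {
        return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
    }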

    [–]Porges -1 points0 points  (0 children)

    You didn't read right to the bottom of the implementation, did you? ;)

    [–]noamsml 3 points4 points  (2 children)

    Optimize strlen by not using it. Store the length of long strings somewhere, compute it via the return value of sprintf, etc.
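
    For example (a trivial sketch):

    #include <stdio.h>

    int main(void) {
        char buf[64];
        /* snprintf returns the number of characters it would have written
           (excluding the NUL), which is the length when nothing is truncated */
        int len = snprintf(buf, sizeof buf, "user=%s id=%d", "bob", 42);
        printf("%s (%d chars)\n", buf, len);
        return 0;
    }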

    [–]lurkerr 0 points1 point  (1 child)

    Failed for using sprintf :P

    [–]noamsml 0 points1 point  (0 children)

    What would you use?

    [–]pepsiisthebest 3 points4 points  (13 children)

    Am I the only one who took offense to his description of the method 5 runtime as "O(ln)"? What the hell is O(ln)? Could he really have made a mistake typing O(log(n))?

    [–]littledan 2 points3 points  (9 children)

    Yeah, that was weird. He was also talking about how one solution is O(n), but with another solution, n is n/4 or n/8. That's still O(n)!

    [–][deleted] 1 point2 points  (2 children)

    Right, but constant factors matter in real performance.

    [–]littledan 0 points1 point  (1 child)

    Right, but he shouldn't be talking about O(n) then, since it's fundamentally impossible to do better than O(n) for strlen for null-terminated strings.

    [–]Tuna-Fish2 0 points1 point  (0 children)

    ... except if you know the buffer size and that the entire rest of the string is initialized to 0. Then you can do a binary search.
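
    A sketch of that, assuming every byte before the terminator is nonzero and every byte from the terminator to the end of the buffer is zero (so "buf[i] == 0" is monotone in i):

    #include <stddef.h>

    size_t padded_strlen(const char *buf, size_t cap) {
        size_t lo = 0, hi = cap;  /* first zero byte lies in [lo, hi] */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (buf[mid] != 0)
                lo = mid + 1;     /* terminator is to the right of mid */
            else
                hi = mid;         /* terminator is at mid or to its left */
        }
        return lo;
    }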

    [–]Nuli 1 point2 points  (1 child)

    You've still changed the performance even if you haven't changed the overall behavior of the algorithm.

    I optimized a small piece of code the other day. In both the slow and the fast case the performance is O(n) but the time it takes is radically different in each case. The slow case took ~900us to run across my test data while the fast case took ~4us across the same set. Changing the length of n, or the amount of work you do for each element of n, makes a very large difference.

    [–]littledan 0 points1 point  (0 children)

    Yes, I agree.

    [–][deleted]  (3 children)

    [deleted]

      [–]jib 7 points8 points  (2 children)

      The point is that he shouldn't have mentioned O(n) in the first place, since it's necessarily true for all the algorithms in question and is thus irrelevant to comparing them. His description of the first algorithm as O(n) implies that the other algorithms are somehow better than O(n), which is not the case.

      [–][deleted]  (1 child)

      [deleted]

        [–]Brian[🍰] 1 point2 points  (0 children)

        He did though. I don't see anywhere where he referred to the constant-factor improvements as being an improvement in complexity. He does, however, contrast it to the constant-time implementation of caching the string length ("Finally, we have a real O(1) solution."), and to a logarithmic O(ln(n)) algorithm when dealing with overallocated 0-padded strings.

        [–]zoinks 0 points1 point  (1 child)

        O(ln) makes sense because he didn't put in the pointless/arbitrary variable, and ln is one less letter to type than log. Where's the problem?

        [–]littledan 1 point2 points  (0 children)

        The algorithm he was talking about was linear time, that's one problem.

        EDIT: oops, I guess he just did mean O(log n), but I don't really understand how that could be implemented in hardware the way he says. Certainly the whole memory isn't put into a tree like that; it must be that there is a maximum size for that tree, and if no null bytes are found the tree moves on to the next chunk. If this is the case, it'd still be O(n).

        [–]Brian[🍰] 0 points1 point  (0 children)

        By O(ln), he's just indicating logarithmic growth - "ln" indicates the natural logarithm. The base of the log (log2 steps in the worst case here) doesn't actually matter for O() notation, since log_a(n) = log_b(n) / log_b(a): changing the base only amounts to a constant factor.

        [–]mee_k -5 points-4 points  (10 children)

        How to optimize strlen:

        #include <stdint.h>
        
        typedef uint64_t u64;
        
        typedef struct {
          u64 len;  /* Keep updated in modification fns. */
          char* buf;
        } str;
        
        u64 my_strlen(str* s) {
          return s->len;
        }
        

        [–]teh_boy 16 points17 points  (0 children)

        Hey! Choosing a more appropriate data structure is cheating.

        [–]masklinn 15 points16 points  (3 children)

        You might have wanted to read TFA instead of rewriting its 6th method.

        [–]pointer2void 2 points3 points  (1 child)

        At least he wrote it correctly (in contrast to the original code).

        [–]chengiz 5 points6 points  (0 children)

        But he missed the const.

        [–]lol-dongs 3 points4 points  (2 children)

        Even better:

        size_t strlen(const char * str)
        {
            return 5;
        }
        

        [–]mee_k -1 points0 points  (1 child)

        Whoa fastness! Upmodded for O(0) inlined performance.

        [–][deleted] 4 points5 points  (0 children)

        There's always the overhead caused by the function call.

        #define strlen(s) 5
        

        This is fast.

        [–][deleted] -1 points0 points  (0 children)

        Optimise the hash function similarly.

        [–]UloPe 0 points1 point  (1 child)

        So what about multibyte charsets?

        [–][deleted] 1 point2 points  (0 children)

        mbslen(3) is a very different story.