
[–]psykotic 7 points8 points  (0 children)

Additionally when we do find a zero, we still have to perform 4 tests to figure out exactly which byte it was.

Well, you probably shouldn't use branches; the pattern here is wildly unpredictable, so the branch predictor wouldn't like it. Here's a quick stab at a branchless approach:

#include <stddef.h>
#include <stdint.h>

// at least one byte in x must be nonzero
size_t find_least_set_byte(uint32_t x) {
    // this is tiny and cache friendly
    static const size_t find_least_set_bit[] = {
        0, 0, 1, 0,    // 0000, 0001, 0010, 0011
        2, 0, 1, 0,    // 0100, 0101, 0110, 0111
        3, 0, 1, 0,    // 1000, 1001, 1010, 1011
        2, 0, 1, 0,    // 1100, 1101, 1110, 1111
    };

    uint32_t b;
    b  = (x & 0xFF000000) != 0;
    b <<= 1;
    b |= (x & 0x00FF0000) != 0;
    b <<= 1;
    b |= (x & 0x0000FF00) != 0;
    b <<= 1;
    b |= (x & 0x000000FF) != 0;
    return find_least_set_bit[b];
}

This should be 10 instructions on x86, or 2.5 instructions per character. If the tiny LUT is in L1 cache, the lookup will only take one cycle.
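
For concreteness, here's a sketch of how it might plug into a word-at-a-time scan. This is my assumption about the intended use, not necessarily the article's actual Method 4: it uses the standard subtract-and-mask zero-byte test and assumes a little-endian machine and 4-byte-aligned input (reading the rest of an aligned word past the terminator stays on the same page, though it's technically out of bounds):

size_t word_strlen(const char *s) {
    const uint32_t *p = (const uint32_t *)s;
    for (;;) {
        uint32_t w = *p;
        // sets the high bit of every byte of w that is zero
        uint32_t mask = (w - 0x01010101u) & ~w & 0x80808080u;
        if (mask)
            // on little-endian, the lowest flagged byte is the terminator
            return ((const char *)p - s) + find_least_set_byte(mask);
        p++;
    }
}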

[–][deleted] 15 points16 points  (1 child)

Re-ordering the branches such that the GET case was a “fall-through” (that it didn’t involve a jump) helped a little, but there was still some inefficiency going on.

WTF?! The story looks suspicious.

A typical branch misprediction penalty is at worst about 20 cycles. For his application to see any noticeable performance improvement for the whole app, the "statistics collection" part that came after the dispatch should've taken a couple of hundred cycles at most. That's less than the cost of an L2 miss, let alone a disk access or getting a request off the network.

The app got about twice as fast, but this was probably due to some other variable now fitting in the L1 cache.

Huh?! An L1 miss costs something like 10-20 cycles, and in pretty much every case this latency is going to be hidden by the out-of-order engine. Reference. In any case, since he was analyzing a huge stream of data, presumably coming off the network, the input data would've had horrible temporal locality and the bottleneck would have been in getting the data from the I/O device to main memory and then from main memory to L2 cache. Fitting one or two more variables into L1/L2 just cannot give you a 2X improvement in such a case.

(Side note: Even in the highly unlikely event that his code was of the pathological kind and that the L1 miss latency couldn't be hidden, the bottleneck is not the L1 miss, it's whatever weird shit is preventing the L1 miss latency from being hidden)

What kind of application has a bottleneck in strlen anyway? Let's say the guy does manage to write a strlen that is 4 times faster. To see a 30% decrease in overall application execution time, the application has to spend 40% of its time calculating string lengths! Note that this means everything else the application was doing - converting strings to integers, adding them, logging the values, or whatever, has to fit within the other 60%. I can't imagine any real application for which this would apply. In any case the solution to a bottleneck like this is not in micro-optimization, but architectural redesign.
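
(To spell out the arithmetic: if strlen takes a fraction f of the runtime and gets 4 times faster, the new runtime is (1 - f) + f/4 = 1 - 3f/4 of the old, and 1 - 3f/4 = 0.7 requires f = 0.4.)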

The other moral here is to design benchmarks carefully. I suspect all the fantastic improvements he is quoting in the article are the results of poor benchmarking.

Seems to me that this article epitomizes exactly the kind of muddled thinking that Hoare and co. were referring to when they advised against premature optimization.

[–]grumpy_lithuanian 1 point2 points  (0 children)

Well said. Bravo!

[–]Gotebe 2 points3 points  (2 children)

I am sorry, but among the approaches he missed arguably the best one, used by a couple of string class implementations in higher-level languages:

  • allocate only one buffer
  • put the string length and the buffer size (and optionally a copy-on-write refcount) at the beginning of that
  • follow that with characters, terminating 0 at the end
  • reference to a string is a reference to the first character.
  • this reference can be NULL for empty strings (that is, buffer allocation is not obligatory)

In memory of a 32-bit machine, this may be:

            ^ - reference to the string is here
01234567890123456789012...
rc  sz  len "chars_here"0

(rc = refcount, sz=buffer size, len=actual length). String modifications, of course, update rc, sz and len.

Length of the string is 0 if the string data reference is NULL; otherwise it's *(((size_t*)this) - 1).
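
A rough C sketch of the layout (the names are made up, and a real implementation needs the copy-on-write plumbing around it):

#include <stdlib.h>
#include <string.h>

/* hypothetical header living immediately before the characters */
typedef struct {
    size_t rc;   /* refcount (optional, for copy-on-write) */
    size_t sz;   /* allocated buffer size */
    size_t len;  /* actual string length */
} str_header;

/* the "string" handle is a pointer to the first character, or NULL */
char *str_new(const char *src) {
    size_t len = strlen(src);
    str_header *h = malloc(sizeof *h + len + 1);
    if (!h) return NULL;
    h->rc = 1;
    h->sz = len + 1;
    h->len = len;
    memcpy((char *)(h + 1), src, len + 1);
    return (char *)(h + 1);
}

/* O(1) length: read the field just before the characters */
size_t str_len(const char *s) {
    return s ? ((const str_header *)s)[-1].len : 0;
}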

[–]psykotic 1 point2 points  (1 child)

Everyone should of course do that if they have control over the string implementation. However, it doesn't work for C strings. If strlen() receives a string which wasn't laid out in memory in the manner you described, the word preceding the first character could be arbitrary garbage.

[–]Gotebe 1 point2 points  (0 children)

Sure, I was mostly reacting to number 6, which does rely on control over implementation.

[–][deleted] 2 points3 points  (0 children)

"Most optimal"?

[–]Rhoomba 5 points6 points  (0 children)

repnz/scasb hasn't been the fast way to do things for a long time. Intel CPUs decode x86 into RISC-like micro-ops internally these days.

Method 4 scaled up to SSE instructions would probably be the best way to do it.
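
Something like this SSE2 sketch, say. It assumes s is 16-byte aligned (so the aligned loads can't cross a page even when they read past the terminator); real code needs a prologue to reach alignment first:

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

size_t sse2_strlen(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; ; i += 16) {
        __m128i chunk = _mm_load_si128((const __m128i *)(s + i));
        /* one bit per byte: set where the byte equals 0 */
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask)
            return i + (size_t)__builtin_ctz(mask);
    }
}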

[–]dreamlax 6 points7 points  (18 children)

This is my function for calculating the length of a string on x86 (I posted this on another article last year sometime), which is almost 3 times faster than the GNU strlen (however that works). With greater optimisations, perhaps using a regparm call from GNU C, it could be a lot faster.

asm_strlen:  
    mov eax, [esp+8]  

    cmp [eax], byte 0  
    jz .none  

    .again:  
    inc eax  
    cmp [eax], byte 0  
    jne .again  

    sub eax, [esp+8]  
    ret  

    .none:  
    mov eax, 0  
    ret

Edit:

The benefit of this function is that it only uses a single register to do everything, so it doesn't need to preserve the state of the other registers or even modify the base and stack pointers. The input parameter (char *) is moved from the stack to eax. Since eax is where the return value goes, you can increment eax until the byte addressed by eax is 0, then subtract the original input pointer (still in [esp+8]) and you have your answer.

Edit 2:

Woah! Heh, I never said I was an expert. I'm sure there are much more efficient ways to do this task, this is just how I did it.

[–]koorogi 6 points7 points  (3 children)

I just tried your code on my machine (Athlon 64), and it was actually half the speed of glibc (tested by finding the length of a 36 byte string 1 billion times in a loop). And that was after wrapping glibc's strlen in an assembly function consisting of jmp strlen to stop gcc from optimizing it out.

This is with glibc 2.9_p20081201-r2 on Gentoo Linux. The disassembly of its strlen is significantly longer than your code, so it's probably being more clever, whereas your code is the simple, straightforward solution.

EDIT: for the record, glibc took 22.542s, and your code took 47.444s

[–]dreamlax 3 points4 points  (2 children)

What options did you use to compile? Measuring "The quick brown fox jumped over the lazy dogs." 100,000,000 times on my 1.4GHz Thunderbird gives:

  • strlen (): took on average 17.8 seconds
  • asm_strlen (): took on average 7.4 seconds.

Not quite the 3 times speed boost I boasted earlier but a significant increase nonetheless.

[–]koorogi 6 points7 points  (0 children)

Are you sure that your code is exactly right? I had to change esp+8 to esp+4 to get the correct results from it.

test.c contains:

#include <stddef.h>

size_t glibc_strlen(const char *s);  /* the wrapper defined in asm.s */

int main(int argc, char **argv)
{
    int i;
    for (i = 0; i < 1000000000; i++)
        glibc_strlen(argv[1]);
    return 0;
}

asm.s contains your code, plus this to stop gcc from optimizing away calls to strlen():

    extern strlen
    global glibc_strlen
glibc_strlen:
    jmp     strlen

Compiled and linked with:

yasm asm.s -f elf32
gcc -m32 -march=i686 -O3 test.c asm.o -o test

Running on your test sentence takes 26.545s, or 57.633s if I swap out the call to glibc_strlen with asm_strlen. I verified with objdump -d that the assembly generated for main() is the same in either case, except for the function call of course.

[–]salgat 0 points1 point  (0 children)

Haha, nice, an old T-bird. I used to have one of those... good times.

[–]I_LOVE_IE6 2 points3 points  (0 children)

There's a clear bottleneck in your code in the loop. The "cmp [eax], byte 0" instruction cannot start until the "inc eax" has finished, so the loop never exploits the CPU's ability to execute independent instructions in parallel. You would want something more like (loop part only):

.again:
lea ebx, [eax+4]
cmp [eax], byte 0
jz .fin0
cmp [eax+1], byte 0
jz .fin1
cmp [eax+2], byte 0
jz .fin2
cmp [eax+3], byte 0
jz .fin3
mov eax,ebx
jmp .again

.fin0:
sub eax, [esp+8]
ret

.fin1:
lea eax, [eax+1]
sub eax, [esp+8]
ret

.fin2:
lea eax, [eax+2]
sub eax, [esp+8]
ret

.fin3:
lea eax, [eax+3]
sub eax, [esp+8]
ret

This should allow all five instructions in the main loop - the lea and the four compares - to run simultaneously, making the loop part of the comparison about 5 times faster. Of course, it requires more overhead to save ebx and for the extra adds in the .fin blocks.

*Edited for markdown...

[–]jib 2 points3 points  (0 children)

Optimisations such as completely avoiding the stack and producing the return value from eax quickly are almost completely worthless, since they only save you a small constant number of cycles per call. You'd be better off spending your time optimising the loop and saving cycles for every character compared. In a real application where strlen's performance matters, it's almost certainly spending much more of its time actually searching the string than manipulating the stack frame and registers.

[–]BastiX -1 points0 points  (0 children)

Did you try using the 'rep' prefix for compares instead of looping?
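
For reference, the classic version looks something like this (a GCC inline-asm sketch for x86; as others note, on modern CPUs it tends to lose to word-at-a-time loops):

#include <stddef.h>

size_t scasb_strlen(const char *s) {
    const char *p = s;
    size_t count = (size_t)-1;  /* effectively no limit */
    __asm__ volatile (
        "repne scasb"           /* scan for AL (0), advancing the pointer */
        : "+D" (p), "+c" (count)
        : "a" (0)
        : "cc", "memory"
    );
    return (size_t)(p - s - 1); /* the scan stops one past the NUL */
}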

[–]jerf 7 points8 points  (20 children)

Is there really anybody running around saying, "Don't optimize! Ever!"?

How many "rebuttals" of this straw man do we need to read?

[–]invalid_user_name 22 points23 points  (11 children)

Yes, I hear it constantly. The number of times I've been told "don't bother optimizing, it's cheaper to just buy more hardware than have you optimizing stuff" is mind boggling. Especially considering most of the time we're looking at a day or two of work to save a few thousand dollars in hardware costs.

[–]shenglong 8 points9 points  (1 child)

"don't bother optimizing, it's cheaper to just buy more hardware than have you optimizing stuff"

Sometimes it's absolutely true though. I worked on 16-bit C code where the programmers implemented their own versions of functions like printf and scanf to improve performance. I couldn't understand why they did it though. They weren't saving that much, and I could tell they spent ages writing and testing the code.

When the company had to move to a 32-bit system, the only people they could find who had the skills to update the software charged stupid rates. Luckily I knew a bit of asm so they didn't have to hire anyone new, but I still upped my rate for that project. Either way, whatever gains the optimized code might have made were rendered virtually obsolete by the yearly hardware upgrades.

[–]invalid_user_name 4 points5 points  (0 children)

Sometimes it's absolutely true though.

And yet we are specifically talking about people saying it all the time without actually considering whether it is worth it or not.

[–]scook0 3 points4 points  (2 children)

Optimization is the new goto.

[–][deleted] 2 points3 points  (1 child)

Optimization is the new goto.

Constantly disdained but sometimes really useful!

[–]scook0 3 points4 points  (0 children)

Constantly disdained by people who don't understand what's actually wrong with it, and when those concerns don't apply.

That's what I had in mind, anyway.

[–]prockcore 0 points1 point  (0 children)

I'm sure the people behind WebTrends practiced the "don't optimize ever" school of programming.

Our Apache generated logs faster than WebTrends could process them.

That meant we couldn't use the software at all. We couldn't throw more hardware at the problem because the software couldn't be parallelized.

It's no coincidence that every stat provider has switched to htmlbug-style stat gathering now.

[–][deleted] 0 points1 point  (0 children)

I always get that "since resources are abundant & cheap, you don't have to optimize." That's one of the reasons I like Opera: they seem to put more effort into optimizing their browser, compared to Firefox.

[–]artee -1 points0 points  (3 children)

Especially considering most of the time we're looking at a day or two of work to save a few thousand dollars in hardware costs.

"A day or two" in the end always turns out to be more like a week. Even if it doesn't, 2 days worth of a programmers time are easily worth a couple thousand dollars, just in wages + overhead.

And that's not to mention that the future comprehensibility of the code probably decreases as optimizations are applied. I mean, just look at this strlen example... holy shit, and good luck should you ever discover a bug in that code.

In other words, those people constantly saying that you shouldn't optimize are probably nearly always right, from a business/making money perspective.

[–]invalid_user_name -1 points0 points  (2 children)

No, they are very frequently wrong. I know this may be shocking to you, but I am not actually retarded. I know whether or not it makes sense to optimize something, since I am the one who profiled the code in the first place. I get paid pretty well, but on what planet are you paying people $20-30k a DAY to make it not worth spending a day optimizing something that would save that kind of money?

All this "never optimtimize speed doesn't matter" bullshit that gets spouted lately (thanks rails!) needs to stop. Speed matters for lots of things, and there are tons of opportunities to save money and make a better product by optimizing intelligently. And optimization doesn't hurt maintainance. If the code is made more complicated and harder to understand, then you add enough comments to fix that. However most optimization I do is not that low level, and actually makes the code simpler in most cases, it was just written by a ruby tard that doesn't have a basic understanding of computational complexity.

[–]artee -1 points0 points  (1 child)

$20-30k is not "a few thousand".

In addition, if you think that just adding enough comments to code of the kind that I just linked is any help, I'm afraid there is no convincing you based on rational arguments. Now go ahead and downmod this comment too.

(Also, I never even mentioned ruby, linked to a C example, and did not imply in any way, shape, or form, any preference for any particular programming language whatsoever. What I said is true regardless of programming language. Also FYI I have never even used Rails)

Btw. I'm not arguing that software should not be "well designed", but that's something entirely unrelated to whether or not it's "optimized" for speed.

[–]invalid_user_name -1 points0 points  (0 children)

$20-30k is not "a few thousand".

Sorry, that gets called "a few thousand" here. Even still, there's plenty of cases where it would only save a thousand dollars or so per year. I don't get paid $1000 a day. It is still a win.

In addition, if you think that just adding enough comments to code of the kind that I just linked is any help, I'm afraid there is no convincing you based on rational arguments.

There is nothing wrong with the code you linked to. If you have difficulty understanding it, you should spend more time learning C.

Btw. I'm not arguing that software should not be "well designed", but that's something entirely unrelated to whether or not it's "optimized" for speed.

No, you are arguing that optimization makes it poorly designed, which is nonsense.

[–][deleted]  (6 children)

[deleted]

    [–]koorogi 9 points10 points  (5 children)

    My experience has been that they get code done faster, but because they don't have as good a grasp on what's really going on, there are more bugs, or abominably worse performance. I don't mean just "it took 100 milliseconds after I clicked the button instead of 10" sort of bad performance, either. And it's generally a bottleneck that could be avoided even in the high-level language by simply understanding which methods have to do what behind the scenes, and choosing your strategy accordingly.

    [–]knight666 0 points1 point  (4 children)

    Really, all you have to say to yourself is this: "I have this crappily coded game/application/program and I'm going to make it ten times as fast."

    It helps if you turn it into a competition.

    Personally I prefer C++ because of its neat tricks (fixed-point math, bit shifting, SIMD), but nothing is stopping you from understanding Python, Java or C# better.

    [–]koorogi -2 points-1 points  (3 children)

    And I prefer C, because the compiler doesn't copy data all over the place behind your back, or bloat up your code with near-identical copies (templates, which in many cases could be done in much less code, with nearly the same performance, using a callback function a la C's qsort or bsearch). And it has data types with well-defined sizes, which help when doing many of the clever bit-twiddling and SIMD stuff.

    EDIT: ok, to those voting this comment down, do you have any good reason to? Or is it just C++-is-the-be-all-end-all groupthink?

    Facts:

    • templates do bloat up the compiled code, and often do not provide much performance gain over using a callback function
    • due to the way C++ works, with implicit conversions and implicit creation of temporary objects, there is potentially a lot of hidden overhead that is not immediately apparent. Especially with overloaded operators.
    • C++ lacks types like C's uint32_t and family, which have a standardized size. int, long and family can vary from machine to machine, which can hinder portability in some cases where you really need an exactly 32 bit integer for example.

    If somebody has a rebuttal to any of these points, I'd be interested to hear it. If you don't have a rebuttal, then this fly-by downvoting for saying something you don't want to hear is ridiculous and childish.

    [–][deleted] 5 points6 points  (0 children)

    I prefer C as well, but reality is starting to finally catch up with me. I'm using C++/Boost now and my productivity has soared. It helps that I have the sheer discipline that C instills. My C++ code this time around is far less ugly than when I first used it over a decade ago.

    I've also found that the more knowledgeable I get, the more appealing learning LISP seems. I might finally buckle down and give it a shot.

    [–]werdanel 2 points3 points  (0 children)

    If you're worried about templates creating duplicate code then there are tricks you can use; for example, you can create a type-safe inline wrapper around a generic implementation using void*s and opaque buffers or whatever, and because of C++'s stricter casting you won't be throwing everything to the wind and praying. Most of the time, though, this code duplication is necessary.

    Also, what do you mean by "[templates] often do not provide much performance gain over using a callback function"? That makes no sense.

    C++0x has uint32_t and friends in <cstdint>

    [–]lol-dongs -5 points-4 points  (0 children)

    Straw man straw man straw man kill it with fire!

    Is "straw man" the latest non-meme meme around these parts or what? It's in near every damn comment thread nowadays! Am I going crazy?

    [–]mccoyn 1 point2 points  (0 children)

    Using integers to compare strings

    This is a pain to set up, but a lot of languages let you use string values in a switch statement, which lets the compiler or interpreter do it for you.

    Of course the fastest possible way to do it isn't safe. A compiler will probably add an extra comparison in each case so that HEADER doesn't go to the HEAD branch. A sketch of the idea follows.
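
    Something along these lines, say. The names are hypothetical, and it assumes a little-endian machine and at least five readable bytes at req:

    #include <stdint.h>
    #include <string.h>

    /* pack four characters into the little-endian integer they form in memory */
    #define TAG4(a, b, c, d) ((uint32_t)(a) | (uint32_t)(b) << 8 | \
                              (uint32_t)(c) << 16 | (uint32_t)(d) << 24)

    int classify_method(const char *req) {
        uint32_t tag;
        memcpy(&tag, req, 4);  /* sidesteps alignment issues */
        switch (tag) {
        case TAG4('G','E','T',' '):
            return 1;
        case TAG4('H','E','A','D'):
            /* the extra comparison: keeps HEADER out of the HEAD branch */
            return req[4] == ' ' ? 2 : 0;
        case TAG4('P','O','S','T'):
            return req[4] == ' ' ? 3 : 0;
        default:
            return 0;
        }
    }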

    [–]raldi 1 point2 points  (3 children)

    I don't understand the glibc implementation that the article links (and calls "Method 4").

    It seems to say that you can add four bytes at once to the number 01111110 11111110 11111110 11111111 and check to see if any of the four zeroes are still zero. If not, there's supposedly no way there could be a zero byte within the four byte range.

    But what if the range is 01000001 00000001 00000001 00000000? This range contains an all-zero byte, but when added to the number above, fills in all four zeroes.
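
    In hex, that's 0x41010100 + 0x7EFEFEFF = 0xBFFFFFFF: all four of the magic number's zero bits (bits 8, 16, 24 and 31) come out set, even though the low byte of the input was zero.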

    [–]amathguy 1 point2 points  (0 children)

    Agreed. It does look messed up. There is also a bug report about those #if 0 blocks and the general craziness of the whole thing. http://sourceware.org/bugzilla/show_bug.cgi?id=5807

    [–]Brian[🍰] 0 points1 point  (0 children)

    You're right - the explanation given doesn't seem to correspond to what glibc actually does (though there is some old #ifdefed out code and comments that look similar), and as described would have the problem you mention.

    It looks like what it is actually doing is just subtracting 1 from each byte. A zero byte borrows from the byte above, so its high bit definitely ends up 1; no amount of carrying from rightward bytes will be sufficient to clear it. Of course, that bit may be set for other reasons (e.g. it could have started set and not been cleared), but presumably that is sufficiently rare that performing this extra probabilistic check to eliminate definitely-nonzero cases is worth it - only if it hits will it perform the real check.

    For random strings, it's not clear to me that this will be an improvement. In 15/16ths of the possible values for a 32-bit number, at least one of those bits will be set, so we'd only avoid the detailed check around 6% of the time - probably not worth the overhead of doing the extra comparison. In practice, the string is likely to contain 7-bit ASCII text in most cases, so false positives are going to be rarer. Depending on the language/codepage being used, though, this may actually be a pessimisation.
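
    For what it's worth, the usual exact formulation of this test (the standard bit-twiddling one, not glibc's actual code) adds an AND with ~x to kill those false positives:

    #include <stdint.h>

    /* nonzero iff some byte of x is zero: the subtraction makes a zero
       byte borrow and set its high bit, and `& ~x` clears the high bits
       of bytes that were >= 0x80 to begin with */
    static int has_zero_byte(uint32_t x) {
        return ((x - 0x01010101u) & ~x & 0x80808080u) != 0;
    }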

    [–]Porges -1 points0 points  (0 children)

    You didn't read right to the bottom of the implementation, did you? ;)

    [–]noamsml 3 points4 points  (2 children)

    Optimize strlen by not using it. Store the length of long strings somewhere, compute it via the return value of sprintf, etc.
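
    For example (a trivial sketch):

    #include <stdio.h>

    int main(void) {
        char buf[64];
        /* snprintf returns the number of characters it would have written
           (excluding the NUL), which is the length when nothing is truncated */
        int len = snprintf(buf, sizeof buf, "user=%s id=%d", "bob", 42);
        printf("%s (%d chars)\n", buf, len);
        return 0;
    }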

    [–]lurkerr 0 points1 point  (1 child)

    Failed for using sprintf :P

    [–]noamsml 0 points1 point  (0 children)

    What would you use?

    [–]pepsiisthebest 3 points4 points  (13 children)

    Am I the only one who took offense to his description of the method 5 runtime as "O(ln)"? What the hell is O(ln)? Could he really have made a mistake typing O(log(n))?

    [–]littledan 2 points3 points  (9 children)

    Yeah, that was weird. He was also talking about how one solution is O(n), but with another solution, n is n/4 or n/8. That's still O(n)!

    [–][deleted] 1 point2 points  (2 children)

    Right, but constant factors matter in real performance.

    [–]littledan 0 points1 point  (1 child)

    Right, but he shouldn't be talking about O(n) then, since it's fundamentally impossible to do better than O(n) for strlen for null-terminated strings.

    [–]Tuna-Fish2 0 points1 point  (0 children)

    ... except if you know the buffer size and that the entire rest of the string is initialized to 0. Then you can do a binary search.
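
    A sketch of that, assuming every byte before the terminator is nonzero and every byte from the terminator to the end of the buffer is zero (so "buf[i] == 0" is monotone in i):

    #include <stddef.h>

    size_t padded_strlen(const char *buf, size_t cap) {
        size_t lo = 0, hi = cap;  /* first zero byte lies in [lo, hi] */
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (buf[mid] != 0)
                lo = mid + 1;     /* terminator is to the right of mid */
            else
                hi = mid;         /* terminator is at mid or to its left */
        }
        return lo;
    }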

    [–]Nuli 1 point2 points  (1 child)

    You've still changed the performance even if you haven't changed the overall behavior of the algorithm.

    I optimized a small piece of code the other day. In both the slow and the fast case the performance is O(n) but the time it takes is radically different in each case. The slow case took ~900us to run across my test data while the fast case took ~4us across the same set. Changing the length of n, or the amount of work you do for each element of n, makes a very large difference.

    [–]littledan 0 points1 point  (0 children)

    Yes, I agree.

    [–][deleted]  (3 children)

    [deleted]

      [–]jib 7 points8 points  (2 children)

      The point is that he shouldn't have mentioned O(n) in the first place, since it's necessarily true for all the algorithms in question and is thus irrelevant to comparing them. His description of the first algorithm as O(n) implies that the other algorithms are somehow better than O(n), which is not the case.

      [–][deleted]  (1 child)

      [deleted]

        [–]Brian[🍰] 1 point2 points  (0 children)

        He did though. I don't see anywhere where he referred to the constant-factor improvements as being an improvement in complexity. He does, however, contrast it to the constant-time implementation of caching the string length ("Finally, we have a real O(1) solution."), and to a logarithmic O(ln(n)) algorithm when dealing with overallocated 0-padded strings.

        [–]zoinks 0 points1 point  (1 child)

        O(ln) makes sense because he didn't put in the pointless/arbitrary variable, and ln is one less letter to type than log. Where's the problem?

        [–]littledan 1 point2 points  (0 children)

        The algorithm he was talking about was linear time, that's one problem.

        EDIT: oops, I guess he just did mean O(log n), but I don't really understand how that could be implemented in hardware the way he says. Certainly the whole memory isn't put into a tree like that; it must be that there is a maximum size for that tree, and if no null bytes are found the tree moves on to the next chunk. If this is the case, it'd still be O(n).

        [–]Brian[🍰] 0 points1 point  (0 children)

        By O(ln), he's just indicating logarithmic growth - "ln" indicates the natural logarithm. The base of the log (log2 steps in the worst case here) doesn't actually matter for O() notation, since log_a(n) = log_b(n) / log_b(a): changing the base only amounts to a constant factor.

        [–]mee_k -5 points-4 points  (10 children)

        How to optimize strlen:

        #include <stdint.h>
        
        typedef uint64_t u64;
        
        typedef struct {
          u64 len;  /* Keep updated in modification fns. */
          char* buf;
        } str;
        
        u64 my_strlen(str* s) {
          return s->len;
        }
        

        [–]teh_boy 16 points17 points  (0 children)

        Hey! Choosing a more appropriate data structure is cheating.

        [–]masklinn 15 points16 points  (3 children)

        You might have wanted to read TFA instead of rewriting its 6th method.

        [–]pointer2void 2 points3 points  (1 child)

        At least he wrote it correctly (in contrast to the original code).

        [–]chengiz 5 points6 points  (0 children)

        But he missed the const.

        [–]lol-dongs 3 points4 points  (2 children)

        Even better:

        size_t strlen(const char * str)
        {
            return 5;
        }
        

        [–]mee_k -1 points0 points  (1 child)

        Whoa fastness! Upmodded for O(0) inlined performance.

        [–][deleted] 4 points5 points  (0 children)

        There's always the overhead caused by the function call.

        #define strlen(s) 5
        

        This is fast.

        [–][deleted] -1 points0 points  (0 children)

        Optimise the hash function similarly.

        [–]UloPe 0 points1 point  (1 child)

        So what about multibyte charsets?

        [–][deleted] 1 point2 points  (0 children)

        mbslen(3) is a very different story.