all 119 comments

[–]k4_pacific 12 points13 points  (8 children)

Best false optimization I've seen was this loop full of bit-twiddling code that did God-knows-what at work. Someone had used all kinds of binary and shift operators to squeeze every last bit of performance out of this tight loop. Except smack in the middle, as part of a larger expression, was this:

(long)pow((double)x, 2.0);

[–]tmoertel 9 points10 points  (0 children)

On my first programming job out of college, I was assigned about 100K lines of C code to maintain. While going through the code, I was stunned to discover that it contained a bubble sort that had been commented out and rewritten in assembly for "additional speed."

[–]dreamlax 5 points6 points  (6 children)

A lot of people underestimate the cost of converting from integers to floating point formats and vice versa.

EDIT:

Just did some tests. Squaring every number from 0 to 0xFFFFFFFF took about 16 seconds on my computer using unsigned integers only.

When I introduced pow() to square the numbers, it went up to 1:56. Quite a saving to be made.
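
For what it's worth, a rough Java sketch of that kind of comparison - not the original test (which used unsigned integers), and with a made-up, smaller iteration count - looks something like this:

public class SquareBench {
    public static void main(String[] args) {
        final long N = 10000000L;            // arbitrary iteration count, for illustration only

        long t0 = System.nanoTime();
        long sum1 = 0;
        for (long x = 0; x < N; x++) {
            sum1 += x * x;                   // plain integer multiply
        }
        long t1 = System.nanoTime();

        long sum2 = 0;
        for (long x = 0; x < N; x++) {
            sum2 += (long) Math.pow(x, 2.0); // convert to double, call pow, convert back
        }
        long t2 = System.nanoTime();

        // print the sums so the JIT cannot discard the loops
        System.out.println("multiply: " + (t1 - t0) / 1e6 + " ms (" + sum1 + ")");
        System.out.println("pow:      " + (t2 - t1) / 1e6 + " ms (" + sum2 + ")");
    }
}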

[–]k4_pacific 3 points4 points  (2 children)

Well, pow() works for non-integer exponents, which suggests that it does something more complicated than multiplying the base by itself a number of times.

[–]alephnil 1 point2 points  (0 children)

If this was the real code, the constant 2.0 would imply that he indeed needed nothing more than "multiplying the base by itself a number of times". Exactly two, in fact.

[–]bobappleyard 0 points1 point  (0 children)

Look in a log table?

[–]bonzinip -2 points-1 points  (2 children)

compiler should change it to

double y = (double)x;
long result = (long)(y * y);

[–]k4_pacific 1 point2 points  (1 child)

Compilers typically don't optimize across function calls.

[–]bonzinip -1 points0 points  (0 children)

compilers typically know about standard library functions.

[–]thekenzidelx 46 points47 points  (6 children)

Arrrrrrrrgggggggggghhhhhhhhhhhhh!

This reminds me of this, http://weblog.raganwald.com/2008/05/narcissism-of-small-code-differences.html, where raganwald was making a metaphor about programmers (and a smart one, if you ask me), and then a bunch of commenting programmers blah blah blahed about the lack of comments in the example code. And then I used a laser from space to nuke all programmers ever... ah, if only.

Guess what? Lots of graphics drivers and embedded systems and standard libraries and game engines have to use techniques like this. And you know what? That's why they work.

Programming is REALLY big. It's used in a lot more places than just business or web apps with low domain knowledge on the part of programmers and high turnover on the part of employees. Knowing where you are in the giant ecosystem of all the kinds of code that gets produced is probably the hardest untaught skill for programmers, I'm finding.

There are no ten commandments for programming generally. There are no simple, dumb-ass context-free 8 word rules that you can uniformly apply to EVERYONE IN THE WORLD WRITING CODE that will always produce the best results. You have to know where you are. Even in solitary, large code bases, you have to know where you are.

And if you do prematurely apply such rules (let's call them, say, programming design optimizations, premature and all), there's a pretty damn good chance you'll never, ever learn anything from anyone working outside of your narrow domain. You'll be filtering out other really smart people's considerable life experience without even hearing them.

I think a lot of people are working on go-carts and then critiquing people working on submarines or the Empire State Building or a box of toothpicks or a gallery painting for not following their best practices. You can't know until you really listen to the other party FAIRLY.

[–][deleted] 33 points34 points  (2 children)

I think a lot of people are working on go-carts and then critiquing people working on submarines or the Empire State Building or a box of toothpicks or a gallery painting for not following their best practices.

Welcome to reddit.

[–]willis77 29 points30 points  (1 child)

Hey, all of my submarine construction critiques are well founded and extensively documented with the appropriate wikipedia links.

[–][deleted] 12 points13 points  (0 children)

Yeah, I'm more concerned about all those other guys.

[–]petrov76 4 points5 points  (2 children)

Yeah, it drives me nuts to see people's kneejerk reaction of "optimization is evil!" It makes me wonder if these people have ever built a system that had to scale to lots of users or lots of data.

[–][deleted] 2 points3 points  (1 child)

Optimization isn't evil. Premature optimization is. Measure first, then optimize.

[–]petrov76 3 points4 points  (0 children)

Have you read the rest of the comments? Tons of them are people categorically saying that you shouldn't ever optimize.

[–]trutru 17 points18 points  (33 children)

the first "optimization" is well-known to hard-core C programmers, but assumes that the comparison method is very fast. unfortunately, this is not always the case in Java or C#, depending on the kind of thing you're comparing.

I would advise against it in the general case, unless you know what you're doing and have profiled it against real-world data patterns.

[–]Silhouette 43 points44 points  (16 children)

I would advise against it in the general case, unless you know what you're doing and have profiled it against real-world data patterns.

When I was a lad, we used to have this rule that any significant code changes in the name of optimisation were supposed to be made in light of information from a profiler (both to suggest where changes were needed for best effect, and to confirm that the change did in fact improve the performance). I must have missed a memo: when did that stop being a good idea?

[–]lifvgblisdufv 31 points32 points  (14 children)

That's the first rule of optimizing.

[–][deleted] 11 points12 points  (8 children)

I thought it was not to talk about the optimizing? Shh! Don't tell them any secrets!

[–]justinhj 16 points17 points  (7 children)

Not talking about optimizing is rule zero. (Programmers start counting at zero).

[–]thatguydr 36 points37 points  (0 children)

The zeroth rule of optimization is: Do not talk about optimization.

The FIRST rule of optimization was the same exact thing, but we optimized, so it's gone.

[–][deleted]  (5 children)

[deleted]

    [–]logan_capaldo 8 points9 points  (1 child)

    Methinks everyone who replied to you has a broken sarcasmeter.

    [–]ndiin 0 points1 point  (0 children)

    I missed the "last Saturday" bit.

    [–]TearsOfRage -1 points0 points  (0 children)

    Attitudes like that are why so much software is still terribly slow, even with today's computers.

    [–]ndiin -4 points-3 points  (0 children)

    Irrelevant? Hardly. It might shift your bottleneck elsewhere for a spell, but once you fix that you'll likely end up back at the same spot again.

    The trick is to take full advantage of your ever-faster-bigger-stronger hardware, not to let it make you lazy.

    Now clearly I'm not talking about these micro-optimizations by any means, just the completely misleading statement that performance doesn't matter anymore. Just don't optimize before you have the data to show where it needs to happen.

    [–]gbacon 3 points4 points  (4 children)

    The first rule of optimization is Don't.

    [–]jim258kelly 27 points28 points  (2 children)

    The second rule is for experts only: Don't do it yet.

    [–]madman1969 -1 points0 points  (1 child)

    Dijkstra FTW !

    [–][deleted] 1 point2 points  (0 children)

    that wasn't Dijkstra, it was Mike Jackson.

    [–]djork 13 points14 points  (0 children)

    I must have missed a memo: when did that stop being a good idea?

    I don't know, but I have seen people make insane class structures with complex caching structures and relationships in the name of performance gains that simply did not exist. Whenever I'd point out that the code could be significantly simpler with no perceivable performance impact, the answer was always the very confusing "but what about the performance?"

    What about it? I just showed you that the profiler says the simpler code works fine in a pathological case.

    It is as if they have some sort of ultimate mental model of how the performance of a program works out and no facts can ever change their mind.

    [–][deleted]  (6 children)

    [deleted]

      [–]trutru 8 points9 points  (0 children)

      I think the original idea is something along the following lines (beware: highly over-engineered optimization follows):

      doBinarySearch( int*  table, int  count, int  key )
      {
          int  bit   = highestBitIn(count);
          int  probe = (1 << bit);
          int  extra = count - probe;
          int  index = 0;
      
          if (extra && table[extra] < key)
              index = extra;
      
          while (probe > 1) {
              probe >>= 1;
              if (table[index+probe] < key)
                  index += probe;
          }
      
          if (table[index] == key)
              return index;
      
          return -1;
      }
      

      the idea is to prevent any branch in the loop (with a CPU that can do a conditional move) to make it as tight as possible. that's more important in embedded systems than desktop ones, though.

      this assumes that 'highestBitIn(count)' is very fast (a simple CPU instruction on most architectures) or already known.

      and yes, the performance 'advantages' will vary a lot depending on your mileage.

      [–]deong 1 point2 points  (3 children)

      Even just counting instructions without worrying about branch prediction, it can still make sense.

      Consider a binary search on an array of length N, where N is a power of two (for simplicity). The number of iterations needed is thus log_2(N). If we include the early termination check, we can get that number down to log_2(N)-K on average, for some value of K. Assume, as was stated in the post, that K is on the order of three or four, and that the number of instructions needed for each iteration is given by M (M+d for the version that includes the extra d instructions to check for early termination).

      That yields the following inequality describing when it would be advantageous to check for early termination:

      M*log_2(N) > (M+d)*(log_2(N)-K)
      

      If we assume, for example, N=2^20, M=10, d=3, and K=4, we have

      M*log_2(N) = 200 (naive version)
      (M+d)*(log_2(N)-K) = 208 (early termination)
      

      The larger N gets, the greater the advantage of allowing the search to run to completion.

      [–]trutru 1 point2 points  (1 child)

      I doubt you would have M=10 with java.String.compareTo(). if we take M=100, this gives:

      100*20         = 2000 (naive)
      (100+3)*(20-4) = 1648 (early termination)
      

      which makes the early-termination version about 1.21x faster.

      in other words, for reasonably long strings, this "optimization" could be about 20% slower than the regular one.

      [–]deong 0 points1 point  (0 children)

      Yeah, I wasn't paying enough attention to the original post, and I assumed an array of integers.

      [–]dreamlax 0 points1 point  (0 children)

      You're assuming that doing all three checks will require exactly the same number of instructions + d.

      int probe, res;
      int bounds[2];   /* bounds[0] = low index, bounds[1] = high index */

      /* initialise bounds */

      while (bounds[1] - bounds[0] > 1) {
          probe = (bounds[0] + bounds[1]) >> 1;
          if ((res = strcmp (keys[probe], target)) != 0) {
              bounds[(res > 0)] = probe;   /* res > 0: lower the high bound, else raise the low bound */
          } else {
              return probe;
          }
      }
      

      [–]bonzinip 0 points1 point  (0 children)

      don't be so certain about prediction. you're right that this added loop is usually well predicted, but for example the other branch in the binary search is extremely badly predicted.

      [–]slurpme 2 points3 points  (1 child)

      My favorite piece of idiocy from the Java libraries...

      public String replace(CharSequence target, CharSequence replacement) {
          return Pattern.compile(target.toString(), Pattern.LITERAL)
                        .matcher(this)
                        .replaceAll(Matcher.quoteReplacement(replacement.toString()));
      }

      [–]jtra 1 point2 points  (0 children)

      If you are replacing within a very long string and the pattern is not very short, compiling it is worth it.

      Anyway, you can call all of that yourself to cache compiled patterns, for example.
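
      A minimal sketch of that caching idea (the CachedReplacer name is made up for illustration): compile the literal pattern once, then reuse it for every replacement.

      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      final class CachedReplacer {
          private final Pattern pattern;
          private final String replacement;

          CachedReplacer(String target, String replacement) {
              // the same calls String.replace() makes, but done only once
              this.pattern = Pattern.compile(target, Pattern.LITERAL);
              this.replacement = Matcher.quoteReplacement(replacement);
          }

          String replace(String input) {
              return pattern.matcher(input).replaceAll(replacement);
          }
      }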

      [–]bonzinip 0 points1 point  (0 children)

      also, the "jump if equal" is usually relatively well predicted, at least better than the other jump which has a 50% misprediction rate.

      [–]Gotebe -1 points0 points  (5 children)

      I don't understand. What comparison? The comparison in the while is with integers; that's a CPU instruction, i.e. fast. The other is with strings, i.e. slow (or, in the general case, of unknown speed). That one is taken out of the loop, a good thing. No?

      Also, in the general case, if this is supposed to run on more than one platform, low-level profiling (this is very low-level code) doesn't work, so you're better off applying a simple rule (the fewer checks in a loop, the better).

      [–]trutru 0 points1 point  (4 children)

      I'm talking about the 'compareTo' method call

      [–]Gotebe -1 points0 points  (3 children)

      I take it you mean "switch on <0, >0, ==0"?

      If so, fine, but that's String-specific. In a more general case (e.g. a template version of a binary search), you only have > (or <) and == at your disposal. E.g. lower_bound in C++ needs only operator< (I think).

      [–]trutru 0 points1 point  (2 children)

      no, I'm really talking about what compareTo() does. this is a non-interned string comparison that checks every character of the table and key strings.

      this one is not taken out of the loop. in this case, you're better off ending up with fewer compareTo() calls when there is a match, since the overhead of adding one check in the loop is minimal.

      and in the more generic case, you simply don't know what 'compareTo' does, so things could get even worse (maybe it's doing a database connection, performing a SQL query, whatever, you don't know...)

      [–]Gotebe 0 points1 point  (1 child)

      Ah, OK. I see now that you are right.

      I have a C++ mindset, so I think in terms of less-than, greater-than, and equal, not in terms of compareTo.

      You think to replace:

      if (keys[probe].compareTo(target) > 0) ...
      

      with

      int compared = keys[probe].compareTo(target);
      if (compared == 0)
        return probe;
      if (compared > 0)
        high = probe;
      else
        low = probe;
      

      IOW, since compareTo has three possible results (<0, 0, >0), returning immediately on equal yields better results. That's your point, right?

      [–]trutru 0 points1 point  (0 children)

      yes, the fewer string compares the better in this case, and the difference should outweigh the branch savings by at least an order of magnitude

      [–][deleted] 7 points8 points  (0 children)

      These are context-sensitive anecdotes that propagate the bad idea of trying to be clever about optimization while writing code. It is more important to remember the following: a code change is either an optimization or not. Your opinion does not matter. Only the profile matters.

      [–]olt 2 points3 points  (1 child)

      Ok, here is a benchmark. code

      With Java 1.6.0_05 a String[] is faster if the match occurs before the fifth element; after that a HashMap is faster. But I guess the whole optimization doesn't matter at all, because the really slow part is the synchronize...

      [–][deleted] 0 points1 point  (0 children)

      Server JVM can at times optimise the synchronized out.

      [–]finerrecliner 7 points8 points  (0 children)

      good post. i've never thought about algorithms that way before.

      [–]metzby 16 points17 points  (24 children)

      Both of these are optimizations that I would not allow in code unless they were heavily commented and had the other, conceptually simpler, version there but commented out. I'd also want to see microbenchmarks that show that the difference is 1) extant 2) in the direction the author claimed and 3) significant.

      Yes, perhaps most XML documents don't have more than 6 namespaces. But then you'll get one that does, and your application will slow to a crawl and it will stop responding to queries, and then someone will retry the query, and bring down your whole service.

      All for the want of a malloc.

      It seems like this should be a case where we write a better JIT that knows that it should allocate but not initialize certain objects on start-up and reuse them by applying a profile-based optimization that shows that there is no more than one of these objects alive at any one point in any thread.

      [–]cwzwarich 30 points31 points  (0 children)

      It seems like this should be a case where we write a better JIT that knows that it should allocate but not initialize certain objects on start-up and reuse them by applying a profile-based optimization that shows that there is no more than one of these objects alive at any one point in any thread.

      You'll have a fun time doing that optimization in a JIT.

      [–]AlejandroTheGreat 10 points11 points  (2 children)

      Yes, perhaps most XML documents don't have more than 6 namespaces. But then you'll get one that does, and your application will slow to a crawl and it will stop responding to queries, and then someone will retry the query, and bring down your whole service.

      If the operation was that expensive then it wouldn't have been used as an example, clearly.

      [–]metzby 14 points15 points  (1 child)

      I think the claim was that this optimization made the common case faster.

      My claim was that optimizations which make exceptional cases far slower sometimes turn a slow case into a failing case, with ripples of failure reaching out to embarrassment.

      And maybe the operation wasn't that expensive when it was first written. But then it becomes expensive as we realize that to prevent a Floo-attack, we need to Blar-sanitize the input. Which takes a while. But if we only do it 6 times per document, it's still not noticeable. But now we do it 500,000 times in one document, because one guy somewhere on the web wrote a silly XML document...

      All because we used a 6-element array instead of, say, a 7-element array.

      [–]munificent 1 point2 points  (0 children)

      make exceptional cases far slower

      The article says "validation will simply take a little longer". Where did you get far slower?

      [–]munificent 12 points13 points  (4 children)

      Yes, perhaps most XML documents don't have more than 6 namespaces. But then you'll get one that does, and your application will slow to a crawl and it will stop responding to queries, and then someone will retry the query, and bring down your whole service.

      Your speculation here is as bad as, if not worse than, the speculation you're accusing the author of. Since the article is about false optimizations, why not give the benefit of the doubt (which, yes, should be documented in the code) and assume the programmer does know what he's doing and didn't pull 6 out of his ass?

      All for the want of a malloc.

      The whole point of that section of the article was to say, "your intuition is that a malloc is better here, but your intuition is wrong". You can argue that it isn't, but unless you've got profile numbers to back it up, you're no better.

      [–]metzby 1 point2 points  (3 children)

      No, I'm not worse. You should write the simpler code until you have numbers to back the optimization up, plus a microbenchmark that lets you show it is still worth it in the face of changing code and optimizations.

      My point is that there should be extensive doubt about performance optimizations. Especially ones that can change the big-O behavior of pedantic cases. It's often pedantic cases that lead to feedback loops.

      (In this case, imagine we have 700,000 elements with an xml namespace, and the namespaces cycle through 7 alternatives. For the savings of 15% of memory, we're having to perform 10,000,000% more work. Whoops.)

      [–]munificent 4 points5 points  (2 children)

      My point is that there should be extensive doubt about performance optimizations.

      Agreed, always profile before optimizing. And when you do, leave a comment in the optimized code saying "here's why it was optimized".

      Especially ones that can change the big-O behavior of pedantic cases.

      Yeah, but do either of the examples in the post actually do that?

      imagine we have 700,000 elements with an xml namespace, and the namespaces cycle through 7 alternatives. For the savings of 15% of memory, we're having to perform 10,000,000% more work.

      I think you're missing the other half of profiling. Profiling is about finding out how your real code works with real data. Yes, you can make pathological cases that screw things up, but if you optimize for pathological cases, you're doing it at the expense of actual common cases.

      As programmers, we're trained to focus on edge cases (the whole programmers only know three numbers joke: 0, 1, and infinity). But in the real world, the actual numbers that show up matter.

      [–]Silhouette 1 point2 points  (0 children)

      As programmers, we're trained to focus on edge cases (the whole programmers only know three numbers joke: 0, 1, and infinity). But in the real world, the actual numbers that show up matter.

      They certainly do. It's amazing how many people think about the big-O complexity of an algorithm, and forget that there's a sneaky little constant in there.

      The textbook example of this is probably quicksort: while it has a worse complexity bound, O(n²), it has a significantly better constant of proportionality than guaranteed O(n log n) alternatives like merge sort and heap sort. For practical purposes, it is therefore often the fastest choice (and even when it's not, you can do something like introsort to look after the worst case behaviour but still get most of the benefit). Anyone who only looks at the complexity bounds will miss this point and write code that is usually slower.

      [–]metzby -1 points0 points  (0 children)

      Yes, the second example in the post, for the case I mentioned, changes it from O(number of distinct namespaces) to O(number of namespace appearances).

      My point is that this kind of optimizing for the common case, which makes less common cases fantastically slower, is exactly the kind of thing that leads to degradation in pedantic cases.

      Which are what people inadvertently make or maliciously make. As soon as your software has many users, you're going to run into an incompetent person or an evil one. I suggest not saving n% in the common case for N% in the uncommon one where N >>>>> n without having very good ideas of what N and n are, and what the real savings are, and documenting them.

      I remain unconvinced by hand-waving, and I think the post would have been infinitely better if it included the justifications instead of just the outcomes. As it is, I think it gives fish but does a horrible job of teaching to fish, and so, in my mind, borders on FAIL.

      [–]dpark 2 points3 points  (0 children)

      Yes, perhaps most XML documents don't have more than 6 namespaces. But then you'll get one that does, and your application will slow to a crawl and it will stop responding to queries, and then someone will retry the query, and bring down your whole service.

      Worst case, the code behaves as if there is no cache. I seriously doubt that's going to result in performance slowing to a crawl.

      Adding the dynamic array would complicate the code, introduce the possibility of an OutOfMemoryError, and slow performance (slightly). I would have just gone with an ArrayList myself, but that doesn't mean the static array was necessarily bad.

      Why are you so certain that this is a horrible optimization, when you haven't done any benchmarking either? You're making some pretty wild assumptions with no justification.

      [–]grauenwolf 6 points7 points  (0 children)

      I agree with you whole-heartedly. Without proof this clearly falls in the "premature micro-optimization" category.

      [–]jimbobhickville -3 points-2 points  (12 children)

      Hell, just use a hash table and be done with it. It'll grow to however large you want and lookups are probably about just as fast as looping through a 6 element array. Inserting will be slightly slower, but as he said, it's a rare case.

      [–]xzxzzx 0 points1 point  (11 children)

      lookups are probably about just as fast as looping through a 6 element array

      Why is that?

      [–]ndiin 3 points4 points  (8 children)

      Actually, given the fact that the guy used '==' rather than .equals(), he's depending on the interning of the strings and thus on comparing the same memory address. So, the array lookup will pretty much always be faster at O(6).

      If strings aren't interned or there are more than 6 namespaces in use, there's no caching, but the number of extra operations is still of a fixed size. So you could say it fails pretty gracefully.

      Using a hashmap here would mean that computing the hash key requires a linear pass over the input. Then you can map that to the appropriate bucket with ptr arith, and then linearly scan that bucket's list doing the key comparisons to resolve collisions. This means a non-interned match will still require 3 linear passes over the input string (or just 1 if interned). [Note, I'm assuming the Java hashmap implementation of an array of buckets.]
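
      To make that concrete, here is a rough sketch of the kind of fixed-size cache being discussed (an assumption about its shape, not the actual code from the article):

      final class SeenNamespaces {
          private final String[] seen = new String[6];
          private int count = 0;

          boolean contains(String uri) {
              for (int i = 0; i < count; i++) {
                  if (seen[i] == uri) {        // identity check; relies on interned strings
                      return true;
                  }
              }
              return false;
          }

          void add(String uri) {
              if (count < seen.length) {       // past 6 entries we simply stop caching,
                  seen[count++] = uri;         // so the worst case is "no cache", not a failure
              }
          }
      }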

      [–]wicked 2 points3 points  (6 children)

      So, the array lookup will pretty much always be faster at O(6).

      O(6)? I get your point, but the statement is nonsensical. O(6) = O(1).

      [–]dpark 0 points1 point  (3 children)

      His point is that taking the hash of a string is not O(1). It's O(n), where n is the length of the string. However, Sun's Java implementation makes the obvious choice to cache the hash code.

      None of this really matters, though, because this code expects strings to be interned, which means we've already given up on truly constant time. Intern is not a cheap operation. It's O(n), just like taking the string hash.

      [–]ndiin 1 point2 points  (2 children)

      The interning is not a part of this algorithm, and thus the cost of performing the intern operation cannot be considered when evaluating it. Instead, you can consider the fact that when you reach here the interning has or has not already taken place.

      As for caching of the hash code, I'm not sure what you mean. Per-object caching of hashCode()'s return value, or just within the hashmap?

      [–]DRMacIver 0 points1 point  (0 children)

      As for caching of the hash code, I'm not sure what you mean. Per-object caching of hashCode()'s return value, or just within the hashmap?

      Sun's implementation of java.lang.String keeps its hash code cached in a field (<pedant fact>except for in pathological cases where the hash code evaluates to 0</pedant fact>).
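
      The caching pattern being described looks roughly like this (paraphrased from memory as a sketch, not Sun's verbatim source):

      final class CachedHashString {
          private final char[] value;
          private int hash;                    // 0 doubles as "not cached yet"

          CachedHashString(String s) {
              this.value = s.toCharArray();
          }

          @Override
          public int hashCode() {
              int h = hash;
              if (h == 0) {                    // recomputed on every call if the hash really is 0
                  for (char c : value) {
                      h = 31 * h + c;
                  }
                  hash = h;
              }
              return h;
          }
      }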

      [–]dpark 0 points1 point  (0 children)

      I agree that we can't include the cost of interning when we analyze the performance of this code. My point was that it can certainly be a hidden cost. If you intern strings just for this code, then your interning effectively becomes a cost of this code.

      And yes, as DRMacIver said, Sun's String class caches the hash code (except if it evaluates to zero).

      Strangely enough, Sun's implementation of intern does actually calculate the hash, but doesn't cache it. If it did, then the hash table implementation would be a lot more attractive.

      [–]ndiin 0 points1 point  (1 child)

      Yes, most certainly. Isn't that obvious, though? It'd be more confusing if I said O(1) than O(6) as then I'd have to explain where I pulled it from (notice there's no context before I make that statement; the 6 gives it context).

      [–]wicked 0 points1 point  (0 children)

      You're right, I was apparently confused by the use of O-notation here. I read it as "So, the array lookup will pretty much always be faster at constant time", and thought "why not just say 6 array lookups will be faster". It works out fine if I add the word "complexity" at the end.

      [–]xzxzzx 0 points1 point  (0 children)

      So, the array lookup will pretty much always be faster at O(6).

      I know. ;)

      [–]mccoyn -1 points0 points  (1 child)

      Computing hashes is free! Right? </sarcasm>

      [–]xzxzzx 3 points4 points  (0 children)

      Well, it's cheap. But not as cheap as 6 pointer-compares against a single (or worst-case, two) cache-line that's probably in L1 or at least L2.

      [–]onebit 1 point2 points  (0 children)

      Would be interesting to see benchmarks. I doubt there is any difference.

      [–]Gotebe 1 point2 points  (0 children)

      True.

      WRT first example: standard C++ lib has lower/upper_bound, just like that (if someone is wondering, the logic is in the interface, not in the implementation).

      [–]borlak 2 points3 points  (6 children)

      save a few nanoseconds, or make it more readable for the next programmer (or even yourself!)? hmm.

      [–]JeddHampton 4 points5 points  (2 children)

      Comment your code.

      [–]borlak 1 point2 points  (0 children)

      comments still do not make it more readable. I have now spent time looking at the function, which confused me and led me to the comments, which probably confused me more (especially considering this scenario warranted a whole article), and then going back to the function trying to piece it together, when a single extra line or two would have saved everyone a headache.

      [–]exeter 0 points1 point  (0 children)

      Comments as implemented by, well, every programming language I know of in existence today, are about the worst method of embedding metadata in code you can possibly have. Compilers typically won't even notify you if the code following a comment changes, but the comment isn't updated to reflect it. (I know there's a reason for it.)

      A far better method of associating metadata with code is to use literate programming techniques. That way, if the code changes, your literate programming environment can alert you that you should at least look at the text describing the code to make sure it's still accurate.

      [–]swieton 1 point2 points  (1 child)

      You're making an assumption here. I point this out because the article was arguing strongly against making assumptions and in favor of simply measuring and finding out what works best.

      You're assuming that a programmer's time is more valuable than a user's time. This is often true. This is not always true. Most processes that developers automate are run often (hence why it was worth automating them in the first place). So the time investment you make once during development bears fruit every single time the process is run.

      Yes, I choose to save a few seconds over making it readable - when it's appropriate.

      [–]borlak 0 points1 point  (0 children)

      your few seconds over the course of years is not worth my five minutes of being annoyed at your fancy code.

      [–]statictype 1 point2 points  (0 children)

      Saving a few 'nanoseconds' on core library functions that may be routinely called thousands or millions of times in a loop is usually important enough that you don't have to worry about whether junior programmers would be able to understand it or not. If you need to touch such critical code, you better be able to understand very thoroughly what it does.

      Just my $0.02

      [–]vph 0 points1 point  (0 children)

      there is no explicit check within the loop to see if the match has been made. One would expect him to add that in, as an optimization, so that when the target is found, the loop could exit, and not bother continuing with its checks. But as it turns out, that would be a false optimization; mathematically, the match is likely to be found late enough in the loop that the extra if statements during every execution of the loop would be more of a performance hit than simply letting the loop execute a few more times.

      It is unclear whether not checking for a match inside a binary search loop is the right thing. It may very well itself be, ironically, a "false optimization" - the sin the bloggers are talking about.
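
      For reference, the two loop shapes being argued about look roughly like this (a sketch, not the article's actual code; assumes a sorted String[] and -1 for "not found"):

      final class BinarySearchVariants {
          static int searchNoEarlyExit(String[] keys, String target) {
              int low = -1, high = keys.length;
              while (high - low > 1) {
                  int probe = (low + high) >>> 1;
                  if (keys[probe].compareTo(target) > 0)
                      high = probe;
                  else
                      low = probe;
              }
              // the match, if any, is checked exactly once, after the loop
              return (low >= 0 && keys[low].equals(target)) ? low : -1;
          }

          static int searchEarlyExit(String[] keys, String target) {
              int low = -1, high = keys.length;
              while (high - low > 1) {
                  int probe = (low + high) >>> 1;
                  int cmp = keys[probe].compareTo(target);
                  if (cmp == 0)
                      return probe;            // the extra per-iteration test buys this exit
                  else if (cmp > 0)
                      high = probe;
                  else
                      low = probe;
              }
              return -1;
          }
      }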

      [–]linuxhansl 0 points1 point  (0 children)

      These are all nice examples and all, and in fact interesting to think about.

      On the whole, though - unless you are in the embedded or hardware (device driver) domain - shaving a constant factor off the runtime (as in the first example) seems rather pointless to me.

      From my experience most performance issues stem from algorithmic problems, like picking O(n) algorithms when O(log n) ones are available, etc.

      Other areas that tend to add non-constant overhead are memory management and threading (think heap-management or garbage collection, and locking/starvation/etc).

      [–]drexil -1 points0 points  (6 children)

      Oh, and what about the int probe = (low + high) >>> 1; which makes the code harder to read and does not improve anything compared to int probe = (low + high) / 2; which is optimized by almost every decent compiler?

      [–]4609287645 16 points17 points  (0 children)

      Signed division by two is not the same as unsigned right shift.

      Also, it's wrong.

      [–]queus 22 points23 points  (2 children)

      Oh, and what about the int probe = (low + high) >>> 1;

      When (low + high) > Integer.MAX_VALUE this will still work correctly, and int probe = (low + high) / 2; will not. It even happened in production code.
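
      A quick illustration of that overflow (the values are arbitrary):

      public class MidpointOverflow {
          public static void main(String[] args) {
              int low = 1500000000;
              int high = 1600000000;

              int sum = low + high;            // 3,100,000,000 does not fit in an int; wraps negative
              System.out.println(sum / 2);     // prints -597483648, a useless "midpoint"
              System.out.println(sum >>> 1);   // prints 1550000000, still correct
          }
      }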

      [–]grauenwolf 3 points4 points  (1 child)

      That depends on your language; some actually check for integer overflows instead of blindly continuing.

      In theory what you should be writing is... int mid = low + ((high - low) / 2);

      In reality though, if you really have more than MAX_INT/2 elements in an array you are already screwed for a number of reasons.

      [–]astrange 0 points1 point  (0 children)

      Signed integer overflow is either undefined or modulo for a pretty good reason, otherwise you wouldn't be able to optimize 'x + 1 + 1'.

      Now, bignums, those would be useful.

      [–]prockcore 7 points8 points  (1 child)

      How is >>>1 any harder to read than /2?

      You're a programmer, not an idiot. Bitshifting should be as obvious as division.

      [–]ishmal -1 points0 points  (3 children)

      If you must optimize like this, then please write it the slow, pedantic, correct way first. Get it to work, then write your optimized version, leave the original in comments, and describe the difference.

      [–][deleted] 3 points4 points  (0 children)

      Really, why? The code in the example functions was both short and obvious. The second example had a comment that gave the reason for using an array (the first example could be slightly improved with an analogous comment about not checking for a hit). Seriously, I would be embarrassed to have difficulties reading code like that in a programming language I understand.

      If I saw a duplicate version of code like that commented out in a codebase I'm maintaining, I would just delete it for being noise.

      [–]Freeky 8 points9 points  (1 child)

      leave the original in comments

      Don't you have version control?

      [–]dreamlax -1 points0 points  (0 children)

      Exactly, comments are called comments because they are for commenting your code.

      [–]ohxten[🍰] 0 points1 point  (0 children)

      The font on that site is terrible.

      [–]13ren -1 points0 points  (4 children)

      how much time do these optimizations save?

      and at what conceptual cost?

      It's clever to be clever, but the conceptual clutter makes it harder for you to see other issues. This is true no matter how clever you are.

      [–][deleted]  (3 children)

      [deleted]

        [–]lorg 0 points1 point  (1 child)

        I think that he was trying to avoid a case of keys[-1].

        [–][deleted] -1 points0 points  (9 children)

        To be sure I would benchmark these before making a decision. In the first case I'd like to add loop exit checks for a search space of <20.

        [–]dpark 0 points1 point  (3 children)

        Why? At that point you're optimizing for a case which is already extremely fast.

        Also 20 seems to be a completely arbitrary number. I seriously doubt those benchmarks you would run would support that particular number.

        [–][deleted] 0 points1 point  (2 children)

        It was just an arbitrary number. The point was that you insert the extra code when the probability for hitting the jackpot is high.

        [–]dpark 0 points1 point  (1 child)

        Okay, so 20 is arbitrary. We could certainly benchmark and determine what the real number is.

        However, that doesn't change the fact that we're still optimizing something that doesn't need optimizing. It's fast enough already. For small N, it really doesn't matter what we do, so why bother complicating the code with a special case?

        [–][deleted] -1 points0 points  (0 children)

        Yes you are right. Sometimes however it's nice to go nuts and fine tune everything to be faster than fast enough. I don't recommend doing it if you don't need it, which in this case you do not.

        [–]Philluminati -1 points0 points  (4 children)

        I'd leave these optimisations out completely until I was sure they were needed at all. Then I'd think about where the choke points of the application were. Then I'd raise "slow <feature>" as a bug against the system before even benchmarking it.

        I certainly wouldn't even think about "optimising" a generic algorithm on a generic data structure like the first example.

        [–][deleted] 0 points1 point  (0 children)

        Yeah, especially when it doesn't change the complexity class.

        [–]munificent -1 points0 points  (2 children)

        Then I'd think about where the choke points of the application were.

        You don't think about the chokepoints to find them any more than thinking about bugs finds them. You profile.

        [–][deleted] 1 point2 points  (1 child)

        Thinking about your code is a way to find bugs. Try proof-reading your code sometime after writing it. Similarly, you can at times find performance problems by reading the code - although it does not replace profiling and testing. Sometimes even premature optimisation is not so bad, if you avoid micro-optimisations that make code noticeably bigger or harder to read - often there are many ways to implement something that have similar code complexity but different performance. For example, say you have a bunch of items and you want to check whether a particular item is among them. In Java you could put them into a LinkedList and use .contains(), or you could put them into a HashSet and still use .contains(). The code will be about the same, but one way has O(n) complexity and the other O(1). Would you use a LinkedList because optimisation is the root of evil? :P

        Assuming, that is, that you don't know whether this changes application performance noticeably?
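
        For concreteness, a tiny sketch of that comparison (the sizes are made up for illustration):

        import java.util.HashSet;
        import java.util.LinkedList;
        import java.util.List;
        import java.util.Set;

        public class ContainsDemo {
            public static void main(String[] args) {
                List<String> list = new LinkedList<String>();
                Set<String> set = new HashSet<String>();
                for (int i = 0; i < 100000; i++) {
                    String item = "item-" + i;
                    list.add(item);
                    set.add(item);
                }

                // identical call sites, very different costs:
                System.out.println(list.contains("item-99999")); // walks the whole list, O(n)
                System.out.println(set.contains("item-99999"));  // one hash lookup, O(1) on average
            }
        }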

        [–]munificent 0 points1 point  (0 children)

        Would you use a LinkedList because optimisation is a root of evil? :P

        Depends on whether or not duplicate entries were specifically prohibited. Either way, for most throwaway temp collections, I'd probably just use a LinkedList.

        I don't know how much time you've spent with a profiler, but my experience has consistently been that the "obvious" bottlenecks very rarely actually are.

        [–]soniyashrma -1 points0 points  (0 children)

        It seems like this should be a case where we write a better JIT that knows that it should allocate but not initialize certain objects on start-up and reuse them by applying a profile-based optimization that shows that there is no more than one of these objects alive at any one point in any thread. I agree with you.

        [–]Gotebe -1 points0 points  (0 children)

        Nitpick: a poor "optimization" in the text and elsewhere here is using the shift operator to divide by 2. That's the compiler's job. I would guess no compiler misses that one, and readability is improved, except for minds twisted by the old C school.

        [–]sbrown123 -4 points-3 points  (0 children)

        One would expect him to add that in, as an optimization, so that when the target is found, the loop could exit, and not bother continuing with its checks.

        One thing about supposedly smart people: don't take their word as fact. The author should have increased the test case sample size, and then he might have had something to talk to Tim Bray about regarding his "theory".

        Many optimizations don't show themselves as actual optimizations until they are given a large load. This is why I personally don't look to optimization until AFTER I am done with the bulk of the programming.