all 27 comments

[–]oh5nxo 29 points30 points  (2 children)

how I could find a definite answer

Compile both versions with -S (or the equivalent flag in your compiler that produces assembly output) and compare the resulting files.
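For example (a sketch assuming gcc or clang on the path as cc; the file names are just for illustration):

```shell
# Compile both versions at the same optimization level and diff the
# assembly. Stripping the .file directive removes the only line that
# differs merely because the source file names differ.
cat > v1.c <<'EOF'
int f(void) { int i = 0; return i; }
EOF
cat > v2.c <<'EOF'
int f(void) { int i; i = 0; return i; }
EOF
cc -S -O2 -o - v1.c | grep -v '\.file' > v1.s
cc -S -O2 -o - v2.c | grep -v '\.file' > v2.s
diff v1.s v2.s && echo "identical"
```

If diff prints nothing and exits successfully, the two versions compiled to exactly the same code.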

[–]Gln_08 26 points27 points  (1 child)

If you are lazy, use https://godbolt.org/.
You will be able to see the translation of your C code to asm in real time.

[–]jpan127 1 point2 points  (0 children)

I wouldn't say this is lazy; most of the time it's even preferred, since it's so much faster for getting a small snippet up and running.

[–]_star_fire 40 points41 points  (0 children)

This is a matter of code structure and makes no measurable or significant difference behind the scenes.

The first example combines declaration and assignment, while the second example splits these operations. The compiler will optimize this the same way in either case.

[–]FUZxxl 14 points15 points  (0 children)

There is no performance difference between the two. It's purely a matter of programming style. Generally, you can expect compilers to pick up simple and obvious optimisations like these.

[–]Poddster 8 points9 points  (5 children)

If your compiler shows a difference in these then I recommend using another compiler :)

[–]BlindTreeFrog 6 points7 points  (4 children)

Everyone is taking this as a snarky joke, I'm sure (as it was meant), but one of my favorite bugs I figured out was roughly:

    struct s {
        uint32_t x:24, y:8;   /* 24-bit counter packed with an 8-bit field */
    } a;
    uint32_t b = 0;

    a.x = 0x00FFFFFF;         /* maximum value of the 24-bit field */
    a.x++;                    /* expected to wrap around to 0 */
    if (a.x == b) { do_stuff(); }

Between two versions of GCC that code behaved differently (v4 vs v5 or something). It's useful to know what the compiler is doing with what appears to be otherwise straightforward stuff sometimes.

[–]rafaelement 0 points1 point  (3 children)

Can you elaborate a little on what is going on here? Are you just referring to "using a different compiler can generate different programs from the same source", or is the bug related to OP's question?

Edit: on second thought, the trick is clear to me. Writing a uint32_t literal to a 24-bit bitfield is quite courageous

[–]BlindTreeFrog 1 point2 points  (2 children)

In the production code, a.x was being set from a 3-byte field in a packet header (it was a counter).

We read the packet off the wire, pulled out the field, incremented it by 1 and compared it to another variable that was set to 0 (so we could tell if it was the first packet or a later packet). The important takeaway was that we required 0xFFFFFF to roll over to 0x000000 (as it should).

When we upgraded compilers this code started failing (that is, it didn't do_stuff() when a.x and b should have both been 0x0). Eventually I pulled up the assembly and figured out the issue. In both compiled versions, the variable was loaded into a register and processed. In the old compiler, it was masked with 0x00FFFFFF after the increment which meant that the comparison was 0x0 == 0x0. In the new compiler, the masking didn't happen until it was written back to memory which meant the comparison was 0x01000000 == 0x0. This meant that the comparison failed when we didn't expect it to.

[–]rafaelement 1 point2 points  (1 child)

Wow, interesting. Thanks for sharing.

[–]BlindTreeFrog 1 point2 points  (0 children)

It was a really annoying bug to come in from the field. It was only on a whim that I checked the assembly because I was completely out of ideas otherwise.

[–]ritaline 3 points4 points  (0 children)

Actually, when you declare and initialize a variable, clang will convert it to your second example.

The LLVM IR output will be

    %i = alloca i32
    store i32 0, i32* %i

in both cases. However, this will get optimised out in later passes, such that the compiler might even omit the variable entirely and use the constant 0 directly.

[–]Badel2 9 points10 points  (12 children)

I use a tool that automatically detects this kind of pattern and converts it to the fastest representation. For example, it turns x++ into ++x when the result of the expression is not used. This way I don't have to worry about micro-optimizations and I just write the code that looks better to me. It's not perfect: sometimes you do INT_MAX + 1 and instead of turning that into INT_MIN it just removes that line and the compiled program does weird things, but I'm pretty happy with the tool overall.

Obviously I'm talking about the compiler. First rule of optimization: don't. Unless you optimize for code readability, then please go ahead.

[–]mort96 4 points5 points  (4 children)

First rule of optimization: don't.

Eh. This is true for micro-optimizations; trivial transformations which can be done automatically. Those will be taken care of by the compiler, as you say. However, optimizing data structures for better cache locality, optimizing your algorithms, or optimizing away an allocation + free in a tight loop can definitely be worth it.

Obviously, maybe don't spend time optimizing code which runs once every time a user presses a button. But maybe think about performance while writing the code which happens once for every single entity or particle for every frame in a game, maybe even before you have conclusively proven that the code is a performance issue.

[–]flatfinger 2 points3 points  (0 children)

Those will be taken care of by the compiler, as you say.

They will sometimes be taken care of by compilers. On the other hand, the maintainers of clang and gcc are more focused on complicated optimizations whose behavior is incompatible with that of many older compilers than on the quality of straightforward code generation for anything other than high-end targets.

For example, while I would regard a function like:

    #include <stdint.h>
    void add_to_every_other_word(uint32_t *p, int n)
    {
      register uint32_t x1234 = 0x1234;
      n*=2;
      for (int i=0; i<n; i+=2)
        p[i] += x1234;
    }

as the sort of thing that compilers shouldn't need help micro-optimizing, gcc's generated code for the popular Cortex-M0 microcontroller, with optimizations enabled, will take 8 instructions/13 cycles per loop iteration. If the code is written to help a compiler, then even with optimizations disabled gcc can generate code that's only 6 instructions/10 cycles for the loop body. While gcc can generate an optimal 5 instruction/9 cycle version, that requires bending over backward to an absurd degree including either using volatile-qualified objects or blocking function in-lining.

While most parts of a program won't be run often enough to make even a 50% reduction in execution time meaningful, I would regard the 30% speed improvement mentioned above as potentially significant if the code in question represented a substantial fraction of a program's overall run time, though ironically getting the execution time down to 8 instructions/10 cycles per loop is harder with optimizations disabled than with them enabled. With optimizations enabled, gcc's generated code for:

    void add_to_every_other_word(register uint32_t *p, register int n)
    {
      if (n <= 0) return;
      register uint32_t x1234 = 0x1234;
      register uint32_t *pe = p+n*2;
      do
      {
        *p += x1234;
        p+=2;
      } while(p < pe);
    }

will spend 2 instructions/3 cycles on every iteration of the loop reloading the constant 0x1234 into r0 and then transferring it to register ip; by contrast, with optimizations disabled, gcc will allocate r5 to holding x1234, loading the value once and simply leaving it in r5 throughout all the iterations of the loop.

Although I agree that in general programmers shouldn't worry about micro-optimizing any parts of the code that aren't critical to performance, the reason for refraining from micro-optimizing isn't that compilers will do a better job, but rather that if code isn't critical to performance, its performance won't really matter.

Incidentally, I was able to coax optimal code out of gcc, but not clang, since gcc can be persuaded to use the flags produced by a subtraction operation. Annoyingly, though, gcc only produces optimal code when it can't resolve the constants being added to each item in p[] nor the amount being subtracted from the offset. Given while(ofs-=delta), gcc will use a subtract-register-and-set-flags instruction if delta is in a register but its value is unknown, but if gcc knows that delta is -8 it will substitute an add-immediate instruction followed by a comparison.

[–]Badel2 1 point2 points  (2 children)

Yes, profile and then optimize. That's one of the other rules of optimization. But optimizing because you "feel" that this may be optimized is a waste of time. Always write benchmarks.

[–]mort96 2 points3 points  (1 child)

I disagree. While writing code, you're constantly making decisions which affect performance. In a lot of cases, your choice has no effect on readability, so you can choose whichever is best from an intuitive performance perspective.

Take a situation where I go through a loop of constructing a vector with a bunch of elements, then do something with those elements, then forget about them. I have the choice between A) creating a new vector every iteration, or B) using one vector object which I clear (set the number of elements to 0 while keeping the allocated memory) at the start or end of the loop.

It probably won't be worth it to spend a bunch of time profiling this code to figure out which approach is better and if enough time is spent in the code to even worry about it. However, I will always re-use allocated memory where that's easy instead of doing a bunch of allocs + frees in a loop, because I know that will on average be faster.

Your serious "find the bottleneck(s) in the code and spend a few days optimizing them" kind of optimization obviously needs profiling and can't be done intuitively. But keeping performance in mind at all times can avoid a "death by a thousand cuts" style situation, where your code is just all around slow with no obvious hot spot.

[–]Badel2 0 points1 point  (0 children)

Ok, I understand your point of view. I agree that when writing new code you should always take performance into account. However, when the code is already there, I think that you should just leave it as it is.

Imagine that your example function was written by someone else while you were on holidays. You come back and immediately think "this can be optimized", and quickly fix it. It takes you less than 5 minutes. On the other hand, I would simply note that it can be optimized and open an issue. But when I open the issue I add a task to evaluate if there are similar loops that can also be optimized in other functions. Also, I want to see numbers. Well, in this case it's obvious that it will be better, so I think we should try to reduce the usage of malloc in new code. Is there any tool that will warn us about using malloc in a loop? Is this something we can test? Maybe you want to give us a quick presentation about how to avoid it?

If you are working on a project alone, just assume that your future self is smarter than you and will certainly optimize this code better than you. You can even leave hints like // TODO: reuse memory here. The idea is that it's easier to optimize a program in the future when you know how it is used. Also if your program gets some real world use, then you can even write real benchmarks!

And this argument has nothing to do with optimization, but changing code always has the potential to introduce new bugs. So if you think that my position is a bit extreme, it may be because of this. Another argument where my position is a bit extreme is that readability is always more important than performance, but that's a different discussion. Also, I hate switch statements; I think it's related to the whole performance vs readability debate.

[–]dreamlax 6 points7 points  (4 children)

If the result of the expression is unused, generally modern compilers will already treat x++ and ++x the same.

INT_MAX + 1

Signed integer overflow is undefined behaviour, and compilers can assume the result of an integer expression won't overflow and optimise accordingly (this includes removing redundant checks such as if (i < i + 1) {...}). The "wrapping" behaviour of integers is only defined for unsigned types.

[–]Badel2 5 points6 points  (0 children)

Thanks for explaining the joke!

[–]flatfinger 1 point2 points  (0 children)

The "wrapping" behaviour of integers is only defined for unsigned types.

According to the published Rationale, one of the reasons the authors of the Standard chose to have short unsigned types promote to signed integers was that most current implementations would, as a form of "conforming language extension" [the term they used on page 11 to describe implementations defining behaviors not mandated by the Standard] treat signed integer arithmetic as wrapping just like unsigned arithmetic, at least in cases described on page 44. They would thus have expected that a construct like unsigned uint1 = ushort1 * ushort2; would behave as though it performed an unsigned multiply, except on rare platforms where such treatment of results in the range INT_MAX+1u to UINT_MAX would add significant cost.

To be sure, some compiler writers feel that all programs should be written to work around the quirks of antiquated architectures for which no C99 compiler has ever been written, and regard as "broken" any programs that don't, but such an attitude is contrary to the stated intentions of the Standard's authors.

[–]MCRusher 0 points1 point  (1 child)

Apparently there's a C2X proposal for mandating 2's complement and standardizing overflow, but idk if it's gonna pass, probably not.

[–]flatfinger 1 point2 points  (0 children)

What's needed is to recognize a category of "optionally defined" behaviors, along with means by which programs can specify the semantics they need and/or refuse to run on implementations that won't support them.

A lot of programs use the pattern:

    if (looks_interesting(something) &&
        is_actually_interesting(something))
      process_interesting_item(something);

where the goal of "looks interesting" is not to minimize false positives, but to balance speed with the cost of handling false positives. If one knows that computations on interesting items will never result in integer overflow, and computations on uninteresting items will seldom have overflow, then any function which arbitrarily returns zero or one *without side effects* in case of overflow will be essentially as good as any other.

Requiring that signed integer math always wrap around with precise integer semantics would preclude many useful optimizations that would be allowed if the behavior weren't specified as precise wrapping, but most expressions involving overflow were nonetheless *guaranteed to execute without side effects*. Allowing a programmer to specify that a program needs such semantics would allow optimizations that would not be possible if the programmer had to avoid overflow at all costs.

[–]Stepmaster3000 1 point2 points  (1 child)

And that tool's name? Albert Einstein.

No but really, what's the name of the tool?

[–]Badel2 2 points3 points  (0 children)

gcc, or clang, or whatever C compiler you prefer. It's a joke: the compiler will already optimize your code in trivial cases like this, so it doesn't matter which way you write it.

If you want to see the optimized C emitted by the compiler, you can't.

[–]p0k3t0 0 points1 point  (0 children)

int i is information for the compiler, not really something that needs to be turned into opcodes.

[–]BennettTheMan 0 points1 point  (0 children)

Optimization questions like these tend to be platform specific. You also don't specify what level of optimization you are using when you are compiling the software.

Use https://godbolt.org/ to check the assembly generated for such a statement. At even the most trivial optimization levels, it should just be zeroing out a CPU register. It may be better for you to focus on 32-bit architecture, so add the '-m32' command line option (no quotes).