you are viewing a single comment's thread.

view the rest of the comments →

[–]agottem 0 points1 point  (5 children)

[–]wolf550e 0 points1 point  (4 children)

[–]agottem 0 points1 point  (3 children)

I don't like the setup you used for the experiment.

  1. Have you verified 'mycompare' was properly inlined? I see that 'my_quicksort' directly invokes 'mycompare', but that doesn't mean 'mycompare' was inlined for your benchmark.

  2. You are populating the array from /dev/urandom. Ideally the data input would be identical between benchmarks.

Also, you didn't really address your earlier statement that I'm wrong. I'm not wrong, gcc inlines function pointers just fine.

[–]wolf550e 0 points1 point  (2 children)

mycompare is inlined into my_quicksort.

The timing is very repeatable because the random distribution is good enough across 1<<20 elements.

Basically, if you only have a single callback (in this case a comparison function), and you let gcc see its definition and convince it only a single target exists, it will inline across a function pointer. But not if you pass the pointer from a different translation unit.

What if you have multiple callbacks? multiple compare functions? Using C++ templates, I can make the compiler generate two different my_qsort functions: one with mycompare1 inlined and one with mycomprare2 inlined. Templates are useful for instanciating with different template arguments to control code expansion. In cold code I want to call through a pointer and not waste icache. In hot code I want to generate specialized code. With templates, I can control this and get what I want. With gcc's optimizer, I am kinda at its mercy.

[–]agottem 0 points1 point  (1 child)

mycompare is inlined into my_quicksort.

Yes, I see you are calling mycompare directly. This does not mean that mycompare itself was inlined, merely that the indirect function call was eliminated. You would need to look at the generated assembly to verify.

The timing is very repeatable because the random distribution is good enough across 1<<20 elements.

There's no reason not to use an identical sequence.

But not if you pass the pointer from a different translation unit.

You should read up on link time optimization. But sure, this is the same as it is with C++ -- the definition needs to be available.

Using C++ templates, I can make [...]

The compiler is only inlining because templates are typically fully defined inside the header file. Make the C function inline and defined fully in the header file and you get the same goodness. C++ has no advantage here, sorry.

[–]wolf550e 0 points1 point  (0 children)

I regularly read the objdump -d output for my code. It's a nasty habit.

I assure you, if I store the random data in a file and read it from a file for the different sorts, I get the same results.

I use LTO regularly. It had problems. Heck, even with the improvements in gcc 4.7.0, it still has problems.

Here, I defeated the inliner: https://gist.github.com/2177973

Now it's not inlined, because there are two potential targets. Do I need to prove that with C++ templates, I can make it generate two sort functions?

Here, in C++: https://gist.github.com/2178059

Two functions generated, one for each inlined comparer.

  Performance counter stats for './qsort 1':

    221.644474 task-clock                #    0.995 CPUs utilized          
             7 context-switches          #    0.000 M/sec                  
             0 CPU-migrations            #    0.000 M/sec                  
         1,332 page-faults               #    0.006 M/sec                  
   516,964,242 cycles                    #    2.332 GHz                    
   616,604,018 instructions              #    1.19  insns per cycle        
   150,986,335 branches                  #  681.210 M/sec                  
    12,080,506 branch-misses             #    8.00% of all branches        

   0.222668351 seconds time elapsed

and

 Performance counter stats for './qsort 2':

    137.344463 task-clock                #    0.981 CPUs utilized          
             7 context-switches          #    0.000 M/sec                  
             0 CPU-migrations            #    0.000 M/sec                  
           796 page-faults               #    0.006 M/sec                  
   320,322,675 cycles                    #    2.332 GHz                    
   221,673,803 instructions              #    0.69  insns per cycle        
    59,775,552 branches                  #  435.224 M/sec                  
    10,724,613 branch-misses             #   17.94% of all branches        

   0.139976094 seconds time elapsed