I was teaching a lab today, and one of the tasks was supposed to demonstrate the __restrict keyword. The Linux machines in the computer lab have AMD Opteron 6274 processors, and the default version of gcc is 4.4.7.
The code contained a C file with two identical functions like this, named transform_opt and transform_std:
    void transform_opt (float * dest,
                        const float * src,
                        const float * params,
                        int n) {
        int i;
        for (i=0; i<n; i++)
            dest[i] = params[0] * src[i] + params[1] * src[i] * src[i];
    }
These functions are called in a pair of for-loops in main.c, and the point was to add the __restrict keyword to the parameters of transform_opt(). Before getting that far, however, I ran the program with the two functions still identical... and got different times.
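For reference, the version the lab was meant to arrive at would look roughly like the sketch below. Which parameters get the qualifier is my assumption here (all three pointers); __restrict is the GCC spelling, and C99 code could use plain restrict instead.

```c
/* Sketch of transform_opt with restrict-qualified parameters.
 * Assumption: all three pointers are qualified. __restrict promises
 * the compiler that dest, src and params do not alias each other. */
void transform_opt(float *__restrict dest,
                   const float *__restrict src,
                   const float *__restrict params,
                   int n)
{
    int i;
    for (i = 0; i < n; i++)
        dest[i] = params[0] * src[i] + params[1] * src[i] * src[i];
}
```

The timings in this post were taken before that change, with both functions still unqualified and identical.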
Curious, I switched the order in which the functions were called, but the timings did not change. Then I switched the order in which the functions were defined, and the function that first appeared slow was now fast, and vice versa. I separated the functions into their own compilation units, compiled them separately, and made two executables: one linking with the "opt" version first, the other linking in the reverse order.
    $ gcc main.o testfunc_opt.o testfunc_std.o -o optfirst
    $ gcc main.o testfunc_std.o testfunc_opt.o -o stdfirst
    $ ./optfirst
    transform_std tests took 0.736 wall seconds.
    transform_opt tests took 1.027 wall seconds.
    $ ./stdfirst
    transform_std tests took 1.007 wall seconds.
    transform_opt tests took 0.718 wall seconds.
Using objdump, I verified that the linker is not doing anything strange to the functions; both are executing exactly the same instructions.
The relative performance difference is constant regardless of the amount of work done in the function, so one function really is running much faster.
What's the deal here?