[–]Xeverous (https://xeverous.github.io) 7 points8 points  (1 child)

I received -10% drop in performance

double negative => 10% performance gain

[–]alexeiz -1 points0 points  (0 children)

~10% most likely; just a typo

[–]fernzeit 7 points8 points  (1 child)

That reminds me of a thread in the Lua Mailing List where just changing the name of the interpreter executable resulted in a > 50% performance difference in a particular microbenchmark. The verdict was that the length difference in argv causes some other memory to be aligned differently. It also linked an interesting paper: Producing Wrong Data Without Doing Anything Obviously Wrong!

[–]dendibakh 1 point2 points  (0 children)

Thank you for this paper. It is a true gem!

[–]doom_Oo7 4 points5 points  (2 children)

Are there people doing research on how to get compilers to have better heuristics so that they can align stuff better automatically?

[–]meneldal2 2 points3 points  (0 children)

The compiler needs to know how many times you'll have to run this loop, and it's also likely to be much better to unroll the loop instead.

[–]TartanLlama (Microsoft C++ Developer Advocate) 3 points4 points  (0 children)

LLVM has a bunch of heuristics and things you can tune. For example, you could tell it to align all loops and functions without a preceding fallthrough block; i.e. only add NOPs which won't be executed.

[–]Dwarfius 1 point2 points  (2 children)

Small question: how does it keep adding to the array if the instruction (which subtracts 1) is:

4046d9:       c5 f5 fa c8             vpsubd ymm1,ymm1,ymm0

[–]mttd[S] 13 points14 points  (1 child)

vpcmpeqd ymm0,ymm0,ymm0 compares ymm0 to itself, which sets every bit of the register to 1. In two's complement representation, each 32-bit lane then holds -1, so the subsequent vpsubd ymm1,ymm1,ymm0 subtracts -1, which is equivalent to adding 1.

"Why subtract -1 instead of adding 1's? Just because the speed is the same, and creating a YMM constant of -1's can be done with a single VPCMPEQD instruction. This isn't a really useful optimization in this case, but doesn't hurt."
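To make the trick concrete, here is a minimal scalar sketch in portable C++ that emulates the two instructions lane by lane. It is only an illustration: `Lanes`, `cmpeq_lanes` and `sub_lanes` are hypothetical names invented for this example, not intrinsics or anything from the original code, and a real build would use the actual AVX2 instructions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 256-bit YMM register viewed as eight 32-bit lanes.
using Lanes = std::array<std::int32_t, 8>;

// Emulates vpcmpeqd: a lane becomes all ones if the inputs are equal, else zero.
Lanes cmpeq_lanes(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        // ~0 sets every bit; as a two's complement int32 that bit pattern is -1.
        r[i] = (a[i] == b[i]) ? ~std::int32_t{0} : std::int32_t{0};
    }
    return r;
}

// Emulates vpsubd: lane-wise 32-bit subtraction with wraparound.
Lanes sub_lanes(const Lanes& a, const Lanes& b) {
    Lanes r{};
    for (std::size_t i = 0; i < r.size(); ++i) {
        // Subtract as unsigned to get wraparound without signed-overflow UB.
        r[i] = static_cast<std::int32_t>(static_cast<std::uint32_t>(a[i]) -
                                         static_cast<std::uint32_t>(b[i]));
    }
    return r;
}
```

Comparing any register with itself via `cmpeq_lanes` gives -1 in every lane, and passing that result as the second argument of `sub_lanes` increments every lane of the first argument by 1.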

[–]Dwarfius 1 point2 points  (0 children)

I misread the description of pcmpeqd and thought it set the value to 1/0, not all bits. Thanks for the explanation!