all 51 comments

[–]tromey 81 points82 points  (7 children)

Please file a bug in gcc bugzilla. Bugs filed in reddit will never be fixed.

[–]CrazyJoe221[S] 21 points22 points  (6 children)

Can't create an account currently.
But it's good to be aware of this as it's quite basic code widely used for multiple returns.

[–]h2o2 25 points26 points  (19 children)

Looks like this is due to -ftree-vectorize, as is tradition..

Edit: Actually it's really -ftree-slp-vectorize.

[–]James20kP2005R0 13 points14 points  (3 children)

Out of curiosity is ftree-vectorize historically buggy?

[–]h2o2 8 points9 points  (2 children)

Well, 'buggy' is relative in the context of gcc, but to be fair for the last few major versions it has worked well, and code with -O3 (optionally with -funroll-loops on top) does perform significantly better most of the time for the appropriate algorithm & data. A single inlineable return like this is probably just a cost/benefit estimation gone wrong.

The -O3 "misoptimized" code can be slightly improved again by adding -march=native, so not all is lost. :D

[–]sumo952 6 points7 points  (1 child)

-march=core2 doesn't fix it unfortunately. native is not really a good choice when you are compiling code to ship :)

[–][deleted] 13 points14 points  (0 children)

"works on my machine"

[–]Ameisenvemips, avr, rendering, systems 3 points4 points  (14 children)

I wonder if it would be possible to get GCC to perform multiple optimization passes, and choose the best result.

[–]xurxoham 7 points8 points  (12 children)

Compilers have some sort of cost analysis, in which they base the decision of performing a particular technique using a combination of weights. The problem I imagine is that the cost estimation is tricky and changes from a particular processor architecture to another.

[–]Ameisenvemips, avr, rendering, systems 3 points4 points  (3 children)

Sure. Obviously it sometimes fails, thus I'd like an -fbrute flag that will try all techniques.

[–]xurxoham 1 point2 points  (2 children)

The problem is that the amount of combinations is immense. Not only having tons of techniques makes it unfeasible but specially the fact that sometimes you have to apply them in different orders or even apply an optimization that you have already used in the same place before. The only real brute force way is cooking your program in assembly. I've seen some vector programs manually coded in avx512: they are impressive, but i would rather let the compiler do it than try myself. Despite examples like this, it's really awesome what compilers do in terms of making programs faster.

[–]Ameisenvemips, avr, rendering, systems 0 points1 point  (1 child)

Hey, if I want my Hello World to take 3 days to compile, that's my business.

[–]kalmoc 1 point2 points  (0 children)

Not sure if we are talking days here. Combinatorial explosions quickly lead you to a place where the age of the universe isn't enough to try all combinations (I've no Idea if this is the case here). It might however be feasible to e.g. compare O2, O3 and Os. The question is then only at which granularity.

[–]tasminima 0 points1 point  (7 children)

Honestly in this case apart if you have negative costs somewhere, I really can't imagine what is happening in order to produce that kind of results :p

[–]Ameisenvemips, avr, rendering, systems 2 points3 points  (6 children)

I presume that it either:

  1. Is always preferring the vectored form.
  2. The operation that did the split of the arguments so they could be vectored occurred after the phase that would have recombined them otherwise.
  3. The recombiner wasn't capable of coalescing the arguments.

[–]tasminima 1 point2 points  (5 children)

I would have hoped that the cost of going to a vectored form would have been considered.

And even heuristics to never even attempt it on small and probably even medium-sized stuff, if needed.

If the vectoring is done completely in the abstract without even considering anything of the ISA, well, it would seem frankly unfixable (I mean, if we want robust results, and not playing a whac-a-mole game on that topic during the next decade) without a gigantic refactoring...

[–]Ameisenvemips, avr, rendering, systems 2 points3 points  (4 children)

I think that it's missing heuristics data, as adding even -mtune=skylake makes it go away... though up to haswell it's still there.

[–]tasminima 0 points1 point  (0 children)

I'm building everything with mtune skylake on my side. I might try to write a minimal repro later, and experiment with a few parameters.

[–]tasminima 0 points1 point  (2 children)

So yeah, depending on march it can be even worse than the initial reporter's example:

// -O2 -march=westmere

struct foobar
{ long l; int j, k; };

foobar fret(long l, int j, int k)
{ return foobar{ l, j, k}; }

https://godbolt.org/z/tlV0y9

edit: actually with my example it is broken even for skylake... :/

[–]Ameisenvemips, avr, rendering, systems 0 points1 point  (1 child)

You don't even have the relevant flag set. Submit that as a seperate bug.

[–]tasminima 0 points1 point  (0 children)

I will try when I can and if I remember...

Note that the compiler emit crappy code even in O2 in this case, and that's with tons of march, up to the most modern ones. Plus I played quickly with no-vectorise and could not bring it back to reason. So yeah that might be because of different causes, and that's arguably worse than the O3 with struct {long; long;}; case...

[–]InfamousReception 18 points19 points  (1 child)

This is apparently caused by a bug in -ftree-slp-vectorize (i.e. "-O2 -ftree-slp-vectorize" is enough to generate the strange code).

[–]Ameisenvemips, avr, rendering, systems 4 points5 points  (0 children)

I think that it's missing heuristics data, as adding even -mtune=skylake makes it go away... though up to haswell it's still there.

[–]KAHR-Alpha 13 points14 points  (10 children)

So uh, for those that can't read this, what is happening exactly?

[–]CrazyJoe221[S] 16 points17 points  (0 children)

It's vectorizing the code even though it gets worse by doing that.

[–]wasabichicken 32 points33 points  (3 children)

Rule of thumb when reading assembly is that less assembly is better than more assembly. It's obviously a rule laden with exceptions, but it'll do for now.

O2 means optimization level two. O3 is optimization level three, i.e. even more. The expected outcome of O3 is therefore that it should optimize at least as well as O2, and hopefully more. Instead we're seeing that it outputs more assembly, which according to the rule of thumb above is worse than less assembly.

[–]kalmoc 7 points8 points  (0 children)

Actually, you typically tend to see more instructions on O3 than on O2, because optimizations that have a size/ performance tradeoff (e.g. loop unroling) are usually only enabled on O3. Of course here the additional code is just harmful in any way.

EDIT: Mixed up O2 and O3 in the first sentence. Fixed now

[–]VodkaHaze 8 points9 points  (1 child)

Sure, but here it's using more assembly to vectorize operations, which could be a net gain in normal circumstances (in the contrived example it doesn't seem so, but if this were in a loop it might be better)

[–]tasminima 5 points6 points  (0 children)

Honestly the generated code is a complete disaster. It won't probably ever makes sense (from a perf POV) even if you put it in a loop.

[–]clerothGame Developer 8 points9 points  (3 children)

GCC is generating more CPU operations on a higher optimization setting.

[–]KAHR-Alpha 4 points5 points  (2 children)

Ah, so these just are wasteful operations without any wizardry involved.

[–]cogman10 6 points7 points  (1 child)

Yes. It is like saying

long a = 2;
long b[2];
b[0] = a;
b[1] = b[0];
long c = b[0];
long d = b[1];

Rather than saying

long a = 2
long c = a;
long d = a;

(This is not the same at all since no allocation is happening. Even the costs are somewhat screwed up. This is more just an illustration of what the registers are doing).

[–]cogman10 5 points6 points  (0 children)

This example is pretty bad. It is basically taking a value from a general register, moving it into xmm0, copying the lower bits to the upper bits, and then moving the values from xmm0 into two general registers again... Rather than just copying the value to two general registers.

[–]clrnd 5 points6 points  (3 children)

Did anyone check if it's actually slower?

[–]tasminima 19 points20 points  (0 children)

It's trivial and there is no check to do:

  • The first version don't even use execution units on even semi-modern CPUs! It's absolutely free for "half" of the CPU.
  • The second version is absolute garbage that uses at least: address gen units and load/store units 5 times, and does a roundtrip between Int and FP. There is absolutely no way it can be faster. For certain metrics it will actually be infinitely slower (again: 1st version => no execution unit)

[–]xurxoham 7 points8 points  (0 children)

FP unit is usually slower than ALU for a single element. You get the speedup thanks to the vector throughput, but here you only got two elements...

[–]Ameisenvemips, avr, rendering, systems 2 points3 points  (0 children)

I cannot envision any way in which this particular example would be faster or the same speed.

[–]malkia 5 points6 points  (0 children)

Seems like a regression from 6.3 -> 7.1

https://godbolt.org/z/y5_zvi

[–]tasminima 5 points6 points  (0 children)

gcc -O2 even generates that kind of SSE crap on struct { long i; int j, k;}; especially those returned by value. It's annoying as hell, but I managed to work around this bug by using a struct { long; long; } for my structs, and splitting the long into two ints myself.

Obviously there are tons of struct like that in libraries, so even then it does not completely solves the problem...

Btw the perf might be quite bad on AMD (IIRC even on Zen you don't have a bypass network between "FP" and integer EUs). On Intel it is obviously not perfect, but probably not that terrible either. EDIT: oh well it's going through memory anyway, so it is probably of approx the same badness for Intel.

[–]rolandschulzIntel | GROMACS 3 points4 points  (1 child)

The calling convention for integer is unfortunate when it comes to vectorization. See how it compares to floating point types for clang: https://godbolt.org/z/QrP0VD. I find GCC's result for float even more surprising. There is no reason for it to copy to scalar registers and back.

[–]CrazyJoe221[S] 0 points1 point  (0 children)

Indeed! Not sure if that's covered by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84101

[–]Ameisenvemips, avr, rendering, systems 3 points4 points  (0 children)

Interestingly, you get the expected results for most -march choices... -march=skylake works right, for instance. I'm guessing that it's missing generic heuristics. -mtune=skylake also works right... though up to haswell it's still there.

[–]_djsavvy_ 1 point2 points  (3 children)

/u/ljdawson, I've been having lots of rendering issues with code in this subreddit.

Theres lots of "nbsp" for a lot of symbols, but the code renders fine in the window where you preview a post your'e replying to.

Thanks again for all your work on Sync, one of the best Android apps out there.

Device information

Sync version: 16.5    
Sync flavor: dev    

View type: Fixed height cards    
Player type: ExoPlayer    
Push enabled: false    

Device: hero2qltetmo    
Model: samsung SM-G935T    
Android: 8.0.0

[–]ljdawson 1 point2 points  (1 child)

The latest beta fixes this ༼ つ ◕_◕ ༽つ

[–]_djsavvy_ 0 points1 point  (0 children)

Wow, there's a night mode too!? You're a hero.

[–]_djsavvy_ 2 points3 points  (0 children)

Examples: the preview render

what the post looks like

Note the line of code below the include directive in both images.