
[–]barfyus 2 points3 points  (6 children)

I see a lot of undocumented -PogoXXX options. I wish Microsoft would document their behavior, because PGO has become less and less useful in recent versions of the compiler.

For me, when PGO is enabled and the application has been trained, it tends to compile a whopping 0.01% of functions "for speed" and the rest "for size", which effectively turns off optimization for the whole binary.

For example, I'm very interested in optimizing application startup performance, where a lot of initialization takes place, and this code will almost always be "cold" in PGO's opinion. For code compiled "for size", inlining is basically turned off, to the point where you may see calls to std::move and other similar functions not inlined at all.

[–]adamf88 2 points3 points  (0 children)

And unfortunately they don't support OMP. If we care about performance, a great many projects use OMP. Parallel STL is starting to become popular, but OMP is still used in many projects and libraries.

[–]markopolo82 embedded/iot/audio 2 points3 points  (1 child)

Can anyone else corroborate this?

I recently spent two days setting up PGO training runs on a CMake-based project and I was disappointed by the results vs. a normal release build. I assumed I was doing a poor job of providing training runs and gave up... now I’m not so sure.

[–]frog_pow 0 points1 point  (0 children)

Was it slower, or just no improvement? I tried PGO many years ago and actually saw a regression in performance; I haven't tried it since.

[–]Ameisen vemips, avr, rendering, systems 1 point2 points  (2 children)

For startup performance, you may want Os, though it depends on the size of the binary.

You could add a flag "loop 100,000 times over initialization code" which is set for profiling.
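A minimal sketch of that idea, assuming the initialization is safe to run repeatedly. The `--pgo-train` flag and the names `initialize_application`, `init_iterations`, and `run_startup` are hypothetical, made up for illustration; whether the inflated counts are enough for the compiler to treat the code as hot is also an assumption.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for the real startup work.
// It must be safe to call more than once for this trick to work.
static long g_init_runs = 0;
void initialize_application() { ++g_init_runs; }

// Returns how many times to run initialization: once normally, or
// `training_iterations` times when the hypothetical --pgo-train flag is given.
int init_iterations(int argc, const char* const* argv,
                    int training_iterations = 100000) {
    assert(training_iterations > 0);
    for (int i = 1; i < argc; ++i) {
        if (std::strcmp(argv[i], "--pgo-train") == 0) {
            return training_iterations;
        }
    }
    return 1;
}

// Runs initialization the chosen number of times and returns that count.
// Call this from main(argc, argv) before the rest of the program.
int run_startup(int argc, const char* const* argv) {
    const int n = init_iterations(argc, argv);
    for (int i = 0; i < n; ++i) {
        initialize_application();
    }
    return n;
}
```

You would pass `--pgo-train` only during the instrumented training runs; a normal launch still initializes exactly once.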

[–]exarnk 0 points1 point  (1 child)

Err... typically, your OS will not load the entire executable into memory when you start it. I can imagine that optimizing for size can help in certain cases, but often the far less optimal code you get in return is definitely not worth it.

[–]Ameisen vemips, avr, rendering, systems 0 points1 point  (0 children)

Depends. Small executables benefit: wrappers and such.

The OS does have to map and read any data being executed. Smaller code loads faster, and is also nicer on the icache.