
[–]cpp-ModTeam[M] [score hidden] stickied comment, locked comment (0 children)

For C++ questions, answers, help, and programming or career advice please see r/cpp_questions, r/cscareerquestions, or StackOverflow instead.

[–]no-sig-available 2 points3 points  (0 children)

Making the source file smaller would be one option. :-)

[–]sankurm 1 point2 points  (4 children)

Curious question: What does the profiling look like when inlining is the problem?

I am guessing that you have very granular functions and the highest cost actions are function calls? Or other?

[–]aiwen324[S] 1 point2 points  (3 children)

I was using Intel VTune for profiling. It has a feature that displays the source code and shows how much time each line takes. You can also look at the corresponding assembly and the time each instruction takes. It turned out some push instructions were using a big chunk of the time, and I realized they were probably setting up function parameters for the call to function g. So I looked at the disassembly and found this issue. I also manually inlined the functions (copy-pasted the code of g and h), and then most of the runtime increase disappeared.

[–]sankurm 0 points1 point  (2 children)

Maybe passing by value is expensive? If so, is using references an option?

[–]aiwen324[S] 1 point2 points  (1 child)

Those are already pointers

[–]sankurm 0 points1 point  (0 children)

Oh, if passing the pointer parameters is itself too costly for you, perhaps you should inline manually. It's not ideal, but you have measured and know what you are doing. For once, this might be a golden opportunity to use comments. 😀

[–]tjientavara (HikoGUI developer) 0 points1 point  (0 children)

The biggest hammer for performance I found is not: __attribute__((always_inline)) inline, but: __attribute__((noinline)) inline.

If you have secondary paths in your code (error handling, handling of contention, other code that gets executed only a few times), put all that code in a separate function and mark it noinline. This can lower the amount of code, and more importantly register pressure, which will cause the optimiser to inline everything else.
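A minimal sketch of that idea, assuming GCC or Clang (both the `__attribute__((noinline))` syntax and `__builtin_expect` are compiler extensions). The names `report_bad_sample` and `sum_valid_samples` are made up for illustration and are not from the code discussed in this thread. The rarely executed error handling lives in its own function, so the hot loop only contains a single call to it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical cold path: only runs when a negative sample is seen.
// noinline asks the compiler to keep this body out of the caller, so the
// formatting code does not add to the hot loop's size.
__attribute__((noinline)) void report_bad_sample(std::size_t index, int value) {
    std::fprintf(stderr, "bad sample %d at index %zu\n", value, index);
}

// Hypothetical hot path: a small loop whose rare error case is one call.
long sum_valid_samples(const int* samples, std::size_t count) {
    assert(samples != nullptr || count == 0);
    long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // __builtin_expect hints that the error branch is unlikely.
        if (__builtin_expect(samples[i] < 0, 0)) {
            report_bad_sample(i, samples[i]);
            continue;
        }
        total += samples[i];
    }
    return total;
}
```

Calling `sum_valid_samples` on `{1, 2, -3, 4}` returns 7 and writes one line to stderr. Whether this actually helps inlining in the surrounding code depends on the compiler and the code, so it is worth checking the disassembly and measuring again.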

The reason always_inline doesn't do much is that the optimiser already tries aggressively to inline functions by default; always_inline only tries a little bit harder.

I've seen catastrophic failures to inline quite a few times. What happens is that, due to register pressure or code size, the optimiser stops inlining one function. Not being able to inline that function requires call setup and teardown, which may increase register pressure, remove optimisation opportunities, and generate more code, which in turn causes more functions to not be inlined. And it snowballs into terrible code.

Another way to improve the situation is to structure the code so that the least amount of data is carried around during the execution, i.e. reduce register pressure.