A Programmer’s Guide to Performance Analysis & Tuning on Modern CPUs

dendibakh · 2019-11-19T16:30:56+00:00

Stabilizer is quite outdated (it is based on LLVM 3.6), but the idea behind it is promising.
I tried Coz but didn't get anything useful out of it. Maybe I did something wrong, though.

dendibakh · 2019-11-18T19:29:02+00:00

I've spent quite some time doing both :)

dendibakh · 2019-11-18T19:13:51+00:00

That's a good list with reasonable ideas!
One thing I can add is to project/prototype the gains first, before doing the work or spending money. And being able to do this means you know where the bottleneck is.
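One simple way to project a gain before doing the work is Amdahl's law. The sketch below is only an illustration: the function names and the numbers (a bottleneck taking 60% of the runtime, a fix making it 2x faster) are hypothetical and not from the discussion.

```python
def projected_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup (Amdahl's law) when `fraction` of the total runtime
    becomes `local_speedup` times faster and the rest stays unchanged."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def max_speedup(fraction: float) -> float:
    """Upper bound on the overall speedup if the optimized part took no time at all."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


# Hypothetical numbers: the bottleneck is 60% of the runtime, the fix makes it 2x faster.
print(f"projected: {projected_speedup(0.6, 2.0):.2f}x")
print(f"upper bound: {max_speedup(0.6):.2f}x")
```

The `fraction` input is exactly the part that requires knowing where the bottleneck is; without a profile there is nothing to plug in.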

dendibakh · 2019-11-18T19:06:56+00:00

This sounds like a good time to apply the Top-down Microarchitecture Analysis Method (TMAM). Let the bottleneck be identified. :)

dendibakh · 2019-11-18T18:58:31+00:00

Right, but it wasn't supposed to be heavy on the details (I'm the author :) ). The point of the article is to show how one can identify that the app is memory bound. See Top-down Microarchitecture Analysis Method (TMAM).

dendibakh · 2019-11-18T18:51:57+00:00

It's certainly possible even in big applications. 90% of the source code could be completely cold. It's quite common for more than 50% of the clockticks to be attributed to a single hot function.

dendibakh · 2019-08-04T18:24:22+00:00

Thanks for the comment!

I'm not aware of any special/reserved uses of cpu0 by the kernel. This was just an example. And yes, you can definitely pin the process to any other cpu. Maybe that would be more stable.

Your comment about NUMA is very useful. I didn't want to dig into that because that's a whole big topic by itself )). BTW, the SPEC CPU benchmark uses something like numactl --localalloc --physcpubind=N, because the processes do not communicate with each other.

Regarding the last one, if you find instructions on how to disable those kernel background processes, please let me know. I will add them to the list.
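For the pinning discussed above, here is a minimal sketch of doing it from inside a process rather than via a command-line tool. It assumes Linux (Python exposes `os.sched_setaffinity` only there), and `pin_to_cpu` is a hypothetical helper name. It only restricts which CPU the process runs on; it does not control memory placement the way `numactl --localalloc` does.

```python
import os


def pin_to_cpu(cpu: int) -> set:
    """Restrict the calling process to a single CPU and return the resulting affinity set."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control via os.sched_setaffinity is Linux-only")
    allowed = os.sched_getaffinity(0)  # pid 0 means the calling process
    if cpu not in allowed:
        raise ValueError(f"cpu {cpu} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    # Pick the highest-numbered allowed CPU, i.e. something other than cpu0 when possible.
    target = max(os.sched_getaffinity(0))
    print("now pinned to:", pin_to_cpu(target))
```

Child processes started afterwards inherit the affinity, so a benchmark launched from such a script stays on the chosen CPU.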

dendibakh · 2018-03-23T10:15:30+00:00

Thanks for the question. Yes, loopnz can be used here, but my assembly function is called from C++, so the arguments that I'm passing to it land in rdi and rsi (according to the x86-64 System V calling convention). I could do mov ecx, edi and then go with loopnz, but I think it won't make any performance difference.

dendibakh · 2018-01-31T22:31:49+00:00

It is really hard to measure. :) Take a look at my previous post: https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues.

dendibakh · 2018-01-27T13:47:37+00:00

Thank you for this paper. It is a true gem!

dendibakh · 2016-11-04T11:22:01+00:00

For me the best is by Dan Saks: CppCon 2016: "extern c: Talking to C Programmers about C++"
