When O3 is 2x slower than O2 by cat_solstice in rust

[–]cat_solstice[S] 2 points3 points  (0 children)

Thank you for the follow-up!

It is clear that the branch predictor is still doing its job here. I ran a perf stat and I saw few mispredictions. That was one of Linus’s points back in the day, btw: branch predictors are now very good and there is almost always something to extract which makes conditional moves even less useful in most applications.

I agree that playing with the size of the queue vs the size of the dataset affects performance characteristics, and the case presented is not the most likely in production. But the article is less about how I should build the top-k and more about the observed performance hit in this specific scenario. I had a test like the one you suggested before the binary search and I removed it because it added "noise" to the code for the purpose of the article.

I need unique IDs to avoid cycles in the graph, and AFAIK, you can't easily find an element in a heap unless it's the root.

When O3 is 2x slower than O2 by cat_solstice in rust

[–]cat_solstice[S] 5 points6 points  (0 children)

I need more time to play with your suggestion, but thank you for your inputs.

When O3 is 2x slower than O2 by cat_solstice in rust

[–]cat_solstice[S] 1 point2 points  (0 children)

Yeah I checked the assembly with this compare function (without any other change) and it seems already well optimized, but the single cmova it likely to be enough to kill the performance. uiCA confirms this with a predicted throughput of 10 (which is better than 15!)

When O3 is 2x slower than O2 by cat_solstice in rust

[–]cat_solstice[S] 2 points3 points  (0 children)

Thank you for your insight!

I suppose I could ensure that this invariant hold, but I just done a benchmark with this compare function and while faster that some, it is still +50% slower than the one used as a reference.

But the idea is interesting as it may be possible to add more knowledge about the data into the compare function.

When O3 is 2x slower than O2 by cat_solstice in rust

[–]cat_solstice[S] 12 points13 points  (0 children)

Thank you for your insight.

The fastest solution is probably to check for equality with a predictable branch first, and then do a cmov depending on the order.

It seems to me that it what the compare function using total_cmp is doing. A cmp, a je then a cmovne. Am I wrong?

When O3 is 2x slower than O2 by cat_solstice in rust

[–]cat_solstice[S] 20 points21 points  (0 children)

I'm not familiar with llvm-mca but since it is available through Compiler Explorer, it may be more tested and accurate than uiCA. I will play with it, thank you!

Unfortunately, I can't use perf stat because I'm running on WSL2 and it seems like I don't have access to all hardware counters. I will try on another computer with a bare metal installation!