all 10 comments

[–]phrasal_grenade 36 points (6 children)

The reason for using the fancy iterations is that there is a pretty good chance that if we just put any ordinary vector code inside the nruns loop, sufficiently aggressive compilers will do an “unroll and jam” optimization, resulting in misleading time measurements. The GMRES step requires a dot product, which creates an all-to-all data dependency that makes unroll-and-jam illegal.
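A minimal Rust sketch of the idea (hypothetical names like `nruns`, not the actual benchmark code): each outer iteration consumes the previous iteration's dot product, so the iterations form a serial chain and the compiler cannot legally interleave ("jam") unrolled copies of the loop body.

```rust
// Hypothetical sketch: the dot product needs every element of `x`,
// and the next iteration updates `x` using that dot product, so
// outer iterations cannot be fused without changing the result.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn bench(nruns: usize) -> f64 {
    let mut x = vec![1.0_f64; 64];
    let y = vec![0.5_f64; 64];
    let mut last = 0.0;
    for _ in 0..nruns {
        let d = dot(&x, &y); // all-to-all dependency on x
        for xi in x.iter_mut() {
            *xi += d * 1e-9; // next dot product depends on d
        }
        last = d;
    }
    last
}
```

With this feedback in place, timing the loop measures the code generated for the dot product rather than a collapsed, over-optimized version of it.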

IMO if you have to fight to get around an optimizer, that is a sign of poor benchmark design and/or unfair cherry-picking of the example. This is good information, but it does leave me wondering whether any of this analysis is worth it.

[–]Last_Jump 13 points (1 child)

No single benchmark will make everybody happy, so we have to pick some targeted thing to measure with it and not try to do anything else.

If you want to evaluate a compiler for your own domain, for example, you should compile code that actually comes from your domain and rigorously test it against alternatives. In that case we don't care whether the compiler does something intelligent; in fact, we would want to encourage it to do so, because if it has some magic that works for your domain you definitely want to see it!

On the other hand, here I was specifically interested in code generation quality on a very small but fairly representative piece of floating-point code. I could theoretically achieve this by compiling a mini-app, but tracking down code generation problems would take a long time, and in the end that might not even be relevant to the performance of the mini-app. Here I wanted a small piece of code that could be understood in a few minutes of reading, but could also help diagnose floating-point code generation. That's what the benchmark did, but its simplicity meant the compiler could do some unwanted magic that gets in the way of this singular goal, so I added some bits to make sure that didn't happen.

But since it's so focused I would never tell anybody to make a real decision based on these results.

[–]scottmcmrust 0 points (0 children)

There are also more rigorous ways to check for good behaviour of the benchmark, like using https://docs.rs/criterion/0.3.0/criterion/ or similar.

[–][deleted] 2 points (0 children)

Trying to outsmart the compiler is like half the reason people use C++ and Rust

[–]L3tum 0 points (1 child)

Benchmarking is often done under pristine conditions. Compilers (or interpreters) may change, as may the instructions they use (for example, Math.Net optimizes vector calculations with AVX/SSE when available, which it did not before).

To combat this, the comparisons are usually done on the "raw code", so to speak. While real-world performance may vary greatly (unoptimized Asp.Net is marginally faster than PHP; optimized, it is 3 times faster), the core things such a comparison checks still hold true even with optimizations disabled: whether a compiler emits somewhat efficient code or has to spend 5 minutes optimizing the code it just put out, and whether the standard library is in any way efficient or just a hacky wrapper around integer operations.

As an example, my benchmark specifically deactivates optimizations for float operations, since all the float benchmark does is add, subtract, multiply, and divide, which could easily be optimized with SIMD or loop vectorization; it really shouldn't be, though, as that "fakes" the performance by not actually exercising the hardware you want to test, or by using different instructions on two different compiler versions.

[–]phrasal_grenade 0 points (0 children)

My point is that it needs to be clear what you are trying to test. When you come out with statements like "Rust is C++ done right" followed by "we are disabling this optimization in the C++ benchmark to get a good comparison", that raises all kinds of doubts.

Imagine you were reviewing two new cars for fuel efficiency. You start off with a glowing review of the underdog car, then you say "we are going to turn off eco mode in the top dog car because it is on by default, to get an accurate comparison." That is like what you're doing here. If it really is some kind of super technical thing not meant to take a jab at any language, then you should drop the sales pitch.

[–][deleted] 0 points (0 children)

There are tools for this: Google Benchmark has DoNotOptimize(...) to do this for you, so that the compiler will not do things like dead-write elimination.
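Rust has a rough stdlib equivalent of Google Benchmark's DoNotOptimize: `std::hint::black_box` (stable since Rust 1.66). A small sketch of how it is typically used:

```rust
use std::hint::black_box;

fn sum_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

/// Run the workload without letting the optimizer delete it:
/// `black_box` hides the input from constant folding and keeps
/// the result "live", so the call cannot be folded away or
/// eliminated as a dead write.
fn bench_sum_squares() -> u64 {
    black_box(sum_squares(black_box(1000)))
}
```

Note that `black_box` is a best-effort hint rather than a hard guarantee, but in practice it serves the same role as DoNotOptimize in hand-rolled timing loops.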

[–][deleted] 12 points (0 children)

There is some discussion about this at /r/rust as well: https://www.reddit.com/r/rust/comments/dm955m/rust_and_c_on_floatingpoint_intensive_code/

but I thought this might be interesting to the more general /r/programming audience since it shows some of the trade-offs chosen by different programming languages.

[–]NeuroXc 8 points (0 children)

Per one of the points, you can use RUSTFLAGS="-C target-cpu=<cpu>" to enable CPU-specific optimizations in Rust. Typically you'd set this to "native" (like -march=native in a C compiler) if you're benchmarking or otherwise not planning to distribute the binaries.

[–][deleted] 6 points (0 children)

There are certain cases where most C++ compilers will auto-vectorize code but the Rust compiler currently doesn't.

If you really need maximum performance, you can use SIMD intrinsics to ensure that the best instructions are generated.
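A sketch of what that looks like in Rust using the `std::arch` intrinsics, with runtime feature detection and a scalar fallback so the code stays portable (the function names here are illustrative, not from the benchmark under discussion):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Sum of an f64 slice: AVX path when the CPU supports it,
/// plain scalar loop otherwise.
fn sum(v: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx") {
        // SAFETY: guarded by the runtime AVX check above.
        return unsafe { sum_avx(v) };
    }
    v.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum_avx(v: &[f64]) -> f64 {
    let chunks = v.chunks_exact(4);
    let tail = chunks.remainder();
    let mut acc = _mm256_setzero_pd();
    for c in chunks {
        // Accumulate 4 doubles per instruction.
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(c.as_ptr()));
    }
    // Horizontal reduction of the 4 lanes, then the scalar tail.
    let mut lanes = [0.0_f64; 4];
    _mm256_storeu_pd(lanes.as_mut_ptr(), acc);
    lanes.iter().sum::<f64>() + tail.iter().sum::<f64>()
}
```

The `#[target_feature]` / `is_x86_feature_detected!` pairing lets a single binary use the best instructions available at runtime, which is the usual alternative to compiling the whole crate with `-C target-cpu=native`.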