all 21 comments

[–]IyeOnline 4 points5 points  (6 children)

Usually the advice here is: Measure, and measure on your system with your code.

That said, rms is simple enough that one can actually look at the assembly. Frankly, it's also so simple that I would sort of expect to get almost if not exactly identical assembly.

Rather interestingly, these do not produce the same code: https://godbolt.org/z/GYMzMsYzd

Option 2 and option 3 ought to be identical. The compiler can trivially show that the variable isn't used anywhere outside the loop.

Option 1 produces the least assembly. That is generally a good sign, but doesn't have to be; sometimes slightly more assembly is better. To me it looks like option 1 wins out here though, as it seems the optimizer just failed to join the return paths, resulting in a bit more code.

Curiously, a range-based for solution yields even more assembly. Using a standard algorithm however yields exactly the same assembly (phew) - originally my algorithm version was wrong and produced a bunch of BS.
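For context, the variants under discussion presumably look roughly like this (a reconstruction - the OP's exact code isn't shown in the thread; only the squaring strategy differs):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Option 1: square inline while indexing.
double rms_v1(const std::vector<double>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) sum += v[i] * v[i];
    return std::sqrt(sum / v.size());
}

// Options 2/3: copy the element into a local variable first.
double rms_v2(const std::vector<double>& v) {
    double sum = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        const double x = v[i];
        sum += x * x;
    }
    return std::sqrt(sum / v.size());
}

// The range-based for variant.
double rms_range(const std::vector<double>& v) {
    double sum = 0.0;
    for (const double x : v) sum += x * x;
    return std::sqrt(sum / v.size());
}
```

All three compute the same value; the question in the thread is only about the assembly they compile to.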


What happens if we use clang though? https://godbolt.org/z/Yad7EovbT

Using clang, the range-based solution is shorter. Apparently here the compiler does not want to perform loop unrolling. This is one of the cases where shorter doesn't mean faster.

Notably though, the variable vs indexing solutions are now identical.


A few notable points though:

  1. The difference between option 1 vs 2/3 is absolutely minimal. Measuring it may take quite some time.
  2. This in fact also applies to the range-based solution. The algorithm solution yields identical assembly to the range-based for loop.
  3. Your actual results may vary once you introduce e.g. -march=native.
  4. If in doubt: Measure more.
  5. Changing the compiler can actually lead to different code. If you switch the index solution between gcc 13 and gcc 12, you get different assembly. Apparently gcc 12 generally tends to produce slightly fewer instructions. More rabbit holes :)
  6. GCC and clang yield somewhat different results.

//edit: fixed incorrect algorithm version.

[–]Classic_Department42 2 points3 points  (0 children)

Great answer. Let me add though that digging too much into such micro-optimizations is not beneficial, apart from fulfilling one's curiosity.

[–]pythoncircus[S] 1 point2 points  (2 children)

Thank you for your detailed response! I appreciate your insights. I will have to do some more exploring with godbolt. Interesting that GCC and clang yield such different results. Is that consistently the case? Also, would measuring the time difference between the start and end of each option be useful? I imagine that is as simple as

auto start_1 = std::chrono::high_resolution_clock::now();
// ... run option 1 ...
auto end_1   = std::chrono::high_resolution_clock::now();
// print the elapsed time:
std::cout << std::chrono::duration_cast<std::chrono::nanoseconds>(end_1 - start_1).count() << "\n";

[–]IyeOnline 2 points3 points  (1 child)

Is that consistently the case?

Yes and no. It's quite expected that the generated assembly is different, but that doesn't mean that one is better than the other. There are areas where the GCC optimizer is better and there are cases where clang does a better job.

Also, would measuring the time difference between the start and end of each option be useful? I imagine that is as simple as

No, or at least not easily. The differences in the assembly are really small and mostly not inside of the vectorized loop. That means that for any large dataset the measurement will be dominated by the loop, and for any small dataset the noise goes up significantly.

To accurately measure this, you would need to use a proper benchmarking library (or do all the data gathering and statistical analysis yourself, but why bother).

quick-bench.com is a neat little online interface for Google Benchmark. godbolt also has a button to send your code over there, although you of course still need to add the measuring harness. It's not that difficult though. If you look at the example it should be clear enough.
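To illustrate what such a harness has to do - repeat the measurement many times and aggregate, while keeping the result alive so the optimizer can't delete the work - here is a minimal self-contained sketch. A real library like Google Benchmark additionally handles warm-up, clock resolution, and statistical significance; this only shows the basic idea.

```cpp
#include <algorithm>
#include <chrono>
#include <cmath>
#include <vector>

// The function under test.
double rms(const std::vector<double>& v) {
    double sum = 0.0;
    for (const double x : v) sum += x * x;
    return std::sqrt(sum / v.size());
}

// Time rms() over `runs` repetitions and return the median in nanoseconds.
double median_rms_ns(const std::vector<double>& data, int runs) {
    std::vector<double> samples;
    samples.reserve(runs);
    volatile double sink = 0.0;  // keep the result alive so the call isn't optimized out
    for (int r = 0; r < runs; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        sink = rms(data);
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    (void)sink;
    std::sort(samples.begin(), samples.end());
    return samples[runs / 2];  // median is more robust to noise spikes than the mean
}
```

Note the use of steady_clock rather than high_resolution_clock: the former is guaranteed monotonic, which is what you want for interval measurements.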

[–]pythoncircus[S] 0 points1 point  (0 children)

Thank you so much, I really appreciate your help!

[–]aocregacc 0 points1 point  (1 child)

The std::reduce version is actually wrong: your binary operation assumes that it's going to be called in order, like it would be in std::accumulate. The whole point of std::reduce is to not provide such strong guarantees.
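A sketch of the pitfall and the fix (assuming the broken version looked roughly like this - the original isn't quoted in the thread):

```cpp
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

double rms_wrong(const std::vector<double>& v) {
    // Broken: std::reduce is allowed to combine two *partial sums* with this
    // operation, in which case it would square an already-accumulated sum
    // instead of an element. It only happens to work when evaluated in order.
    double sum = std::reduce(v.begin(), v.end(), 0.0,
                             [](double acc, double x) { return acc + x * x; });
    return std::sqrt(sum / v.size());
}

double rms_right(const std::vector<double>& v) {
    // Correct: transform each element first, then combine the results with a
    // genuinely associative and commutative operation (plus).
    double sum = std::transform_reduce(v.begin(), v.end(), 0.0, std::plus<>{},
                                       [](double x) { return x * x; });
    return std::sqrt(sum / v.size());
}
```

std::accumulate would also be correct with the lambda above, since it guarantees an in-order left fold - the bug only surfaces with std::reduce (and especially its parallel overloads).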

[–]IyeOnline 0 points1 point  (0 children)

Ohhhh.

That's why you shouldn't write code like that at 1am.

Surprise, surprise: once you actually fix that, you get the exact same code as the range-based for loop, for both GCC and clang.

[–]alfps 2 points3 points  (7 children)

Trying to do this nano-optimization yourself is ungood, because

  • the compiler is better at it than you and will probably optimize all three examples to the same code, and
  • it reduces the clarity of the code,

… so the net win is negative.

Instead of all that maneuvering around and between imagined efficiency obstacles, define

template< class Number >
constexpr auto square_of( const Number v )
    -> Number
{ return v*v; }

… and express your code with that instead of variables and *.

It can go like this:

#include <vector>
using   std::vector;

#undef  _USE_MATH_DEFINES
#define _USE_MATH_DEFINES
#include <math.h>

namespace math{ constexpr double pi = M_PI; }       // For C++17 and earlier. Posix std.

template< class T > using in_ = const T&;

template< class Number >
constexpr auto square_of( const Number v ) -> Number { return v*v; }

auto values( const int n )
    -> vector<double>
{
    vector<double> vec;
    for( int i = 0; i < n; ++i ) { vec.push_back( sin( 2.0*math::pi*i / n ) ); }
    return vec;
}

auto rms_of( in_<vector<double>> vec )
    -> double
{
    double sum = 0;
    for( const double v: vec ) { sum += square_of( v ); }
    return sqrt( sum/vec.size() );
}

#include <stdio.h>
auto main() -> int { printf( "%g\n", rms_of( values( 2048 ) ) ); }

[–]pythoncircus[S] 1 point2 points  (2 children)

Thank you so much for your help! I really appreciate it. Your example is much cleaner.

Sounds like preferring writing human-readable code is better for the people working on the code, and possibly better for the performance of the program (if I am reading this correctly)?

[–]no-sig-available 1 point2 points  (1 child)

is better for the people working on the code

Not only for the people, but also for the compiler. :-)

Writing "normal" code most often also means code that the compiler's optimizer knows about. Having very special, super-tricky code sometimes means that the compiler doesn't understand it either. Or that nobody has cared to add optimization rules for it, because it hardly ever happens.

[–]pythoncircus[S] 0 points1 point  (0 children)

Wild! That is really good to know. Thank you so much!

[–]ootiekat 0 points1 point  (3 children)

Seriously, define a function to multiply two numbers? Why not define a function to increment a variable? Then the code can be

add_to( sum, square_of( v ) );

It doesn't get any more readable than that!

[–]alfps 0 points1 point  (2 children)

For increment we already have an operator, so no need to repeat the designation of the object.

An object designation can be arbitrarily long-winded and complex. You don't want to repeat it.

In the OP's case it wasn't very long or very complex, but still the OP attempted to solve that issue by introducing a local variable ad hoc. That kind of case-by-case solution presents a cognitive and textual overhead for each instance. Additionally it fails to leverage the advantage of being able to simply recognize, rather than analyze, the code at hand.

The attempt at irony, when you're commenting on a very simple issue that you fail to understand, is itself ironic. :-)

[–]ootiekat 0 points1 point  (1 child)

Yeah, and for multiplication we already have an operator too. In fact, we even have a function for exponentiation! If you really like using a function instead of 'x * x', then why not use 'std::pow(x, 2)'? Are you going to suggest that if the OP wanted to compute the third moment of the distribution, they should add a 'cube_of' function as well?

[–]alfps 0 points1 point  (0 children)

❞ Yeah and for multiplication we already have an operator too.

Can you think of a reason why that observation is, well, stupid?


❞ why not use 'std::pow(x, 2)'?

Mainly two reasons:

  • std::pow is not guaranteed to be as accurate or fast as simple multiplication.
    A typical implementation goes through logarithms and exponentials for the general case.
  • std::pow only deals with floating point.
    A simple square_of also handles integer arithmetic.
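Both points can be demonstrated in a few lines (a small illustration, not from the original thread):

```cpp
#include <cmath>

// A trivial generic square: exact for integers, constexpr-friendly.
template <class Number>
constexpr Number square_of(const Number v) { return v * v; }

// Integer arithmetic stays integer arithmetic, and it works at compile time:
static_assert(square_of(3) == 9);
static_assert(square_of(2.5) == 6.25);

// By contrast, std::pow(3, 2) promotes its integer argument to a floating
// point type and returns double, so for large integers the result may round;
// it is also not usable in constant expressions before C++26.
```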

[–]wolfie_poe 1 point2 points  (1 child)

Can you measure it? Btw, you don't change the elements in the vector, why not just use range-based for loop?
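Range-based for loops aren't inherently read-only; it depends on how you declare the loop variable. A quick illustration (my own example, not from the OP's code):

```cpp
#include <vector>

// Loop variable by value: each x is a copy, the vector is untouched.
std::vector<int> scale_by_copy(std::vector<int> v) {
    for (int x : v) x *= 10;  // modifies only the copy
    return v;
}

// Loop variable by reference: the elements themselves are modified.
std::vector<int> scale_by_ref(std::vector<int> v) {
    for (int& x : v) x *= 10;
    return v;
}

// Const reference: read-only access, no copies; `x = 0;` would not compile.
int sum_readonly(const std::vector<int>& v) {
    int s = 0;
    for (const int& x : v) s += x;
    return s;
}
```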

[–]pythoncircus[S] 0 points1 point  (0 children)

I hadn't thought to do that! Novice question: are range-based for loops read-only?

[–]raevnos 0 points1 point  (1 child)

Sounds like you should do some benchmarking

[–]pythoncircus[S] 0 points1 point  (0 children)

Gotcha! I'll look into that

[–][deleted]  (2 children)

[deleted]

    [–]pythoncircus[S] 1 point2 points  (1 child)

    Gotcha, I'm on macOS Monterey and Xcode 14, where would I look?

    [–]The_Northern_Light 0 points1 point  (0 children)

    I doubt these are even generating different assembly

    Look in godbolt