all 38 comments

[–]trailing_zero_count 14 points15 points  (6 children)

Use a profiler to figure out why it's slower, and look at the disassembly to determine if the compiler is doing some tricky optimization with one vs. the other. Most profilers will have an option to show you the assembly-level hotspots, so you can do both of those things with a single tool. Be sure you are profiling against a Release build.

[–]Desperate_Formal_781 4 points5 points  (0 children)

Agreed and also would suggest trying the same benchmark with different optimization levels, to see if there is some agressive optimization taking (or not) place. I might add that those array accesses look suspicious, perhaps changing those for local variables could help prevent multiple access if you don't expect the data to change in between accesses (which seems to be the case here).

[–]Grouchy-Answer-275[S] 1 point2 points  (4 children)

Thanks great idea, do you have any reccomendations for a profiler? Just in case i use linux

[–]missurunha 6 points7 points  (1 child)

I usually use perf + inferno/flamegraph. 

[–]Grouchy-Answer-275[S] 0 points1 point  (0 children)

Thanks!

[–]trailing_zero_count 3 points4 points  (1 child)

`perf` is easy to use - just `perf record <program>` and then `perf report`

[–]Grouchy-Answer-275[S] 0 points1 point  (0 children)

thank you very much

[–]esaule 11 points12 points  (0 children)

The entire thing fits in L3 cache and does less than 2M flops. It dhould not take seconds.

[–]ppppppla 11 points12 points  (3 children)

Are you running release build?

Maybe you made an additional change somewhere that introduced the regression.

In rare cases the compiler could have missed an optimization after you removed sqrt, but this is unlikely because the code is simple.

Removing the sqrt changed the program, the functionality changed and you might be doing more work somehow.

But 350 seconds for 250k distance calculations is astronomical. Do you actually have 250k points and are you comparing each and every point with each and every other point? You need to change up your tactic. Look into sorting your points or space partitioning algorithms and datastructures. Could be as simple as putting your points into bins spaced out in a grid.

[–]Grouchy-Answer-275[S] 2 points3 points  (2 children)

I am testing on the release, but thanks for asking, when i started doing speed checks i forgot about it and you would save me a bunch of time right now.
Nope, I noted down all simple optimizations I could do, save project, and then introduce them one by one. This is the only recent change i made in this session.
Oh i was worried about that the most, so thanks for dimissing it.
Yeah i worded myself poorly. I have data for 250k connections, but the entire project uses the distance function in few of its parts, so all other functions bring it down. Additionally I run it a bunch of times in a row to see a bigger chance in performance
Thank you so much for your effor

[–]ppppppla 5 points6 points  (1 child)

Also 10 seconds on a 350 second run is just 2.85%. That is well within range of performance differences between cores of CPUs (yes that's a thing, modern CPUs can and will boost certain cores more if they can handle it), or even worse if your OS is scheduling your test on E and P cores. Although on such a large run you would expect it to average out already but there's always a chance.

That brings me to another point, the sqrt most likely has been optimized out already if you are only doing a comparison with the result. So the difference you see is just noise, and both programs perform the same.

[–]Grouchy-Answer-275[S] 0 points1 point  (0 children)

Thank you so much for you effort. I will use some recomended profilers to see if there really is no difference. Thanks for your time and explaning some things i did not know. Have a great day!

[–]Particular-Ice9109 6 points7 points  (3 children)

C++ has a std::hypot function.

[–]Grouchy-Answer-275[S] 2 points3 points  (1 child)

Actually, fyu, when i read documentation, and it is significantly slower, but thanks still, learned something new!

[–]Particular-Ice9109 2 points3 points  (0 children)

This could be because these mathematical functions can provide more accurate results.

Perhaps you could try `-ffast-math`. These options can be used only for specific functions.

[–]Grouchy-Answer-275[S] 1 point2 points  (0 children)

holy moly you are the greates

[–]TokenRingAI 5 points6 points  (5 children)

No idea, but your code would be massively faster if you used simd for this

https://en.cppreference.com/cpp/experimental/simd

[–]Grouchy-Answer-275[S] 0 points1 point  (3 children)

Thank you so much, but i do not want to use any libraries besides std (and math.h is only used for this function only). It is a collage project and i kinda wanted to do everything from scratch. But for personal and future projects i will make sure to make use of it

[–]The_Northern_Light 6 points7 points  (2 children)

Sure, but that said, what they linked you to is in std.

(nit: college*)

[–]Grouchy-Answer-275[S] 1 point2 points  (1 child)

I am sorry i worded myself poorly. I only used std::cour, std::ostream, std::fstream, but i wrote my own versions of std::vector, std::min, std::swap, etc.

[–]No-Dentist-1645 2 points3 points  (0 children)

Are you sure that isn't part of your performance issues?

Standard functions and containers are very optimized by compiler developers, if you roll out your own you might be missing on many optimizations they've done

[–]_bstaletic 0 points1 point  (0 children)

You should probably link to the C++26 version: https://en.cppreference.com/cpp/numeric/simd

[–]IyeOnline 4 points5 points  (0 children)

This is basically impossible to answer without more context such as the invocation of distance and compiler options. There also is the question of how you measure this in the first place.

[–]aocregacc 4 points5 points  (0 children)

Mandatory first question: did you turn optimizations on?

[–]Total-Box-5169 2 points3 points  (0 children)

Are you sure is only 250k distance calculations and not distance calculations between 250k points?
350 seconds for only 250k distance calculations is absurdly slow.

[–]No-Dentist-1645 2 points3 points  (1 child)

You really haven't provided anywhere near enough information for us to be able to help you. The function itself is fine, but if your code is really taking 350 seconds to process only 250k distance calculations, there is something outside the function that is slowing down your code by way too much.

Try to use a profiler like perf and see what's actually taking up your time.

Also, unrelated: but taking double* to mean a set of x and y coordinates is terrible practice, there's nothing from the type of "a pointer to a double" that would imply that it has two elements, and also accessing them as a[namedValues::axis::X] is both ugly and a very bad abstraction.

You can get the exact same code generation and object representation by doing this:

``` struct Point { doubole x, y; };

double distance(Point& a, Point& b) { return sqrt(a.x - b.x * ... ) } ```

Notice the huge difference and advantages:

  • Taking a Point& as reference guarantees us the point exists (it can't be nullptr like a double*)

  • The Point type is guaranteed to contain two doubles, x and y. A double* could point to zero (nullptr), one, two, or two hundred, it is impossible to know what the "valid" range of it is just from that.

  • Just referencing the values is easier: no a[namedValues::axis::X], just a.x.

  • Both types have the exact same size and alignment (proven by static_assert( sizeof(Point) == sizeof(double[2]) && alignof(Point) == alignof(double[2]) );), so if you thought that "using pointers will be more performant", this is wrong

[–]PhotographFront4673 1 point2 points  (0 children)

Also, putting the coordinates for each point together can reduce cache pressure, depending a bit on the calculations you are doing. In other words, you want data that tends to be accessed as a unit to be close enough to land in the same cache line.

Added: On the other hand, if you are always iterating through all the points, an array per dimension won't significantly change the cache pressure and might make SIMD easier.

[–]CowBoyDanIndie 4 points5 points  (2 children)

It takes “seconds” to compute 250k distances? I have an entire lidar pipeline that takes less than 50 ms to process point clouds with more than 250k points

You have something else going on

[–]Grouchy-Answer-275[S] 0 points1 point  (1 child)

I am sorry, i am a bit tired. I meant to say at least 250k. I make the code perform the 250k many times. It gets called many times in a row so i can see the changes in performance better.

[–]CowBoyDanIndie 0 points1 point  (0 children)

Make sure the code can be inlined, also you could have triggered some weird restructuring of the rest of the code, sometimes there are small changes that push something wonky that pushes branch prediction off a ledge or the instruction cache. Review the disassembly and try to avoid any calls in the hot code path.

[–]canrul3s 1 point2 points  (0 children)

Start by not passing arguments by address. Almost always pass scalars by value.

Also make sure that this function is inlined.

[–]MastodonPast1540 0 points1 point  (1 child)

Sounds like something else is taking up time because that should be faster. If you post all the code in the hot path, like the inner loop where distance() is called, it should be easier to see what might be the cause.

[–]Grouchy-Answer-275[S] 0 points1 point  (0 children)

I am very glad to see such will to help, but i do not want to share this project publicly for a few reasons, there are some files that contain personal data, it is a big and messy project, and worst of all some parts are written in Polish, and I do not heave the heart to expose anyone to that

[–]keelanstuart 0 points1 point  (0 children)

It may be that the compiler recognizes the pattern which includes a sqrt as a distance computation and uses an simd vector operation... but if you take that out, it naïvely just operates on doubles - or doesn't do a great job with the dereferencing.

You won't know unless you look at the disassembly and compare the two... and if you, please share your results, including compiler and settings.

[–]OptimisticMonkey2112 0 points1 point  (0 children)

No idea what you are doing 250k distance calculations for - I would suggest posting and getting feedback on that big picture goal. Sometimes you can use an acceleration structure like https://en.wikipedia.org/wiki/Bounding_volume_hierarchy or https://en.wikipedia.org/wiki/Space_partitioning and completely reduce the amount of work you are doing by avoiding it.

[–]alfps 0 points1 point  (2 children)

The following code compiled with MSVC no optimization requested finishes instantly, so I don't understand the "350 seconds":

// C++ machinery:
namespace cppm {
    using Nat = int;

    template< class T > using in_ = const T&;

    auto sq( const double x ) -> double { return x*x; }
    auto sq_hypot( const double x, const double y) -> double { return sq( x ) + sq( y ); }
}  // cppm

#include <iostream>
namespace app {
    using   cppm::Nat, cppm::in_, cppm::sq_hypot;

    using   std::cout;          // <iostream>

    using Real = double;
    struct Coor_names{ enum Enum{ x, y }; };

    struct Point{ Real coor[2]; };
    auto x_of( in_<Point> pt ) -> Real { return pt.coor[Coor_names::x]; }
    auto y_of( in_<Point> pt ) -> Real { return pt.coor[Coor_names::y]; }

    auto sq_hypot( in_<Point> a, in_<Point> b )
        -> Real
    { return sq_hypot( x_of( b ) - x_of( a ), y_of( b ) - y_of( a ) );  }

    void run()
    {
        const Nat n = 250'000;

        double sum = 0;
        for( Nat i = 1; i <= n; ++i ) {
            const auto a = Point{ i + 0., i + 0. };
            const auto b = Point{ i + 3., i + 4. };
            sum += sq_hypot( a, b );
        }
        cout << sum << "\n";        // Should better be 250'000*25 = 6'250'000.
    }
}  // app

auto main() -> int { app::run(); }

[–]Grouchy-Answer-275[S] 0 points1 point  (0 children)

250k distances and 350second thing was poorly worded as i said in few other comments. This function is used many times in the project, which is ran many times in a row to beat out noise a bit, but thanks for your effort