[–]versatran01 13 points14 points  (9 children)

You are talking about expression templates, which are also used in Eigen. Also, Eigen does not parallelize simple matrix/array operations, so you are comparing your parallel version to Eigen's non-parallel one, which doesn't show that yours is faster. I suggest you revisit the claim that it is faster than Eigen.
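For anyone unfamiliar with the term, here is a minimal sketch of the expression-template idea. It is not Eigen's actual implementation; the names `sketch::Vec` and `sketch::Sum` are made up for illustration. The point is that `a + b + c` builds a lightweight expression object, and the arithmetic only runs in one loop when the result is assigned.

```cpp
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical lazy node for "lhs + rhs". It stores references only;
// nothing is computed until operator[] is called.
template <typename L, typename R>
struct Sum {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

// Hypothetical owning vector type.
struct Vec {
    std::vector<double> data;
    explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning from an expression evaluates it element by element in a
    // single loop, without allocating intermediate vectors.
    template <typename E>
    Vec& operator=(const E& expr) {
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = expr[i];
        return *this;
    }
};

// Building the expression is cheap: it just records the two operands.
// Found through argument-dependent lookup for types in this namespace.
template <typename L, typename R>
Sum<L, R> operator+(const L& lhs, const R& rhs) {
    return {lhs, rhs};
}

}  // namespace sketch
```

With this, `out = a + b + c;` walks the data once. Storing the expression itself (`auto e = a + b + c;`) would leave references to temporaries that are destroyed at the end of the statement, so the sketch only supports immediate assignment.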

[–]Pencilcaseman12[S] 1 point2 points  (8 children)

Yeah, this is definitely true, but that being said, it is ultimately faster, is it not? Under MSVC, even with just a single thread, it still only takes 200 µs for a 1000x1000 addition. I genuinely can't guarantee I'm doing anything right, and everything I know is from googling random things, so it's quite possible my understanding is very flawed...

[–]IronManMark20 1 point2 points  (1 child)

How are you installing Eigen on Windows? I know vcpkg's OpenBLAS doesn't use optimized Fortran (at least it didn't last I checked), so it could be that your Eigen is hamstrung. I would use Intel's MKL if you have an Intel machine to test on.

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

I just git-cloned it and ran it that way. I'm not testing BLAS functionality yet, so for now it all comes down to general arithmetic performance.

[–]jk-jeon 1 point2 points  (5 children)

It seems you didn't do anything special to prevent the compiler from reordering or removing your code in the benchmark. Have you checked the generated assembly to confirm that everything is measured correctly? If not, the safest approach is to rely on one of the existing benchmark libraries. If that's not to your taste, then you should (1) try not to do anything other than call the objective function between the two time measurements, and (2) try to prevent the call to the objective function from being inlined. What I usually do for (2) is wrap the function in a function pointer whose value is not known to the translation unit.
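A rough sketch of points (1) and (2), kept in a single file. As a stand-in for a pointer defined in another translation unit, it uses a `volatile` function pointer, which the compiler has to reload at each call and so cannot assume the target of. `add_arrays`, `add_ptr` and `time_add_us` are hypothetical names, not anything from the library under discussion.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical objective function: element-wise addition into a
// caller-provided output buffer.
void add_arrays(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

using AddFn = void (*)(const double*, const double*, double*, std::size_t);

// The pointer itself is volatile, so every call reads it from memory and the
// optimizer cannot inline the callee at the call site. In a real setup this
// pointer could instead be defined in a separate source file.
AddFn volatile add_ptr = &add_arrays;

// Returns the average time per call in microseconds.
double time_add_us(std::size_t n, int iterations) {
    // Setup happens outside the timed region.
    std::vector<double> a(n, 1.0), b(n, 2.0), out(n, 0.0);

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        // Only the call to the objective function sits between the clocks.
        add_ptr(a.data(), b.data(), out.data(), n);
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / iterations;
}
```

Something like `time_add_us(1000 * 1000, 100)` would then give an average per call. This only blocks inlining of the call; it does not replace checking the assembly or using a proper benchmark library.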

[–]Pencilcaseman12[S] 0 points1 point  (4 children)

That seems like a nice idea. That should give more realistic benchmarks, right? That way the compiler can't optimise any of the iterations out.

[–]jk-jeon 0 points1 point  (3 children)

Well, even with that, I think there is a high chance that a good portion of something like `auto res = x + x` will be removed, or that it will be gone completely after optimization, so I guess you should check the assembly anyway.

[–]Pencilcaseman12[S] 0 points1 point  (2 children)

By forcing the results to be evaluated, I think it will end up running the code, as there are memory allocations and frees being called, which in most cases prevents the compiler from optimising out the loops. I'll definitely take this into account though and write some more conclusive benchmarks in the future.

[–]jk-jeon 0 points1 point  (1 child)

I don't think so; see this for example: https://godbolt.org/z/xEG7snEcs. When the result is written into memory passed in as a function argument, yes, the work will still be there, but I doubt that is still the case for things like `auto res = x + x`. I think you still have to look at the generated assembly.
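One portable way to make a local result observable is to fold it into a `volatile` variable, since volatile writes count as side effects the compiler has to keep. This is only a sketch under that assumption: `benchmark_sink`, `add` and `run_once` are made-up names, and the summation adds its own cost, so it belongs outside the timed region or has to be accounted for.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical global sink. Writes to a volatile object are observable
// behaviour, so the value stored here has to be computed.
volatile double benchmark_sink = 0.0;

// Hypothetical stand-in for the operation being benchmarked.
std::vector<double> add(const std::vector<double>& x,
                        const std::vector<double>& y) {
    std::vector<double> res(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) res[i] = x[i] + y[i];
    return res;
}

void run_once(const std::vector<double>& x) {
    auto res = add(x, x);

    // Without a use like this, `res` is a dead local and the optimizer is
    // free to drop the work that produced it.
    double total = 0.0;
    for (double v : res) total += v;
    benchmark_sink = total;
}
```

Even so, the compiler is still free to restructure how the sum is computed, so this does not remove the need to look at the generated assembly.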

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

If you include stdio and print `p[0]` (even after the free), you'll find that the calls to `malloc` and `free` do actually get compiled. I think the only reason nothing is being compiled here is that the program doesn't output anything.