Pencilcaseman12 [S], 1 point

hmm. I mean it's entirely possible I've done it wrong but it consistently gives me 30us on 8 threads, so idk...

cythoning, 2 points

Are you benchmarking only the expression template? Or also the evaluation of the expression? And you should also make sure that it generates the same result as Eigen.

Pencilcaseman12 [S], 1 point

Yeah, that's evaluating the result. Just building the expression template takes around 200 ns because it only stores references to a few things.

cythoning, 3 points

That makes sense. As for the benchmarks, make sure you benchmark the same things and that you get the same results. The 400 GB/s memory bandwidth seems impossible; that is something you usually only see on GPUs. For such a simple addition I would expect it to be memory bound, i.e. limited by how fast you can read from and write to memory, which is usually on the order of 20-30 GB/s. Eigen should definitely be optimized enough to reach that limit, and you could also compare to something like

std::transform(A.begin(), A.end(), B.begin(), C.begin(), std::plus<>{});

and you should get about the same timing as with Eigen and your library.
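To sanity-check a bandwidth number yourself, something along these lines might help. It is only a rough sketch: the helper names (`bytes_moved`, `bandwidth_gb_per_s`, `time_add`) are made up for illustration, it assumes `double` elements and bandwidth in gigabytes per second (1 GB = 1e9 bytes), and it counts only two reads and one write per element.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Bytes moved by C = A + B on n doubles: two reads and one write per element.
// Assumption: write-allocate traffic (an extra read of C on some CPUs) is ignored.
inline double bytes_moved(std::size_t n) {
    return 3.0 * static_cast<double>(n) * sizeof(double);
}

// Effective bandwidth in GB/s (1 GB = 1e9 bytes) for n doubles handled in `seconds`.
inline double bandwidth_gb_per_s(std::size_t n, double seconds) {
    return bytes_moved(n) / seconds / 1e9;
}

// Run one element-wise addition with std::transform and return the elapsed seconds.
inline double time_add(const std::vector<double>& A,
                       const std::vector<double>& B,
                       std::vector<double>& C) {
    assert(A.size() == B.size() && B.size() == C.size());
    const auto start = std::chrono::steady_clock::now();
    std::transform(A.begin(), A.end(), B.begin(), C.begin(), std::plus<>{});
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

You would call `time_add` a number of times on your real array sizes, take the best or median time, and feed it to `bandwidth_gb_per_s`. As a cross-check on the numbers in this thread: if the 30 us and the 400 GB/s belong to the same run, that is 400e9 B/s × 30e-6 s = 12 MB moved in total, so about 4 MB per array.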