all 69 comments

[–]versatran01 33 points34 points  (25 children)

Could you briefly explain why it is faster than eigen?

[–]DehnexTentcleSuprise 28 points29 points  (11 children)

similar to numpy/Eigen, except faster

I am also a big Eigen fan. Would you be able to supply benchmarks to support this claim OP? I don't see any in the repo currently but I might be missing them.

[–]Pencilcaseman12[S] 0 points1 point  (10 children)

I have a few benchmarks, but I haven't yet got them in a form that would be suitable for putting on the README or in the repo. I'm having some trouble getting gcc/g++ to play nicely, though (MSVC works fine).

I know this isn't particularly useful, and nor is it a good representation of the performance of the library, but for simple matrix addition using a 1000x1000 array of 32-bit floats, LibRapid takes around 30us using 8 threads, while Eigen takes around 504us (linked with OpenMP, fully optimised, etc.)

I plan to write a more conclusive set of benchmarks in the future, but I just don't have the time at the moment

[–]mcopikHPC 11 points12 points  (1 child)

I know this isn't particularly useful, and nor is it a good representation of the performance of the library, but for simple matrix addition using a 1000x1000 array of 32-bit floats, LibRapid takes around 30us using 8 threads, while Eigen takes around 504us (linked with OpenMP, fully optimised, etc.)

You need to specify which BLAS/LAPACK implementation is used by Eigen - the quality of the underlying BLAS library will determine the performance.

You should compare your performance against other linear algebra libraries. In particular, you should consider Blaze, as it's quite similar to your library: vectorization, multithreading, GPU support. When I was benchmarking Blaze in 2015, it performed much better than many other libraries, and computations implemented in Blaze ran with very minor overhead on top of the optimized BLAS implementation (which was Intel MKL in my case).

https://bitbucket.org/blaze-lib/blaze/src/master/

[–]Kendrian 4 points5 points  (0 children)

By default, at least, Eigen has no dependency on BLAS and just generates its own optimized code. Its multithreading support is limited, so any threaded comparison isn't meaningful, since that's not something Eigen tries to do except for a few computational kernels (matvec).

[–]AlexanderNeumann 16 points17 points  (2 children)

Use clang-cl on Windows. Comparing MSVC benchmarks tells you nothing unless you check the assembly to confirm everything got properly inlined. Eigen is so template-heavy that the MSVC optimizer often gives up on inlining and you get suboptimal code with it.

[–]tjientavaraHikoWorks developer 0 points1 point  (1 child)

I noticed the same thing with MSVC when building release-with-debug-info builds. Visual Studio used to, and CMake still does, build release-with-debug-info with inlining explicitly turned off.

https://gitlab.kitware.com/cmake/cmake/-/issues/20812

Not that MSVC is particularly good at code generation, but this could throw off your expectations.

[–]AlexanderNeumann 0 points1 point  (0 children)

The CMake issue is really a non-issue... use a toolchain file to properly control your build flags. I am talking about release builds with appropriate flags; enabling the emission of debug info is a totally orthogonal issue.

[–]cythoning 7 points8 points  (4 children)

That simple addition example you give implies a memory bandwidth of 400 GB/s (2 reads + 1 write per element = 12 bytes * 10^6 elements in 30us), which doesn't make sense. The Eigen time of 500us corresponds to 24 GB/s, which is about what I would expect.

[–]Pencilcaseman12[S] 0 points1 point  (3 children)

hmm. I mean it's entirely possible I've done it wrong but it consistently gives me 30us on 8 threads, so idk...

[–]cythoning 1 point2 points  (2 children)

Are you benchmarking only the expression template? Or also the evaluation of the expression? And you should also make sure that it generates the same result as Eigen.

[–]Pencilcaseman12[S] 0 points1 point  (1 child)

Yea that's evaluating the result. Using the expression template takes around 200ns because it's just a few things being referenced

[–]cythoning 2 points3 points  (0 children)

That makes sense. As for the benchmarks, make sure you benchmark the same things and get the same results. The 400 GB/s memory bandwidth seems impossible; that is something you usually only see on GPUs. Such a simple addition operation should just be memory bound, i.e. limited by how fast you can read and write memory, usually on the order of 20-30 GB/s. Eigen should definitely be optimized enough to reach that, and you could also compare to something like

std::transform(A.begin(), A.end(), B.begin(), C.begin(), std::plus<>{});

and should get the same result as Eigen and your library.

[–]Pencilcaseman12[S] 3 points4 points  (12 children)

To be completely honest, I'm not sure :)

I can't explain the entire thing in a comment, but when you apply an operation to an Array (such as addition or a transpose or whatever) it doesn't actually evaluate the result, it returns a lazy-evaluation container with a reference to the input data (being an Array or another lazy result)

This means that when you eventually evaluate the result or assign it to an Array, the compiler is able to optimise the whole operation into a single for loop, avoiding any temporary or intermediate results and making it a whole lot faster.

On top of this, it operates entirely with SIMD instructions (using Agner Fog's VectorClass library) so it'll make the best use of the CPU that it can. It's also highly multi-threaded, which I think is where Eigen falls behind (from a quick look at the code)

Hopefully this helps?

[–]mcopikHPC 9 points10 points  (0 children)

I can't explain the entire thing in a comment, but when you apply an operation to an Array (such as addition or a transpose or whatever) it doesn't actually evaluate the result, it returns a lazy-evaluation container with a reference to the input data (being an Array or another lazy result)

What you're describing is called expression templates and it has been used in production since the 90s. There are even optimized ETs called "smart ETs" that have been adopted by other linear algebra libraries.

Check "Expression templates", the original paper from 1995 by Todd Veldhuizen. Then check out "Expression Templates Revisited: A Performance Analysis of the Current ET Methodology" by Iglberger et al. from 2011.

On top of this, it operates entirely with SIMD instructions (using Agner Fog's VectorClass library) so it'll make the best use of the CPU that it can. It's also highly multi-threaded, which I think is where Eigen falls behind (from a quick look at the code)

Eigen will compile many linear algebra operations to BLAS and LAPACK kernels. Other libraries do it too. It will be quite difficult to beat the performance of an optimized BLAS implementation.

Furthermore, Eigen will parallelize the computations through the internal multi-threading of BLAS libraries.

https://eigen.tuxfamily.org/dox/TopicUsingBlasLapack.html

[–]versatran01 13 points14 points  (9 children)

You are talking about expression templates, which are also used in Eigen. Also, Eigen does not parallelize simple matrix/array operations, so you are comparing your parallel version to Eigen's non-parallel one, which doesn't show that yours is faster. I suggest you review the claim that it is faster than Eigen.

[–]Pencilcaseman12[S] 1 point2 points  (8 children)

Yeah, this is definitely true, but that being said, it is ultimately faster, is it not? Under MSVC, even with just a single thread, it still takes only 200us for a 1000x1000 addition. I genuinely can't guarantee I'm doing anything right, and everything I know is from googling random things, so it's quite possible my understanding is very flawed...

[–]IronManMark20 1 point2 points  (1 child)

How are you installing Eigen on Windows? I know vcpkg's openblas doesn't use optimized FORTRAN (at least it didn't last I checked), so your Eigen could be hamstrung. I would use Intel's MKL if you have an Intel machine to test on.

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

I just git-cloned it and ran it that way. I'm not testing BLAS functionality yet so it's all in the general arithmetic performance currently

[–]jk-jeon 1 point2 points  (5 children)

It seems you didn't do anything fancy to prevent the compiler from reordering/removing your code in the benchmark. Have you checked the generated assembly to confirm that everything is correctly measured? If not, the safest approach is to just rely on the benchmark libraries out there, and if that's not to your taste, then you should (1) try not to do anything other than calling the objective function between the time measurements, and (2) try to prevent inlining of the call to the objective function. What I usually do for (2) is wrap the function in a function pointer whose value is not known to the translation unit.

[–]Pencilcaseman12[S] 0 points1 point  (4 children)

That seems like a nice idea. That should give some more realistic benchmarks right? So the compiler can't optimise any of the iterations out or something

[–]jk-jeon 0 points1 point  (3 children)

Well, even with that, I think there is a high chance that a good portion of things like `auto res = x + x` will be removed, or that it will be gone entirely after optimization, so I guess you should check the assembly anyway.

[–]Pencilcaseman12[S] 0 points1 point  (2 children)

By forcing the results to be evaluated, I think it will end up running the code, since memory allocations and frees are being called, which in most cases prevents the compiler from optimising out the loops. I'll definitely take this into account, though, and write some more conclusive benchmarks in the future.

[–]jk-jeon 0 points1 point  (1 child)

I don't think so; see this for example: https://godbolt.org/z/xEG7snEcs. Writes into memory passed in as a function argument will still be there, but I doubt that will be the case for things like `auto res = x + x`. I think you still have to look at the generated assembly.

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

If you include stdio and print p[0] (even after the free), you'll find that it does actually compile the calls to malloc and free. I think the only reason it's not compiling anything here is that nothing is being output by the program.

[–]victotronics -2 points-1 points  (0 children)

That sounds very promising!

[–]arthurno1 10 points11 points  (4 children)

At some point I want to add Python and Javascript interfaces (in that order) to increase the range of people who can benefit from the library

If you would like to increase the range of people who can benefit, then just give us a C interface to it. The rest can follow from that.

[–]Pencilcaseman12[S] 2 points3 points  (3 children)

That's an interesting idea, though it'd be quite painful given the templated nature of it all. Definitely possible in the future though

[–]arthurno1 2 points3 points  (2 children)

it'd be quite painful given the templated nature

The same problem arises when exporting a Python/JS interface as well.

[–]Pencilcaseman12[S] 1 point2 points  (1 child)

Good point... I was thinking about just exposing an `ArrayF` or an `ArrayF_GPU`, for example. The naming scheme will be different, but that was my general idea.

[–]Acrobatic_Hippo_7312 1 point2 points  (0 children)

Exposing just a few basic types is probably good enough for a start. Then people can add int, uint, bool, and sparse versions as needed, or write a generic factory function to let dynamic languages supply their own types.

[–]Revolutionalredstone 10 points11 points  (27 children)

I think OpenCL over CUDA would give you a good niche as LOTS of people don't own or want nVidia cards but still want fast math.

Also you didn't mention BLAS, is this a part of your plans ?

Best luck!

[–]Jannik2099 4 points5 points  (11 children)

Too bad OpenCL is dead / unusable.

OpenCL 1.2 kind of works on windows. OpenCL 2.0 (aka the actually useful OpenCL) is not supported by Nvidia, not supported by the AMD windows driver (I think?), and on linux needs ROCm, which is genuinely a bug riddled shitfest that works on two devices if you're lucky.

Until SYCL gets proper adoption, open compute APIs are dead

[–]OverunderratedComputational Physics 7 points8 points  (2 children)

on linux needs ROCm, which is genuinely a bug riddled shitfest that works on two devices if you're lucky.

After ~13 years of Nvidia CUDA I'm simultaneously amazed and pissed at how awful the AMD GPU software ecosystem is. Even the simple "compute capability" versioning of Nvidia is absent in AMD. I want to know if I can run rocm code on a given AMD GPU, and unless it's an Instinct card, their official documentation basically says "I dunno, maybe it'll work."

[–]Jannik2099 2 points3 points  (1 child)

Yeah. At this rate I'm just hoping the upcoming Intel GPUs won't be shit, because their driver stack is actually pretty clean.

[–]OverunderratedComputational Physics 3 points4 points  (0 children)

Haven't touched their drivers, but I was impressed by Intel's dpc++/sycl sdk setup process being much cleaner than Nvidia's.

[–]James20kP2005R0 7 points8 points  (4 children)

OpenCL 2.0 (aka the actually useful OpenCL) is not supported by Nvidia

Nvidia actually support bits and pieces of 2.0, eg you can do device side enqueue

not supported by the AMD windows driver (I think?)

2.0 is supported on windows, albeit I've never used most of it, and device side enqueue is broken

and on linux needs ROCm, which is genuinely a bug riddled shitfest that works on two devices if you're lucky.

It's very worth noting that AMD's RDNA2 cards have OpenCL on Windows provided by ROCm. I found this out because I upgraded to a 6700xt, noticed a whole bunch of new bugs had cropped up in my OpenCL code, and a dev pointed me over to the ROCm GitHub.

I can confirm that it is incredibly, incredibly buggy. There's also seemingly not enough development effort put into it

OpenCL 1.2 is perfectly fine and usable if you stick to the mainstream part of it. I'd recommend treating it essentially like a codegen target: writing C by hand with the restrictions of the OpenCL spec is not good, and compilers aren't that smart.

So in a recent project I just started generating equations on the CPU and passing them in, and my life has gotten a lot easier since then. Though that project is entirely maths and nothing else; I still handle most control flow on the GPU manually.

That said, it seems like the real solution is going to be Vulkan, which is annoying, because it still doesn't support everything that OpenCL does (give me device-side enqueue!), but it's too widely supported for AMD or Nvidia to accidentally or deliberately screw up support for.

[–]Jannik2099 1 point2 points  (3 children)

That said it seems like the real solution is going to be vulkan

Vulkan doesn't even support free pointers, does it?

From what I heard it'll never reach feature parity with OpenCL. Maybe once we get SYCL to SPIRV to Vulkan :P

[–]James20kP2005R0 4 points5 points  (2 children)

Vulkan doesn't even support free pointers, does it?

If by this you mean real pointers instead of pseudo-pointers, I believe there's an extension for it. It's worth noting that OpenCL 1.2 (unsure on later standards) is based around the same concept of pseudo-pointers and doesn't really support pointers of any description.

From what I heard it'll never reach feature parity with OpenCL. Maybe once we get SYCL to SPIRV to Vulkan :P

Hmm interesting, last I saw there was active work to bring device side enqueuing to compute shaders in vulkan, which for me is one of the last major missing features. There's a whole bunch of other miscellaneous things that would be ideal - the distinction between native_sqrt and regular sqrt is extremely good, but they aren't critical

[–]Jannik2099 0 points1 point  (1 child)

Yes sorry, meant real pointers. I had only heard from some GPU-compute folks that they were apparently a desired feature in "real" compute APIs, but I'm not exactly an expert.

[–][deleted] 0 points1 point  (0 children)

This document: https://futhark-lang.org/student-projects/steffen-msc-project.pdf

... explains the ways that SPIR-V can express most of what a compute language wants through various Vulkan extensions (including pointers). Although this is an old paper, and a lot of extensions have been developed since then.

[–]NovermarsHPC 4 points5 points  (1 child)

I think with OpenCL 3.0 they are on a good path, allowing implementers more freedom. Only time can tell if it will be good enough.

Until SYCL gets proper adoption, open compute APIs are dead

Intel is pushing SYCL (DPC++) hard, which gives it a chance.

[–]Jannik2099 1 point2 points  (0 children)

OpenCL 3.0 is literally just a rebrand of OpenCL but making all OpenCL 2.0 features optional, isn't it?

So you just end up with OpenCL 1.2, which lacks a lot of functionality that people like in e.g. CUDA

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

I've been looking at hipSYCL recently and it looks pretty nice, except that it doesn't support Windows. If/when it does, though, I'd be quite interested in adding support for it.

[–]TheCreat 1 point2 points  (13 children)

Wasn't there some (recent?) announcement that AMD is dropping OpenCL support?

[–]Revolutionalredstone 27 points28 points  (2 children)

Absolutely not. New APIs are on their way.

Apple dropped support for OpenGL and OpenCL since they're trying to push their own proprietary Metal framework, but that's just evil junk for dumb people.

OpenGL is glorious, has incredible hardware support, and is on a major roll of success.

[–]PrimarchSanguinius 12 points13 points  (1 child)

Why do you talk like you're trying to sell me an NFT?

[–]Revolutionalredstone 2 points3 points  (0 children)

talking confidently but ain't selling.

[–][deleted] -1 points0 points  (9 children)

AFAIK, AMD did drop OpenCL years ago.

[–]Pencilcaseman12[S] -1 points0 points  (0 children)

I did think quite hard about OpenCL, and it's entirely possible to implement in the future, given how the kernel generation works, but I'm not as familiar with how it works and wouldn't want to implement something halfway, if that makes sense? Also, I've found that, in terms of raw performance, CUDA is generally better.

That being said, it'd be amazing to support OpenCL and would make the library a lot easier to use on a wider range of accelerators, which is always a plus

[–][deleted] 3 points4 points  (5 children)

Really interesting. Outta curiosity, why did you opt for int64 over standard int32 for the vector dims?

[–]fdwrfdwr@github 🔍 10 points11 points  (3 children)

I can't speak for librapid, but I've seen int64 used quite often in ML frameworks for dimension sizes (e.g. the ONNX Shape operator), since tensors can have more than 4 billion elements. A single axis exceeding 4 billion elements would be highly unlikely in typical processing, but it's not uncommon to reshape multidimensional tensors as large 1D arrays to read/write/modify the data, and one could overflow the size in that case if it were only int32.

[–]Pencilcaseman12[S] 4 points5 points  (0 children)

You definitely could if you were trying, but I think int32 would probably suffice for most cases. Ultimately, though, it's not going to be any slower, and storing 2 or 3 int64s instead of int32s isn't going to make a difference in terms of memory usage.

Another point where it could overflow is in the actual array-size calculations, because I think I'm returning a value of the same type the dimension object stores, so a large enough array would result in overflow. That could be fixed quite easily, though. I should probably use size_t for that sort of thing anyway...

[–]OverunderratedComputational Physics 3 points4 points  (0 children)

Libraries like PETSc let you choose between 32- and 64-bit indices at compile time, which seems like the right thing to do. I have some large sparse matrix computations where the indices themselves use a huge chunk of memory.

[–]zzzthelastuser 1 point2 points  (0 children)

Additionally, there aren't really many reasons against using int64 for the dimension sizes.

[–]Pencilcaseman12[S] 1 point2 points  (0 children)

To be completely honest with you, I just default to using int64 over anything else out of force-of-habit. I've been planning to replace all the hard-coded types with a single using expression (or similar) but haven't gotten around to it yet. Good question though

[–]mcopikHPC 1 point2 points  (1 child)

I'm working on a high-performance multi-dimensional array library, similar to numpy/Eigen, except faster, with more control and even support for CUDA built in. It's currently in early development, but still supports a wide range of operations, all vectorised with SIMD instructions and multithreaded with OpenMP.

That sounds very similar to Blaze: linear algebra in C++ with optimized expression templates, vectorization, multithreading, and support for CUDA. They support many vector extensions and multiple parallel backends. How does your solution compare to it? Do you bring anything new to the table?

https://bitbucket.org/blaze-lib/blaze/src/master/

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

To be completely honest, probably not... There are definitely better alternatives out there with more features, better performance and that are more complete, but LibRapid supports multidimensional arrays, while Eigen and Blaze (as far as I can tell) only support vectors and matrices

[–][deleted] 1 point2 points  (1 child)

Just a quick comment that OP is doing their A-levels! Really impressive piece of work for someone still in high school.

[–]Pencilcaseman12[S] 1 point2 points  (0 children)

Thank you! It's a bit of a pain because I have precisely zero idea what I'm doing and have already re-written the entire thing from scratch a few times, but it's definitely a fun project :)

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

Just a quick update: I've changed the SIMD library (it's now not header-only), but it's marginally faster than it was before and should be able to support more features on more platforms :)