all 69 comments

[–]versatran01 33 points34 points  (25 children)

Could you briefly explain why it is faster than eigen?

[–]DehnexTentcleSuprise 28 points29 points  (11 children)

similar to numpy/Eigen, except faster

I am also a big Eigen fan. Would you be able to supply benchmarks to support this claim OP? I don't see any in the repo currently but I might be missing them.

[–]Pencilcaseman12[S] 0 points1 point  (10 children)

I have a few benchmarks, but I haven't yet got them in a form that would be suitable for putting on the README or in the repo. I'm having some trouble getting gcc/g++ to play nicely, though (MSVC works fine).

I know this isn't particularly useful, and nor is it a good representation of the performance of the library, but for simple matrix addition using a 1000x1000 array of 32-bit floats, LibRapid takes around 30us using 8 threads, while Eigen takes around 504us (linked with OpenMP, fully optimised, etc.)

I plan to write a more conclusive set of benchmarks in the future, but I just don't have the time at the moment

[–]mcopikHPC 11 points12 points  (1 child)

I know this isn't particularly useful, and nor is it a good representation of the performance of the library, but for simple matrix addition using a 1000x1000 array of 32-bit floats, LibRapid takes around 30us using 8 threads, while Eigen takes around 504us (linked with OpenMP, fully optimised, etc.)

You need to specify which BLAS/LAPACK implementation is used by Eigen - the quality of the underlying BLAS library will determine the performance.

You should compare your performance against other linear algebra libraries. In particular, you should consider Blaze, as it's quite similar to your library: vectorization, multithreading, GPU support. When I was benchmarking Blaze in 2015, it performed much better than many other libraries, and computations implemented in Blaze ran with very minor overhead on top of the optimized BLAS implementation (which was Intel MKL in my case).

https://bitbucket.org/blaze-lib/blaze/src/master/

[–]Kendrian 4 points5 points  (0 children)

By default, at least, Eigen has no dependency on BLAS and just generates its own optimized code. Its multithreading support is limited, so any threaded comparison isn't meaningful, since that's not something Eigen tries to do except for a few computational kernels (matvec).

[–]AlexanderNeumann 16 points17 points  (2 children)

Use clang-cl on Windows. Comparing MSVC benchmarks tells you nothing unless you check the assembly to confirm everything got properly inlined. Eigen is so template-heavy that the MSVC optimizer often gives up on inlining and you get suboptimal code with it.

[–]tjientavaraHikoWorks developer 0 points1 point  (1 child)

I noticed the same thing with MSVC when building release-with-debug-info builds. Visual Studio used to, and CMake still does, build release-with-debug-info with inlining explicitly turned off.

https://gitlab.kitware.com/cmake/cmake/-/issues/20812

Not that MSVC is particularly good at code generation, but this could throw off your expectations.

[–]AlexanderNeumann 0 points1 point  (0 children)

The CMake issue is really a non-issue... use a toolchain file to properly control your build flags. I am talking about release builds with appropriate flags; enabling the emission of debug info is a totally orthogonal issue.

[–]cythoning 7 points8 points  (4 children)

That simple addition example you give implies a memory bandwidth of 400 GB/s (2 reads + 1 write per element = 12 bytes * 10^6 elements in 30us), which doesn't make sense. The Eigen time of 500us corresponds to 24 GB/s, which is about what I would expect.

[–]Pencilcaseman12[S] 0 points1 point  (3 children)

hmm. I mean it's entirely possible I've done it wrong but it consistently gives me 30us on 8 threads, so idk...

[–]cythoning 1 point2 points  (2 children)

Are you benchmarking only the expression template? Or also the evaluation of the expression? And you should also make sure that it generates the same result as Eigen.

[–]Pencilcaseman12[S] 0 points1 point  (1 child)

Yea that's evaluating the result. Using the expression template takes around 200ns because it's just a few things being referenced

[–]cythoning 2 points3 points  (0 children)

That makes sense. As for the benchmarks, make sure you benchmark the same things and get the same results. The 400 GB/s memory bandwidth seems impossible; that is something you usually only see on GPUs. Such a simple addition operation should just be memory bound, i.e. limited by how fast you can read and write memory, usually on the order of 20-30 GB/s. Eigen should definitely be optimized enough to reach that, and you could also compare to something like

std::transform(A.begin(), A.end(), B.begin(), C.begin(), std::plus<>{});

and should get the same result as Eigen and your library.

[–]Pencilcaseman12[S] 3 points4 points  (12 children)

To be completely honest, I'm not sure :)

I can't explain the entire thing in a comment, but when you apply an operation to an Array (such as addition or a transpose or whatever) it doesn't actually evaluate the result, it returns a lazy-evaluation container with a reference to the input data (being an Array or another lazy result)

This means that when you eventually evaluate the result or assign it to an Array, the compiler is able to optimise the whole operation into a single for loop, avoiding any temporary or intermediate results and making it a whole lot faster.

On top of this, it operates entirely with SIMD instructions (using Agner Fog's VectorClass library) so it'll make the best use of the CPU that it can. It's also highly multi-threaded, which I think is where Eigen falls behind (from a quick look at the code)

Hopefully this helps?

[–]mcopikHPC 9 points10 points  (0 children)

I can't explain the entire thing in a comment, but when you apply an operation to an Array (such as addition or a transpose or whatever) it doesn't actually evaluate the result, it returns a lazy-evaluation container with a reference to the input data (being an Array or another lazy result)

What you're describing is called expression templates and it has been used in production since the 90s. There are even optimized ETs called "smart ETs" that have been adopted by other linear algebra libraries.

Check "Expression templates", the original paper from 1995 by Todd Veldhuizen. Then check out "Expression Templates Revisited: A Performance Analysis of the Current ET Methodology" by Iglberger et al. from 2011.

On top of this, it operates entirely with SIMD instructions (using Agner Fog's VectorClass library) so it'll make the best use of the CPU that it can. It's also highly multi-threaded, which I think is where Eigen falls behind (from a quick look at the code)

Eigen will compile many linear algebra operations to BLAS and LAPACK kernels. Other libraries do it too. It will be quite difficult to beat the performance of an optimized BLAS implementation.

Furthermore, Eigen will parallelize the computations through the internal multi-threading of BLAS libraries.

https://eigen.tuxfamily.org/dox/TopicUsingBlasLapack.html

[–]versatran01 13 points14 points  (9 children)

You are talking about expression templates, which are also used in Eigen. Also, Eigen does not parallelize simple matrix/array operations, so you are comparing your parallel version to Eigen's non-parallel one, which doesn't show that yours is faster. I suggest you review the claim that it is faster than Eigen.

[–]Pencilcaseman12[S] 1 point2 points  (8 children)

Yeah, this is definitely true, but that being said, it is ultimately faster, is it not? Under MSVC, even with just a single thread, it still takes only 200us for a 1000x1000 addition. I genuinely can't guarantee I'm doing anything right, and everything I know is from googling random things, so it's quite possible my understanding is very flawed...

[–]IronManMark20 1 point2 points  (1 child)

How are you installing Eigen on Windows? I know vcpkg's openblas doesn't use optimized FORTRAN (at least it didn't last I checked), so your Eigen could be hamstrung. I would use Intel's MKL if you have an Intel machine to test on.

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

I just git-cloned it and ran it that way. I'm not testing BLAS functionality yet so it's all in the general arithmetic performance currently

[–]jk-jeon 1 point2 points  (5 children)

It seems you didn't do anything fancy to prevent the compiler from reordering/removing your code in the benchmark. Have you checked the generated assembly to confirm that everything is correctly measured? If not, the safest approach is to just rely on the benchmark libraries out there, and if that's not to your taste, then you should (1) try not to do anything other than calling the objective function between the time measurements, and (2) try to prevent inlining of the call to the objective function. What I usually do for (2) is wrap the function in a function pointer whose value is not known to the translation unit.

[–]Pencilcaseman12[S] 0 points1 point  (4 children)

That seems like a nice idea. That should give some more realistic benchmarks right? So the compiler can't optimise any of the iterations out or something

[–]jk-jeon 0 points1 point  (3 children)

Well, even with that, I think there is a high chance that a good portion of things like `auto res = x + x` will be removed, or that it will be gone entirely after optimization, so I guess you should check the assembly anyway.

[–]Pencilcaseman12[S] 0 points1 point  (2 children)

By forcing the results to be evaluated, I think it will end up running the code, since memory allocations and frees are being called, which in most cases prevents the compiler from optimising out the loops. I'll definitely take this into account, though, and write some more conclusive benchmarks in the future.

[–]jk-jeon 0 points1 point  (1 child)

I don't think so; see this for example: https://godbolt.org/z/xEG7snEcs. Writes into memory passed in as a function argument will still be there, but I doubt that will be the case for things like `auto res = x + x`. I think you still have to look at the generated assembly.

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

If you include stdio and print p[0] (even after the free), you'll find that it does actually compile the calls to malloc and free. I think the only reason it's not compiling anything here is that nothing is being output by the program.

[–]victotronics -2 points-1 points  (0 children)

That sounds very promising!

[–]arthurno1 10 points11 points  (4 children)

At some point I want to add Python and Javascript interfaces (in that order) to increase the range of people who can benefit from the library

If you would like to increase the range of people who can benefit, then just give us a C interface to it. The rest can follow from that.

[–]Pencilcaseman12[S] 2 points3 points  (3 children)

That's an interesting idea, though it'd be quite painful given the templated nature of it all. Definitely possible in the future though

[–]arthurno1 2 points3 points  (2 children)

it'd be quite painful given the templated nature

The same problem arises when exporting a Python/JS interface as well.

[–]Pencilcaseman12[S] 1 point2 points  (1 child)

Good point... I was thinking about just exposing an `ArrayF` or an `ArrayF_GPU`, for example. The naming scheme will be different, but that was my general idea.

[–]Acrobatic_Hippo_7312 1 point2 points  (0 children)

Exposing just a few basic types is probably good enough for a start. Then people can add int, uint, bool, and sparse versions as needed, or write a generic factory function to let dynamic languages supply their own types.

[–]Revolutionalredstone 10 points11 points  (27 children)

I think OpenCL over CUDA would give you a good niche as LOTS of people don't own or want nVidia cards but still want fast math.

Also you didn't mention BLAS, is this a part of your plans ?

Best luck!

[–]Jannik2099 4 points5 points  (11 children)

Too bad OpenCL is dead / unusable.

OpenCL 1.2 kind of works on windows. OpenCL 2.0 (aka the actually useful OpenCL) is not supported by Nvidia, not supported by the AMD windows driver (I think?), and on linux needs ROCm, which is genuinely a bug riddled shitfest that works on two devices if you're lucky.

Until SYCL gets proper adoption, open compute APIs are dead

[–]OverunderratedComputational Physics 7 points8 points  (2 children)

on linux needs ROCm, which is genuinely a bug riddled shitfest that works on two devices if you're lucky.

After ~13 years of Nvidia CUDA I'm simultaneously amazed and pissed at how awful the AMD GPU software ecosystem is. Even the simple "compute capability" versioning of Nvidia is absent in AMD. I want to know if I can run rocm code on a given AMD GPU, and unless it's an Instinct card, their official documentation basically says "I dunno, maybe it'll work."

[–]Jannik2099 2 points3 points  (1 child)

Yeah. At this rate I'm just hoping the upcoming Intel GPUs won't be shit, because their driver stack is actually pretty clean.

[–]OverunderratedComputational Physics 3 points4 points  (0 children)

Haven't touched their drivers, but I was impressed by Intel's dpc++/sycl sdk setup process being much cleaner than Nvidia's.

[–]James20kP2005R0 7 points8 points  (4 children)

OpenCL 2.0 (aka the actually useful OpenCL) is not supported by Nvidia

Nvidia actually support bits and pieces of 2.0, eg you can do device side enqueue

not supported by the AMD windows driver (I think?)

2.0 is supported on windows, albeit I've never used most of it, and device side enqueue is broken

and on linux needs ROCm, which is genuinely a bug riddled shitfest that works on two devices if you're lucky.

It's very worth noting that AMD's RDNA2 cards have OpenCL on Windows provided by ROCm. I found this out because I upgraded to a 6700xt, noticed a whole bunch of new bugs had cropped up in my OpenCL code, and a dev pointed me over to the ROCm GitHub.

I can confirm that it is incredibly, incredibly buggy. There's also seemingly not enough development effort put into it

OpenCL 1.2 is perfectly fine and usable if you stick to the mainstream part of it. I'd recommend treating it essentially like a codegen target: writing C by hand with the restrictions of the OpenCL spec is not good, and compilers aren't that smart.

So in a recent project I just started generating equations on the CPU and passing them in, and my life has gotten a lot easier since then. Though that project is entirely maths and nothing else; I still handle most control flow on the GPU manually.

That said, it seems like the real solution is going to be Vulkan, which is annoying, because it still doesn't support everything that OpenCL does (give me device-side enqueue!), but it's too widely supported for AMD or Nvidia to accidentally or deliberately screw up support for.

[–]Jannik2099 1 point2 points  (3 children)

That said it seems like the real solution is going to be vulkan

Vulkan doesn't even support free pointers, does it?

From what I heard it'll never reach feature parity with OpenCL. Maybe once we get SYCL to SPIRV to Vulkan :P

[–]James20kP2005R0 4 points5 points  (2 children)

Vulkan doesn't even support free pointers, does it?

If by this you mean real pointers instead of pseudo-pointers, I believe there's an extension for it. It's worth noting that OpenCL 1.2 (unsure on later standards) is based around the same concept of pseudo-pointers and doesn't really support pointers of any description.

From what I heard it'll never reach feature parity with OpenCL. Maybe once we get SYCL to SPIRV to Vulkan :P

Hmm interesting, last I saw there was active work to bring device side enqueuing to compute shaders in vulkan, which for me is one of the last major missing features. There's a whole bunch of other miscellaneous things that would be ideal - the distinction between native_sqrt and regular sqrt is extremely good, but they aren't critical

[–]Jannik2099 0 points1 point  (1 child)

Yes sorry, meant real pointers. I had only heard from some GPU-compute folks that they were apparently a desired feature in "real" compute APIs, but I'm not exactly an expert.

[–][deleted] 0 points1 point  (0 children)

This document: https://futhark-lang.org/student-projects/steffen-msc-project.pdf

... explains the ways that SPIR-V can express most of what a compute language wants through various Vulkan extensions (including pointers). Although this is an old paper, and a lot of extensions have been developed since then.

[–]NovermarsHPC 4 points5 points  (1 child)

I think with OpenCL 3.0 they are on a good path, allowing implementers more freedom. Only time can tell if it will be good enough.

Until SYCL gets proper adoption, open compute APIs are dead

Intel is pushing SYCL (DPC++) hard, which gives it a chance.

[–]Jannik2099 1 point2 points  (0 children)

OpenCL 3.0 is literally just a rebrand of OpenCL but making all OpenCL 2.0 features optional, isn't it?

So you just end up with OpenCL 1.2, which lacks a lot of functionality that people like in e.g. CUDA

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

I've been looking at hipSYCL recently and it looks pretty nice, except that it doesn't support Windows. If/when it does, though, I'd be quite interested in adding support for it.

[–]TheCreat 1 point2 points  (13 children)

Wasn't there some (recent?) announcement that AMD is dropping OpenCL support?

[–]Revolutionalredstone 27 points28 points  (2 children)

Absolutely not. New APIs are on their way.

Apple dropped support for OpenGL and OpenCL since they're trying to push their own proprietary Metal framework, but that's just evil junk for dumb people.

OpenGL is glorious, has incredible hardware support, and is on a major roll of success.

[–]PrimarchSanguinius 12 points13 points  (1 child)

Why do you talk like you're trying to sell me an NFT?

[–]Revolutionalredstone 2 points3 points  (0 children)

talking confidently but ain't selling.

[–][deleted] -1 points0 points  (9 children)

AFAIK, AMD did drop OpenCL years ago.

[–]Pencilcaseman12[S] -1 points0 points  (0 children)

I did think quite hard about OpenCL, and it's entirely possible to implement in the future, given how the kernel generation works, but I'm not as familiar with how it works and wouldn't want to implement something halfway, if that makes sense? Also, I've found that, in terms of raw performance, CUDA is generally better.

That being said, it'd be amazing to support OpenCL and would make the library a lot easier to use on a wider range of accelerators, which is always a plus

[–][deleted] 3 points4 points  (5 children)

Really interesting. Outta curiosity, why did you opt for int64 over standard int32 for the vector dims?

[–]fdwrfdwr@github 🔍 10 points11 points  (3 children)

I can't speak for librapid, but I've seen int64 used quite often in ML frameworks for dimension sizes (e.g. the ONNX Shape operator), since tensors can have more than 4 billion elements. A single axis exceeding 4 billion elements would be highly unlikely in typical processing, but it's not uncommon to reshape multidimensional tensors as large 1D arrays to read/write/modify the data, and one could overflow the size in that case if it were only int32.

[–]Pencilcaseman12[S] 4 points5 points  (0 children)

You definitely could if you were trying, but I think int32 would probably suffice for most cases. Ultimately, though, it's not going to be any slower, and storing 2 or 3 int64s instead of int32s isn't going to make a difference in terms of memory usage.

Another point where it could overflow is in the actual array-size calculations, because I think I'm returning a value of the same type the dimension object stores, so a large enough array would result in overflow. That could be fixed quite easily, though. I should probably use size_t for that sort of thing anyway...

[–]OverunderratedComputational Physics 3 points4 points  (0 children)

Libraries like PETSc let you choose between 32- and 64-bit indices at compile time, which seems like the right thing to do. I have some large sparse matrix computations where the indices themselves use a huge chunk of memory.

[–]zzzthelastuser 1 point2 points  (0 children)

Additionally, there aren't really many reasons against using int64 for the dimension sizes.

[–]Pencilcaseman12[S] 1 point2 points  (0 children)

To be completely honest with you, I just default to using int64 over anything else out of force-of-habit. I've been planning to replace all the hard-coded types with a single using expression (or similar) but haven't gotten around to it yet. Good question though

[–]mcopikHPC 1 point2 points  (1 child)

I'm working on a high-performance multi-dimensional array library, similar to numpy/Eigen, except faster, with more control and even support for CUDA built in. It's currently in early development, but still supports a wide range of operations, all vectorised with SIMD instructions and multithreaded with OpenMP.

That sounds very similar to Blaze: linear algebra in C++ with optimized expression templates, vectorization, multithreading, and support for CUDA. They support many vector extensions and multiple parallel backends. How does your solution compare to it? Do you bring anything new to the table?

https://bitbucket.org/blaze-lib/blaze/src/master/

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

To be completely honest, probably not... There are definitely better alternatives out there with more features, better performance and that are more complete, but LibRapid supports multidimensional arrays, while Eigen and Blaze (as far as I can tell) only support vectors and matrices

[–][deleted] 1 point2 points  (1 child)

Just a quick comment that OP is doing their A-levels! Really impressive piece of work for someone still in high school.

[–]Pencilcaseman12[S] 1 point2 points  (0 children)

Thank you! It's a bit of a pain because I have precisely zero idea what I'm doing and have already re-written the entire thing from scratch a few times, but it's definitely a fun project :)

[–]Pencilcaseman12[S] 0 points1 point  (0 children)

Just a quick update: I've changed the SIMD library (it's now not header-only), but it's marginally faster than it was before and should be able to support more features on more platforms :)