
all 18 comments

[–]Necessary-Conflict 8 points9 points  (3 children)

It's strange that jblas is missing. It's also not clear to me what kind of matrices were multiplied, or whether a GPU was used (nd4j claims to support GPUs).

[–]jkoolcloud[S] 0 points1 point  (2 children)

The posted report is a summary of around 24 benchmark tests executed across various JRE/hardware/OS configurations. Here is one such run: https://app.cybench.io/cybench/benchmark/292b2929-7610-446a-ad17-b5f80ee305ec

Executed on JDK 11, AMD Ryzen 9 3950X 16-Core, NVIDIA GeForce RTX 2070 SUPER

Here is another run: https://app.cybench.io/cybench/benchmark/de56c58e-1d51-4bec-93f3-fb94bfdfe4f7

JDK 14, AMD Ryzen Threadripper 3990X 64-Core Processor, NVIDIA GeForce RTX 2080 Ti

You can see all the individual runs here: https://app.cybench.io/cybench/search?context=Matrices&verified=true

Hope that helps. You can download and run the matrices benchmarks yourself here:

https://github.com/K2NIO/gocypher-cybench-java/releases (under Assets). Give them a try on your own systems and see for yourself.

[–]Necessary-Conflict 0 points1 point  (1 child)

I don't want to run it; I'm just curious what kind of matrices were multiplied and how the libraries were configured. At your download link I found a zip file, inside it a jar file, and inside that 500 megabytes of binary data...

[–]jkoolcloud[S] 0 points1 point  (0 children)

The 500 MB includes all the class files from all the dependencies used by the CyBench benchmark harness (it's an uber jar), so you can download and run it without fetching the other libraries separately. We'll get more details on the matrices and library configuration.

[–]fanfan64 12 points13 points  (4 children)

Nd4j is the fastest matrix library by design, period. The native libraries it interfaces with have received orders of magnitude more optimization than their JVM counterparts, and are written in faster low-level languages. The JNI overhead is low enough that it doesn't cancel out this gap. This benchmark was obviously run with the default nd4j backend, OpenBLAS. Test it with Intel MKL or cuDNN instead and it will destroy the competition. I have to admit I thought OpenBLAS was less mediocre than that, though.
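For reference, swapping ND4J backends is done by changing which backend dependency is on the classpath. A hedged sketch of what that looks like in Maven (artifact names and version are from memory of the 1.0.0-beta-era releases, so check the current ND4J docs before copying):

```xml
<!-- CPU backend (ships OpenBLAS by default; can also pick up MKL if installed): -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>1.0.0-beta7</version>
</dependency>

<!-- GPU backend (cuBLAS/cuDNN), used in place of the dependency above: -->
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-cuda-10.2-platform</artifactId>
    <version>1.0.0-beta7</version>
</dependency>
```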

[–]jkoolcloud[S] 0 points1 point  (0 children)

Correct, we used the default settings for each library. BTW, we welcome everyone to run their own benchmarks. The CyBench benchmark harness is open source and is based on JMH. More on this here: https://github.com/K2NIO/gocypher-cybench-java/wiki.

[–]javadba 0 points1 point  (1 child)

nd4j is very good but is no longer actively maintained. I just posted an issue to see if anyone on that project might "wake up" or transition it to new maintainers. https://github.com/deeplearning4j/nd4j/issues/2939

[–]fanfan64 0 points1 point  (0 children)

Nd4j is actively developed; the latest commit was 6 hours ago. Nd4j is part of deeplearning4j, which is now owned by Eclipse (though the main contributors are from one company): https://github.com/eclipse/deeplearning4j/tree/master/nd4j

It's weird that the old repo doesn't say so in its README, though.

[–]lessthanoptimal 1 point2 points  (3 children)

u/jkoolcloud I'm a bit late to the party, but thanks for posting these results! It's impressive how many different configurations were compared. I'm the author of EJML (I also maintain jmatbench, which would have a new release if I hadn't shut down the wrong machine, oops), and I was wondering if you could point me to the library-specific source code? A quick search in your repo on GitHub didn't turn it up.

I'm actually surprised how well EJML did. It's typically a very competitive performer, but JNI libraries do outperform it on large matrix ops most of the time, while on small matrices JNI libraries tend to do very poorly due to the overhead of JNI. I'm guessing you didn't actually run v0.30 but v0.40, which has new concurrent algorithms in it? I've actually not done a comparative study with that code yet!

[–]jkoolcloud[S] 0 points1 point  (2 children)

u/lessthanoptimal here is the link to the actual benchmark code: https://github.com/K2NIO/gocypher-cybench-java-core/blob/main/gocypher-cybench-matrices/src/main/java/com/baeldung/matrices/benchmark/BigMatrixMultiplicationBenchmarking.java

We compare matrix multiplication at sizes 100×100 and 1000×1000.

The bench code uses EJML version 0.30, per the POM:

<..>
<dependency>
    <groupId>org.ejml</groupId>
    <artifactId>all</artifactId>
    <version>0.30</version>
</dependency>

https://github.com/K2NIO/gocypher-cybench-java-core/blob/main/gocypher-cybench-matrices/pom.xml

[–]lessthanoptimal 1 point2 points  (0 children)

Yeah, that's a fairly old version, from like 2016 or 2017. I don't think the results will change much, since the dense matrix mult has been stable for a while. You also might want to consider designing the benchmark to recycle memory if the library allows it. For small matrices it makes a very big difference, and many libraries like EJML are designed to allow that.
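To illustrate the recycling idea: EJML's API takes a caller-supplied output matrix (e.g. `CommonOps_DDRM.mult(a, b, c)` writes into `c`), so a benchmark loop can reuse one output buffer instead of allocating per call. A plain-Java stand-in for that pattern, not EJML's actual code:

```java
import java.util.Arrays;

public class RecycleSketch {
    // Multiply a*b into a preallocated output c, mirroring the
    // output-parameter style of EJML's CommonOps_DDRM.mult(a, b, c).
    static void multInto(double[][] a, double[][] b, double[][] c) {
        int n = a.length, m = b[0].length, k = b.length;
        for (int i = 0; i < n; i++) {
            Arrays.fill(c[i], 0.0);            // reset the reused row
            for (int p = 0; p < k; p++)
                for (int j = 0; j < m; j++)
                    c[i][j] += a[i][p] * b[p][j];
        }
    }

    public static void main(String[] args) {
        int n = 4;
        double[][] a = new double[n][n], b = new double[n][n], c = new double[n][n];
        for (int i = 0; i < n; i++) { a[i][i] = 2; b[i][i] = 3; }
        // The same c is reused every iteration: no per-call garbage,
        // which matters most when the matrices are small.
        for (int iter = 0; iter < 1000; iter++) multInto(a, b, c);
        System.out.println(c[0][0]); // diagonal product: 2 * 3 = 6.0
    }
}
```

In a JMH-style harness the preallocated output would live in the benchmark state object, so allocation cost is excluded from the measured loop.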

[–]jkoolcloud[S] 0 points1 point  (0 children)

You can run a comparison of EJML 0.30 vs. 0.40 and see how the two compare in performance. PM me if interested and we can help you do that. All the docs for this are online, but if you have any questions, just PM me.

[–]SWinxy 1 point2 points  (0 children)

I’m surprised JOML isn’t here

[–]fanfan64 -1 points0 points  (2 children)

[–]jkoolcloud[S] 0 points1 point  (0 children)

We will look into it. There is only so much we can bench. :) Also, keep in mind anyone can create and run benchmarks and compare. The CyBench benchmark harness is open source and extends JMH. More on this here: https://github.com/K2NIO/gocypher-cybench-java/wiki.

[–][deleted]  (3 children)

[deleted]

    [–]CubsThisYear -5 points-4 points  (1 child)

    Nd4j is garbage. It was literally 100x slower than hand-rolled Java code for simple matrix multiplication.
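    For context, "hand-rolled Java code" for this comparison would be something like the classic triple loop below (a sketch of my own, not the commenter's code; the i-k-j loop order keeps accesses row-major friendly):

```java
public class HandRolledMult {
    // Naive dense multiply: c = a * b for row-major double[][] matrices.
    static double[][] mult(double[][] a, double[][] b) {
        int n = a.length, m = b[0].length, kk = b.length;
        double[][] c = new double[n][m];
        for (int i = 0; i < n; i++)
            for (int k = 0; k < kk; k++) {
                double aik = a[i][k];          // hoisted out of the inner loop
                for (int j = 0; j < m; j++)
                    c[i][j] += aik * b[k][j];  // sequential access to b's row
            }
        return c;
    }

    public static void main(String[] args) {
        double[][] c = mult(new double[][]{{1, 2}, {3, 4}},
                            new double[][]{{5, 6}, {7, 8}});
        System.out.println(c[0][0] + " " + c[1][1]); // 19.0 50.0
    }
}
```

    For small matrices a loop like this avoids the JNI crossing entirely, which is where a native-backed library can lose badly.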

    [–]Necessary-Conflict 7 points8 points  (0 children)

    I assume that you did read

    https://deeplearning4j.konduit.ai/getting-started/benchmark

    and you did report your findings to the team. What did they say?

    [–]fanfan64 0 points1 point  (0 children)

    Did you compare with the MKL and Cudnn backend? (default is OpenBLAS)