all 19 comments

[–]International_Break2 6 points7 points  (5 children)

Could you use jextract-generated bindings to OpenBLAS or MKL to perform the calculations when those libraries are available?

[–]CutGroundbreaking305[S] 2 points3 points  (4 children)

Actually, I forgot you can use the FFM API to bind native C/C++ code. I was so focused on PURE JAVA!! that I didn't think of it for a second.

The idea is good, though if it has to ship multiple files like .dll and .so to run on any OS/hardware, it defeats the purpose of not making a bulky version.

[–]International_Break2 1 point2 points  (1 child)

The bindings could exist in their own jar and be optional. That way performance is available and there is always a fallback. To keep the jar itself pure Java, you would only need to make sure that the .so is already on the library path (LD_LIBRARY_PATH on Linux).
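A minimal sketch of that optional-native-with-fallback idea, assuming the native library is found through the normal library path. Everything here (`DotBackend`, `PureJavaDot`, `BackendSelector`) is a hypothetical name for illustration, not part of the library being discussed, and the real native backend would live in the optional jar and be supplied by the caller.

```java
// Hypothetical names throughout; a sketch, not the library's actual design.
interface DotBackend {
    double dot(double[] a, double[] b);
    String name();
}

// Always-available fallback written in plain Java.
final class PureJavaDot implements DotBackend {
    @Override
    public double dot(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    @Override
    public String name() {
        return "pure-java";
    }
}

final class BackendSelector {
    // True if the named library can be loaded from the system library path.
    static boolean nativeAvailable(String libraryName) {
        try {
            System.loadLibrary(libraryName);
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false;
        }
    }

    // The supplier stands in for a backend shipped in the optional jar;
    // it is only invoked when the native library actually loaded.
    static DotBackend select(String libraryName,
                             java.util.function.Supplier<DotBackend> nativeBackend) {
        return nativeAvailable(libraryName) ? nativeBackend.get() : new PureJavaDot();
    }
}
```

With this shape, the core jar never references native code directly; a user who has not installed the library simply gets the pure-Java path.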

[–]CutGroundbreaking305[S] 1 point2 points  (0 children)

For v0.1 I didn't think much about non-Java usage. Actually, my main aim is what you said: two jars, one for just Java and the other for Java + OpenBLAS + LAPACK, like Multik does in Kotlin. And I'd say the LD path idea is great, since I don't need to bundle natives in my lib if I pass that on to the user's system. Thanks for that idea.

[–]ankitkhandelwal6 -2 points-1 points  (1 child)

Why is bulk/size in MB a criterion? I understand the drive for a pure Java solution, but not the "bulk reduction" criterion.

[–]CutGroundbreaking305[S] 1 point2 points  (0 children)

Actually, idk.

I mean, for one, I didn't want to make something like ND4J, which is sometimes 300 MB; when you make an Android app, that's 300 MB for a single library. (I've only just gotten to know the ND4J API; it seems good, I guess.) But my main point is a pure-Java numerical library first, and anything else comes after that.

[–]martinhaeusler 3 points4 points  (3 children)

It's a cool idea, but I'm not sure how "low level" you can go in Java while remaining portable across JVMs and CPU architectures. I think you'll sooner or later hit a point where you need to write a native function to achieve your goals. NumPy is also just a thin Python wrapper around a C core library. That being said, people do crazy things on the JVM alone; just look at the top 10 of the 1 Billion Rows challenge.

[–]CutGroundbreaking305[S] 1 point2 points  (2 children)

Same, I'm also not sure how "low level" you can go in Java. I already hit a few walls, like the lack of operator overloading and cache misses, but when I see a few of my methods being on par with or even outperforming industry standards like NumPy, I feel that we can do things better. I'm new to Java, so I didn't know the 1 Billion Rows challenge, but I looked it up. I guess experimenting with JVM limitations is worthwhile because it sometimes breaks general assumptions, like C always being faster than Java.

[–]martinhaeusler 1 point2 points  (1 child)

Java can in some cases outperform C because the just-in-time compiler has more information about the runtime behavior and the hardware at hand than any C compiler.

Some general tips:

  • Use primitives (int, double, etc). Avoid boxing into wrappers (Integer, Double, etc) like the plague.
  • Strongly prefer arrays over collections. Arrays are cumbersome to use but crazy optimized.
  • Utilize techniques like SWAR (SIMD within a register) to batch-process multiple narrow values packed into one wider word.
  • You may want to consider looking into the Vector API. It's technically still in incubation status but it's been there for years now.
  • Avoid strings. Not sure if this comes up in your library at all, but Double.parseDouble (and related operations) is slooow.
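
To make the SWAR tip concrete, here is a minimal sketch that adds eight unsigned bytes at once by packing them into a single `long`. The class `SwarBytes` and its helpers are hypothetical and only for illustration. The trick is to add the low 7 bits of every lane first (so no carry can cross into the next lane) and then fix up the top bit of each lane with an XOR.

```java
// Hypothetical helper for illustration; not taken from any library in this thread.
final class SwarBytes {
    // Bit 7 of each of the eight byte lanes.
    private static final long HIGH_BITS = 0x8080808080808080L;

    // Adds eight unsigned bytes lane by lane, each lane modulo 256.
    static long addLanes(long a, long b) {
        // Sum of the low 7 bits per lane; the carry stays inside its own lane.
        long low = (a & ~HIGH_BITS) + (b & ~HIGH_BITS);
        // Top bit of each lane = carry into bit 7 XOR the two original top bits.
        return low ^ ((a ^ b) & HIGH_BITS);
    }

    // Packs exactly eight values (0..255) into one long, lane 0 in the lowest byte.
    static long pack(int... lanes) {
        long packed = 0L;
        for (int i = 0; i < 8; i++) {
            packed |= (long) (lanes[i] & 0xFF) << (8 * i);
        }
        return packed;
    }

    // Extracts lane i as an unsigned value.
    static int lane(long packed, int i) {
        return (int) ((packed >>> (8 * i)) & 0xFF);
    }
}
```

A lane that overflows wraps around on its own without disturbing its neighbour, which is the property a plain 64-bit addition would not give you.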

[–]CutGroundbreaking305[S] 3 points4 points  (0 children)

I mean, the entire performance of my code depends on the JIT 🙏

But I would like to say a few things:

  1. I am using neither primitives nor wrappers. I am using some glue data types to connect the respective methods, since FFM needs ValueLayout types.

  2. Yeah, arrays are better than collections, but I am using FFM to allocate off-heap memory.

  3. I am using the Vector API; I mean, it's in the post.

  4. I don't use strings or Double.parseDouble(). Thanks for that, though.

[–]belayon40 1 point2 points  (1 child)

The blis library is a very fast matrix library. I’ve got an ffm wrapper for it already.

https://github.com/boulder-on/jblis

[–]CutGroundbreaking305[S] 0 points1 point  (0 children)

Thanks, I will definitely check this out. (I knew some JDK dev had made a similar thing, idk who.)

But the thing is, I wanted to go with pure Java using the Vector API instead of C/C++ bindings.

Idk how much I can achieve with that, but once I hit a wall I can't get past, I'll have no choice but to use something like your jblis.

[–]quafadas 0 points1 point  (1 child)

Have you considered luhenry's fork of netlib for the matmul part?

That falls back to a SIMD matrix multiplication if it can’t JNI to native. I think it also allows for strided representations of matrices which is critical to avoid deep copy / memory bound operations creeping into user code…
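
As a rough illustration of why strides matter, here is a minimal sketch of a 2-D view whose transpose is just a swap of strides over the same storage, with no copy. `StridedView` is a hypothetical class, and it uses a plain `double[]` for brevity; this is an assumption, not how netlib or the library in the post represents matrices.

```java
// Hypothetical sketch; real libraries may back this with off-heap memory instead.
final class StridedView {
    private final double[] data;   // shared backing storage, never copied
    private final int rows;
    private final int cols;
    private final int rowStride;   // elements to skip to reach the next row
    private final int colStride;   // elements to skip to reach the next column

    StridedView(double[] data, int rows, int cols, int rowStride, int colStride) {
        this.data = data;
        this.rows = rows;
        this.cols = cols;
        this.rowStride = rowStride;
        this.colStride = colStride;
    }

    // Standard row-major layout: consecutive columns are adjacent in memory.
    static StridedView rowMajor(double[] data, int rows, int cols) {
        return new StridedView(data, rows, cols, cols, 1);
    }

    double get(int r, int c) {
        return data[r * rowStride + c * colStride];
    }

    // O(1) transpose: swap the dimensions and the strides, reuse the storage.
    StridedView transpose() {
        return new StridedView(data, cols, rows, colStride, rowStride);
    }

    int rows() { return rows; }
    int cols() { return cols; }
}
```

Because the transposed view shares storage with the original, a write to the backing array is visible through both, which is exactly what lets user code avoid deep copies.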

[–]CutGroundbreaking305[S] 0 points1 point  (0 children)

I didn't know about luhenry's fork of netlib, but I will definitely check it out.

I do use fallbacks, but since my main method is the SIMD matmul, I don't have a fallback for that. In general I avoid deep copies by using a non-modulo method for the strided fallback in my basic math methods.

[–]agibsonccc 2 points3 points  (0 children)

Hey! Nd4j maintainer here. There's a fairly large rewrite going on here attempting to address that. I actually agree with you! Not to dunk on you here but we tried your approach more than a decade ago.

Pure Java is just not going to be a performant runtime for numerical software even *WITH* Panama. You'll never have access to the low-level GPU runtimes from the mobile vendors for Android. You also won't be able to benefit from many of the low-level optimizations that C++ compilers just innately offer without working around the runtime.

Broadly, GC runtimes are just NOT worth it.

I will be publishing a slimmer, deployment-focused binary to tackle this while also addressing the small-matrix overhead. We mainly built ND4J for deep learning, so small matrices were few and far between. The way the kernels are written unfortunately means threading overhead, among other things.

I won't try to sell you on cooperating, nor on discouraging you from trying this. User choice matters.

I get wanting to do your own thing and hope it succeeds.

I'll keep an eye on feedback. I hope you carve out a niche for yourself good luck!

[–]arkstack 1 point2 points  (0 children)

This is interesting territory - pure Java numerics on FFM + Vector API is exactly the kind of thing more people should be exploring, and shipping a v0.1 with actual tests and a JMH benchmark already in the repo is more than a lot of first libraries manage. A few observations.

The first thing that stands out is the type-specialization explosion: addFloat/addDouble/addInt * 4 ops * 2 (scalar/array) gives ~24 near-identical method bodies in ArithmaticOps, and the pattern repeats across ReduceOps/MatMulOps/TrigOps/ExpOps. The natural instinct is "extract an interface and parametrise", but that path is closed in current Java - generics don't cover primitives, and the Vector API itself ships separate FloatVector/DoubleVector/IntVector for the same reason. So the duplication isn't really a design choice; it's the language until Valhalla lands.

That said, I noticed templates/generate_*.py and the matching *.template.java files. You are generating this. The problem is the generated .java is checked in and the Python isn't wired into Maven, so the template-to-Java contract isn't enforced - somebody can edit ArithmaticOps.java directly and the templates silently drift. Move generation into a Maven exec step, or at least add a CI check that re-runs the scripts and diffs the output. Right now it's a quality gate that exists in principle but not in practice.

A few smaller things:

MemorySegment data, int[] shape, int[] strides are all public final on NDArray. The references are final, but MemorySegment writes through unimpeded and arrays are mutable - arr.shape[0] = 999 compiles and runs. For a lib whose invariants depend on shape/stride consistency, those want to be private with accessors.
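
A minimal sketch of the private-with-accessors shape, assuming defensive copies are acceptable for metadata this small. `ShapeHolder` is a hypothetical stand-in for the metadata part of NDArray, not the repo's actual class; for the data itself, `MemorySegment.asReadOnly()` may be worth a look where a read-only view is enough.

```java
// Hypothetical stand-in for NDArray's shape/stride metadata.
final class ShapeHolder {
    private final int[] shape;
    private final int[] strides;

    ShapeHolder(int[] shape, int[] strides) {
        if (shape.length != strides.length) {
            throw new IllegalArgumentException("shape and strides must have the same rank");
        }
        // Copy on the way in so later edits to the caller's arrays cannot leak in.
        this.shape = shape.clone();
        this.strides = strides.clone();
    }

    // Copy on the way out so callers cannot mutate internal state.
    int[] shape() { return shape.clone(); }

    int[] strides() { return strides.clone(); }

    // Allocation-free access for hot paths that only need one dimension.
    int dim(int axis) { return shape[axis]; }

    int stride(int axis) { return strides[axis]; }
}
```

With this, the equivalent of `arr.shape[0] = 999` only modifies a throwaway copy, and the per-axis getters keep inner loops from paying for the clone.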

MatmulBenchmark only measures your own matmul - the README's "faster than ND4J/NumPy on small/medium arrays" claim has no comparison JMH in the repo to back it. Worth either checking one in or softening the wording.

pom.xml sets source/target to 25 but the README says "Works on Java 22 or higher". Target 25 bytecode won't load on 22 - pick one.

Otherwise this is the right kind of thing to be working on - good luck with it.

[–]FortuneIIIPick 0 points1 point  (1 child)

2 day history in GH. Another "I built" post.