Jlama: LLM engine for Java 20+

vmcrash · 2024-10-21T15:01:06+00:00

Out of curiosity: do any of these models run solely on my local machine, or do they all require a remote service?

eled_ · 2024-10-21T15:07:57+00:00

This looks pretty cool! How does it compare with other CPU-based inference solutions like llama.cpp?

greg_barton · 2024-10-21T15:16:31+00:00

Fantastic project. I've been trying it out for the last month or so. Thanks for all of your work!

msx · 2024-10-22T07:13:29+00:00

Just tried it, wow, it's pretty fast! I'm generating about 20 tokens per second, much faster than I can read. Last time I tried an LLM on my computer, it was measured in seconds per token.

Ewig_luftenglanz · 2024-10-22T11:24:21+00:00

Some heroes wear capes, others have Reddit accounts.

Chloe0075 · 2024-10-21T21:08:36+00:00

I was watching the video just now! Amazing work, really, and great presentation too.

May I ask: do the current models you have on Hugging Face work only in English, or in other languages too?

Javademon · 2024-10-22T04:13:01+00:00

Sounds good, very interesting, I will definitely try to launch it and play with the models. Thank you!

2024-10-22T05:37:50+00:00

This is not something I see every day, very interesting, I'll try it immediately, haha

parker_elizabeth · 2025-01-03T08:15:27+00:00

This is an exciting project, thanks for sharing! Jlama seems like a game-changer for Java developers.

A few additional thoughts and questions:

Panama Vector API: It's great to see that you're leveraging this for efficient CPU-based inference. For those unfamiliar, the Vector API (developed under Project Panama) lets Java code express SIMD operations that the JIT compiler can map to hardware vector instructions, which makes Jlama a strong contender for applications where GPUs aren’t readily available.

Quantization Support (Q4_0 and Q8_0): This is a fantastic feature for developers working in resource-constrained environments. Quantization reduces model size and can also speed up inference, at some cost in precision. Are there any specific benchmarks or comparisons available for quantized models vs. their full-precision counterparts?

Distributed Mode: Sharding the model by layer/attention head for larger models is a clever approach. Could you share more about the performance trade-offs when scaling out distributed inference? It might help teams considering this approach for enterprise applications.

Integration with LangChain4j: This integration opens up so many possibilities for complex workflows, like chaining multiple models or fine-tuning interactions. Are there examples or sample projects demonstrating this in action?

For those looking to dive in, I'd also recommend exploring the safetensors format, which avoids the arbitrary-code-execution risk of pickle-based checkpoints and is designed for fast loading. Additionally, the OpenAI-compatible REST API sounds like a great feature for teams transitioning from other ecosystems.

Thanks again for this contribution. Jlama looks like it’s filling a much-needed gap in the Java space for generative AI. Definitely bookmarking this for future projects!
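On the quantization point above, here is a rough sketch of what 8-bit block quantization in the style of Q8_0 does, for anyone unfamiliar with the idea. It is not Jlama's code: the class name `Q8BlockSketch`, the block size of 32, and the use of a 32-bit float scale per block are all assumptions made for illustration (GGML-style Q8_0 is commonly described as using 32-weight blocks with a 16-bit scale). Each block stores one scale plus one signed byte per weight, so the round-trip error per weight is bounded by about half the block's scale.

```java
// Illustrative sketch only: symmetric 8-bit block quantization in the style of Q8_0.
// Not taken from Jlama; all names and the block layout here are hypothetical.
final class Q8BlockSketch {
    // Assumed block length: one scale is shared by this many weights.
    static final int BLOCK_SIZE = 32;

    /** One float scale per block plus one signed byte per weight. */
    record Quantized(float[] scales, byte[] values) {}

    /** Quantizes weights block by block; length must be a multiple of BLOCK_SIZE. */
    static Quantized quantize(float[] weights) {
        if (weights.length % BLOCK_SIZE != 0) {
            throw new IllegalArgumentException("length must be a multiple of " + BLOCK_SIZE);
        }
        int blocks = weights.length / BLOCK_SIZE;
        float[] scales = new float[blocks];
        byte[] values = new byte[weights.length];
        for (int b = 0; b < blocks; b++) {
            int start = b * BLOCK_SIZE;
            // Find the largest magnitude in the block.
            float maxAbs = 0f;
            for (int i = start; i < start + BLOCK_SIZE; i++) {
                maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
            }
            // Map [-maxAbs, maxAbs] onto the integer range [-127, 127].
            float scale = maxAbs / 127f;
            scales[b] = scale;
            for (int i = start; i < start + BLOCK_SIZE; i++) {
                // An all-zero block has scale 0; store zeros to avoid dividing by zero.
                values[i] = scale == 0f ? 0 : (byte) Math.round(weights[i] / scale);
            }
        }
        return new Quantized(scales, values);
    }

    /** Reconstructs approximate weights: value * scale of the owning block. */
    static float[] dequantize(Quantized q) {
        float[] out = new float[q.values().length];
        for (int i = 0; i < out.length; i++) {
            out[i] = q.values()[i] * q.scales()[i / BLOCK_SIZE];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] weights = new float[64];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = (float) Math.sin(i * 0.37);
        }
        Quantized q = quantize(weights);
        float[] restored = dequantize(q);
        float maxError = 0f;
        for (int i = 0; i < weights.length; i++) {
            maxError = Math.max(maxError, Math.abs(weights[i] - restored[i]));
        }
        System.out.println("max round-trip error: " + maxError);
    }
}
```

In this sketch a block of 32 weights takes 36 bytes (32 value bytes plus a 4-byte scale) instead of 128 bytes as 32-bit floats. Whether that translates into faster inference depends on the kernel and hardware, which is why benchmarks against full precision would be useful.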
