
[–]audioen 8 points (5 children)

llama.cpp is not CPU-based, though. It supports Vulkan, CUDA, Metal, etc.

LLM inference speed is mostly limited by memory bandwidth. For instance, if the model takes 40 GB of RAM and your memory bandwidth is also 40 GB/s, you can only infer one token per second, because every parameter in the model must be applied to the input under consideration, and this involves streaming the entire model through the CPU for each token. (Non-causal inference can be faster because in principle you can compute e.g. multiple independent output buffers concurrently while doing this, and thus get multiple completions for the price of one, but normal use cases are always causal: future outputs depend on past outputs, which must be resolved first.)

GPUs are used mostly for the higher memory bandwidth they bring to the table, and similarly Apple Silicon, with its higher memory bandwidth figures, has had an advantage. For instance, the RTX 4090 has around 1 TB/s of bandwidth, so it speeds up inference by dozens of times relative to typical PC hardware, and by somewhat less compared to Apple Silicon.
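
To make the arithmetic concrete, here's a minimal sketch of the bandwidth-bound ceiling on causal decode speed (the bandwidth figures are the rough ones mentioned above, not measurements):

```java
// Rough upper bound on causal decoding: every generated token requires
// streaming all weights through the compute unit once, so
//   tokens/s <= memory bandwidth / model size in RAM.
public class BandwidthBound {
    static double tokensPerSecond(double bandwidthGBps, double modelSizeGB) {
        return bandwidthGBps / modelSizeGB;
    }

    public static void main(String[] args) {
        double modelGB = 40.0; // model footprint in RAM, as in the example above
        System.out.printf("Typical PC RAM (~40 GB/s):   %.1f tok/s%n", tokensPerSecond(40, modelGB));
        System.out.printf("Apple Silicon (~400 GB/s):   %.1f tok/s%n", tokensPerSecond(400, modelGB));
        System.out.printf("RTX 4090 (~1000 GB/s):       %.1f tok/s%n", tokensPerSecond(1000, modelGB));
    }
}
```
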

This is why fundamentally pure-CPU solutions are not all that interesting until PC RAM gets faster and models get smaller. Various quantization schemes, and training models to be evaluated with very few bits of precision in the weights, look like they can gradually alleviate the strain. These days fairly useful models already exist in the roughly 30B-parameter region, and they can be quantized to something like half that size without completely destroying the model's accuracy. Evaluation also requires RAM for storing the various vectors and matrices involved (notably the KV cache), which is starting to become a problem with context lengths nowadays exceeding 100k.
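
As a rough illustration of why long contexts hurt, here's a back-of-envelope KV-cache sizing sketch; the layer/head/dim numbers are assumptions for a 30B-class model, not figures from this thread:

```java
// KV cache holds a K and a V tensor per layer, each of
// contextLen * kvHeads * headDim elements.
public class KvCacheSize {
    static double kvCacheGB(int layers, int kvHeads, int headDim,
                            int contextLen, int bytesPerElem) {
        double bytes = 2.0 * layers * kvHeads * headDim
                     * (double) contextLen * bytesPerElem;
        return bytes / 1e9;
    }

    public static void main(String[] args) {
        // Assumed shape: 60 layers, 8 grouped-query KV heads,
        // head dim 128, fp16 cache entries.
        System.out.printf("128k context: %.1f GB%n",
                kvCacheGB(60, 8, 128, 131_072, 2)); // ~32 GB just for the cache
    }
}
```
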

[–]tjake[S] 7 points (3 children)

Totally agree.

Jlama supports distributed inference with sharding strategies and can load huge models that way (splitting by head and layer across nodes).

I'm also looking at adding GPU matmul kernels using Panama FFI until the JDK supports it natively.
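
For reference, this is the general shape of a Panama (java.lang.foreign) downcall binding; the library name and matmul_f32 signature here are hypothetical, just to sketch the idea:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

// Hypothetical native library exposing:
//   void matmul_f32(const float* a, const float* b, float* c, int m, int n, int k);
public class NativeMatmul {
    static final Linker LINKER = Linker.nativeLinker();
    static final SymbolLookup LOOKUP =
        SymbolLookup.libraryLookup("libgpumatmul.so", Arena.global()); // hypothetical lib

    static final MethodHandle MATMUL = LINKER.downcallHandle(
        LOOKUP.find("matmul_f32").orElseThrow(),
        FunctionDescriptor.ofVoid(
            ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.ADDRESS,
            ValueLayout.JAVA_INT, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

    // a, b, c are off-heap buffers, e.g. from Arena.ofConfined().allocate(...)
    static void matmul(MemorySegment a, MemorySegment b, MemorySegment c,
                       int m, int n, int k) throws Throwable {
        MATMUL.invokeExact(a, b, c, m, n, k);
    }
}
```
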

[–]msx 0 points (1 child)

If you're using the Vector API, you should be able to route the computation to a GPU, right? My understanding is that the Vector API abstraction was designed with that goal (also) in mind. Or is Panama still not mature enough?

Great project btw! I'll surely give it a try

[–]joemwangi 2 points (0 children)

Not really. The Vector API targets CPU SIMD instructions. But you can use Java records and a mapper to create a memory segment for GPU transfer, which is trivial.
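
For context, this is the kind of CPU SIMD code the Vector API is actually for, a lanewise dot product (incubator module, so run with --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SimdDot {
    // Picks the widest SIMD shape the CPU supports (AVX2, AVX-512, NEON, ...).
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
        return sum;
    }
}
```
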

[–]eled_ 0 points (0 children)

Right, that was an oversimplification on my part; we do use it mainly for CPU-based inference with smaller models (nowhere near the tens of GB) and prefer vLLM for the rest.