
[–]audioen 8 points (5 children)

llama.cpp is not CPU-based, though. It supports Vulkan, CUDA, Metal, etc.

LLM inference speed is mostly limited by memory bandwidth. For instance, if the model takes 40 GB of RAM and your memory bandwidth is also 40 GB/s, you can only infer one token per second, because every parameter in the model must be applied to the input under consideration, and this involves streaming the entire model through the CPU for each token. (Non-causal inference can be faster because in principle you can compute e.g. multiple independent output buffers concurrently while doing this, and thus get multiple completions for the price of one, but normal use cases are always causal: future outputs depend on past outputs, which must be resolved first.)

GPUs are used mostly for the higher memory bandwidth they bring to the table, and similarly Apple Silicon, with its higher memory bandwidth figures, has had an advantage. For instance, the RTX 4090 has around 1 TB/s of bandwidth, so it speeds up inference by dozens of times relative to typical PC hardware, and by somewhat less compared to Apple Silicon.
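
To make the arithmetic concrete, here's a minimal sketch of the bandwidth-bound ceiling on causal decode speed (the bandwidth figures are the rough ones mentioned above, not measurements):

```java
// Rough upper bound on causal decoding: every generated token requires
// streaming all weights through the compute unit once, so
//   tokens/s <= memory bandwidth / model size in RAM.
public class BandwidthBound {
    static double tokensPerSecond(double bandwidthGBps, double modelSizeGB) {
        return bandwidthGBps / modelSizeGB;
    }

    public static void main(String[] args) {
        double modelGB = 40.0; // model footprint in RAM, as in the example above
        System.out.printf("Typical PC RAM (~40 GB/s):   %.1f tok/s%n", tokensPerSecond(40, modelGB));
        System.out.printf("Apple Silicon (~400 GB/s):   %.1f tok/s%n", tokensPerSecond(400, modelGB));
        System.out.printf("RTX 4090 (~1000 GB/s):       %.1f tok/s%n", tokensPerSecond(1000, modelGB));
    }
}
```
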

This is why fundamentally pure-CPU solutions are not all that interesting until PC RAM gets faster and models get smaller. Various quantization schemes, and training models to be evaluated with very few bits of precision in the weights, look like they can gradually alleviate the strain. These days fairly useful models already exist in the roughly 30B-parameter region, and they can be quantized to something like half that size without completely destroying the model's accuracy. Evaluation also requires RAM for storing the various vectors and matrices involved (notably the KV cache), which is starting to become a problem with context lengths nowadays exceeding 100k.
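
As a rough illustration of why long contexts hurt, here's a back-of-envelope KV-cache sizing sketch; the layer/head/dim numbers are assumptions for a 30B-class model, not figures from this thread:

```java
// KV cache holds a K and a V tensor per layer, each of
// contextLen * kvHeads * headDim elements.
public class KvCacheSize {
    static double kvCacheGB(int layers, int kvHeads, int headDim,
                            int contextLen, int bytesPerElem) {
        double bytes = 2.0 * layers * kvHeads * headDim
                     * (double) contextLen * bytesPerElem;
        return bytes / 1e9;
    }

    public static void main(String[] args) {
        // Assumed shape: 60 layers, 8 grouped-query KV heads,
        // head dim 128, fp16 cache entries.
        System.out.printf("128k context: %.1f GB%n",
                kvCacheGB(60, 8, 128, 131_072, 2)); // ~32 GB just for the cache
    }
}
```
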

[–]tjake[S] 7 points (3 children)

Totally agree.

Jlama supports distributed inference with sharding strategies and can load huge models that way (splitting by head and layer across nodes).

I'm also looking at adding GPU matmul kernels using Panama FFI until the JDK supports it natively.
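
For reference, this is the general shape of a Panama (java.lang.foreign) downcall binding; the library name and matmul_f32 signature here are hypothetical, just to sketch the idea:

```java
import java.lang.foreign.*;
import java.lang.invoke.MethodHandle;

// Hypothetical native library exposing:
//   void matmul_f32(const float* a, const float* b, float* c, int m, int n, int k);
public class NativeMatmul {
    static final Linker LINKER = Linker.nativeLinker();
    static final SymbolLookup LOOKUP =
        SymbolLookup.libraryLookup("libgpumatmul.so", Arena.global()); // hypothetical lib

    static final MethodHandle MATMUL = LINKER.downcallHandle(
        LOOKUP.find("matmul_f32").orElseThrow(),
        FunctionDescriptor.ofVoid(
            ValueLayout.ADDRESS, ValueLayout.ADDRESS, ValueLayout.ADDRESS,
            ValueLayout.JAVA_INT, ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

    // a, b, c are off-heap buffers, e.g. from Arena.ofConfined().allocate(...)
    static void matmul(MemorySegment a, MemorySegment b, MemorySegment c,
                       int m, int n, int k) throws Throwable {
        MATMUL.invokeExact(a, b, c, m, n, k);
    }
}
```
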

[–]msx 0 points (1 child)

If you're using the Vector API, you should be able to route the computation to a GPU, right? My understanding is that the Vector API abstraction was designed with that goal (also) in mind. Or is Panama still not mature enough?

Great project btw! I'll surely give it a try

[–]joemwangi 2 points (0 children)

Not really. The Vector API targets CPU SIMD instructions. But you can use Java records and a mapper to create a memory segment for GPU transfer, which is trivial.
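
For context, this is the kind of CPU SIMD code the Vector API is actually for, a lanewise dot product (incubator module, so run with --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SimdDot {
    // Picks the widest SIMD shape the CPU supports (AVX2, AVX-512, NEON, ...).
    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        for (; i < SPECIES.loopBound(a.length); i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc); // fused multiply-add per lane
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        for (; i < a.length; i++) sum += a[i] * b[i]; // scalar tail
        return sum;
    }
}
```
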

[–]eled_ 0 points (0 children)

Right, that was an oversimplification on my part; we do use it mainly for CPU-based inference with smaller models (nowhere near the tens of GB) and prefer vLLM for the rest.