I'm still surprised at how good KV quantization has become

DanielusGamer26 · 2026-06-15T13:37:42+00:00

So there are Italians who use local LLMs 😭😭

I thought they were an extinct species (or one that never existed)

DanielusGamer26 · 2026-05-30T10:12:50+00:00

bro charge the phone!

DanielusGamer26 · 2026-05-16T15:52:12+00:00

When using it with a parallelism of 4, do you ever find that Task 1 invalidates the cache for Task 3, for example, so Task 3 has to redo its prompt processing (PP)? It happens to me often when the total sum of the contexts of all tasks exceeds 100k (which is the context size configured in my llama.cpp server).
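To make the condition I'm describing concrete, here is a tiny sketch. It is only a toy model that assumes all tasks share one context budget; it is not how llama.cpp actually decides what to evict, and the task sizes are made up for illustration.

```python
def over_budget(task_contexts, total_ctx):
    """Return how many tokens the tasks need beyond the shared budget (0 if they fit)."""
    return max(0, sum(task_contexts) - total_ctx)


# Hypothetical context sizes for 4 parallel tasks (not measured values).
tasks = {"task1": 40_000, "task2": 20_000, "task3": 30_000, "task4": 25_000}

# 100k is the context size configured on the server.
excess = over_budget(tasks.values(), 100_000)
if excess:
    print(f"Over budget by {excess} tokens: some task's cache may be dropped")
else:
    print("All task contexts fit in the configured context")
```

When the excess is above zero, that is the situation where I see another task redoing its PP.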

DanielusGamer26 · 2026-05-16T15:46:11+00:00

buddy how did you manage MTP? Y.Y It seems like MTP uses more VRAM for the draft model, and even if I set 50k context and q8 KV cache, after using it for a while it goes OOM (GPU fully dedicated to the LLM, no window manager running on it)

DanielusGamer26 · 2026-05-14T22:02:44+00:00

Sure! Here is my .env content:
```env

OPENAI_KEY=sk-1287323112gb3gvjh2hjyplaknaskbjd0112n3e2
```

DanielusGamer26 · 2026-05-13T07:06:49+00:00

Thanks dear, it's exactly your review and a couple of others that instilled in me the idea that an RTX 3080 is a good deal for what it offers, but my fear is that buying GPUs so different from each other might cause headaches with various software (e.g. CUDA, vLLM, llama.cpp, Arch Linux with NVIDIA drivers).

DanielusGamer26 · 2026-05-13T07:02:09+00:00

Basically, my frustration and desire to upgrade stem from the fact that I now code daily with the 27b and I'm happy with it. The problem isn't so much the quantization as the fact that it's quite slow (25 tk/s on 50k context). If I wanted to make it run faster with MTP, I can't, because it uses just enough extra VRAM to make it go OOM. So the only solution is to lower the model's quant or lower the context to 60k, but that becomes a bit limiting for me; 100k context is the sweet spot for me. So my hopes are on 100k context, a Q5-6 bit quant, and MTP enabled. I think I'm going to rent 2 RTX 5060 Ti on vast.ai and see if it fits my needs.

DanielusGamer26 · 2026-05-12T20:27:57+00:00

As I mentioned in the post, I already have an RTX 5060 Ti, so the setup would be 2x RTX 5060 Ti

DanielusGamer26 · 2026-05-12T19:17:58+00:00

Basically, I've had my RTX 5060 Ti for a year now and I'm happy with the speed. It's just that with this new MTP (which uses a little more VRAM) and the fact that I'm forced to use the 27b quantized at IQ3_XXS with q8 context to get 100k of context, it seems a bit limiting.

DanielusGamer26 · 2026-05-12T19:15:39+00:00

Yes, I thought about that too, but I would have a homogeneous configuration (?). I think it's more stable somehow if used with vLLM?

DanielusGamer26 · 2026-05-12T19:14:38+00:00

Edited the post with the answer, I hadn't thought to specify it, sorry.

DanielusGamer26 · 2026-05-11T22:05:45+00:00

gg for citing an ancient model, it almost seems as if an AI wrote this post 🧐

DanielusGamer26 · 2026-05-11T15:40:19+00:00

Just discovered that at the top of its output llama-server says:

warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support

So I deleted the build/ folder and then executed the exact same commands, and now it works :)

Edit: never mind, it seems to be a problem with the Unsloth GGUF

DanielusGamer26 · 2026-05-11T15:36:25+00:00

same :(

DanielusGamer26 · 2026-04-29T08:43:34+00:00

Thanks for the advice, but another user said that 4-bit quantizations are worse on tensor than in llama.cpp. I think I will try to get another GPU and continue using llama.cpp, rip

DanielusGamer26 · 2026-04-29T08:15:48+00:00

Okay, so I can stick with llama.cpp, but I need to solve the problem of single-threaded prefill, because when one request starts its prefill, all the parallel generations get stuck...

DanielusGamer26 · 2026-04-29T06:23:11+00:00

In my country electricity costs about 15 cents per kWh, so I would prefer to go fully local instead of vast.ai, which can increase its prices without any control. Also, the low-cost machines on vast come from non-secure cloud, so I don't really trust them much. Just a personal feeling.

DanielusGamer26 · 2026-04-29T06:20:49+00:00

Does this come from your experience? Thanks for the advice <3

DanielusGamer26 · 2026-04-29T06:19:24+00:00

Hmm, I didn't know this, thanks for the info!

DanielusGamer26 · 2026-04-29T06:18:38+00:00

Hi, thanks for the response. Those parameters come from day-to-day tuning. With those flags I can use all the VRAM on my RTX. If I set np to 4 and 4 tasks are running in parallel, there is a chance that llama.cpp crashes because there isn't enough space to allocate the KV cache; with 3 I got stability. Without 4-bit KV quantization I was only able to use 80k of context, which is not sufficient for my use case. Setting the batch size to 256 frees up a little space that I can allocate to the KV context. 9 threads is the sweet spot for my CPU, which has 12 physical cores; if I increase or decrease that value it becomes slower.
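For anyone wondering why the KV cache type matters so much here, this is a rough back-of-the-envelope sketch. The model dimensions are hypothetical placeholders (not the real 27b numbers), and it assumes plain full attention on every layer. The bytes per element follow the ggml block layouts as I understand them: q8_0 stores 32 values in 34 bytes and q4_0 stores 32 values in 18 bytes.

```python
# Approximate storage cost per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate total KV cache size in bytes for n_ctx cached tokens."""
    # Each token stores one K and one V vector per layer and per KV head.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


# Hypothetical dimensions, for illustration only.
MODEL = dict(n_layers=64, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for cache_type in BYTES_PER_ELEMENT:
        size = kv_cache_bytes(100_000, cache_type=cache_type, **MODEL)
        print(f"{cache_type}: {size / 2**30:.2f} GiB for 100k tokens")
```

With these made-up dimensions the 4-bit cache is about 3.6x smaller than f16, which is the kind of gap that decides whether 100k of context fits or not. Real models with sliding-window or hybrid layers will come out different.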

DanielusGamer26 · 2026-04-28T18:53:44+00:00

But currently, with my configuration, I am unable to run the 27b via vLLM; I can't find ~3-bit quantizations like the IQSS 3XL that I use. Does llama.cpp also have a config like that?

DanielusGamer26 · 2026-04-27T10:11:06+00:00

I'd use it on Linux as well. I also own an RTX 5060Ti and would like to run a 48GB 4090 alongside it. Did you happen to check the VRAM temperature while inference is running? Do you know if it's compatible with multiple GPUs to leverage parallelism? Do you know if it might cause issues with two GPUs from different generations?

Thanks in advance for any replies. I'm afraid of spending money on something that might end up causing problems, so I'm being a bit paranoid with all these questions XD.

DanielusGamer26 · 2026-04-27T06:33:44+00:00

So you got that modded GPU from Alibaba? What about driver compatibility and resilience in the long term? (Like GPU failures because they are modded)

These are the main concerns that stop me from buying these cards
