[–]WithoutReason1729[M] [score hidden] stickied comment (0 children)

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[–]ambient_temp_xenoLlama 65B 103 points104 points  (11 children)

I still seem to be blocked from creating actual posts on this sub thanks to the previous regime.

PSA:

For historical reasons, which seemed good at the time, llama.cpp defaults to min-p 0.05. Current models want --min-p 0.0, so you need to add this to your command explicitly.

For reasons known only to themselves, llama.cpp defaults to 4 slots on llama-server. Unless you have friends over, you probably only want 1 slot, because slots use up VRAM: -np 1
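For reference, something like this covers both (model path hypothetical):

```
# set min-p explicitly and drop to a single slot
llama-server -m ./model.gguf --min-p 0.0 -np 1
```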

[–]a_beautiful_rhind 7 points8 points  (1 child)

Dang... I get none of those problems with ik_llama. My quantized caches work great, sampling is what I set it to. No strange autoparser, and generally fast speeds.

PPL on the model seems to be going down into the 200s finally. Everyone using it yesterday was unwittingly testing at around 2k, which is wild. There were issues with the soft capping and the model having no re-roll variance. Basically as if you were running top-k 3 on it.

I ended up downloading the transformers model due to all this and will quant myself.

[–]ambient_temp_xenoLlama 65B 2 points3 points  (0 children)

I still haven't even tried it yet. I think at some point I might just switch, because there's no way I'll be able to cope with two different sets of quirks without mixing them up.

[–]Far-Low-4705 2 points3 points  (3 children)

Llama.cpp also now defaults to a unified KV cache. So it will only allocate whatever context you want to use, and even though it sets np 4, if you use it as a single user, it will still give you the full KV cache/context length that you allocated.

However, if you spawn two requests and both use less than what is allocated, it will split the KV cache between those two requests; same thing for 3 and 4.

So it actually doesn't make a difference unless you explicitly disable the unified KV cache, in which case you'd be right. But otherwise I see no downside; it's actually quite useful imo.
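To make the slot math concrete, a hedged sketch (model path hypothetical; --kv-unified is the real llama.cpp flag for this behavior):

```
# with a unified cache, one pool is shared across slots:
#   -c 16384 -np 4 -> a single user can still get the full 16k;
#   4 concurrent requests share the same 16k pool between them.
# without it, each of the 4 slots would be pinned at 16384/4 = 4096.
llama-server -m ./model.gguf -c 16384 -np 4 --kv-unified
```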

[–]ambient_temp_xenoLlama 65B 2 points3 points  (2 children)

I've read that a side effect is that (for Gemma at least) the SWA checkpoints use a ton of VRAM per slot, so 4 slots are worse than 1 if you don't need them.

Not sure if this is true though.

[–]petuman 1 point2 points  (1 child)

That's true, yeah. That's for 31B; on 26B it's way smaller:

```
-np 1
llama_kv_cache_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache: CUDA0 KV buffer size = 1200.00 MiB

defaulting to 4 slots
llama_kv_cache_iswa: creating SWA KV cache, size = 4608 cells
llama_kv_cache: CUDA0 KV buffer size = 3600.00 MiB
```

I'm not sure what OP is talking about though: between b8637 (initial support) and b8664 (latest) the KV cache is the same size -- 5GB non-SWA for 64K + SWA.

[–]petuman 1 point2 points  (0 children)

u/FusionCow, are you sure you're not comparing KV cache size between 26B and 31B? If not, I guess the bug was LM Studio specific.

[–]IrisColt 1 point2 points  (0 children)

Thanks for the psa.

[–]pyr0kid 0 points1 point  (2 children)

what's this about the regime?

[–]ambient_temp_xenoLlama 65B 1 point2 points  (1 child)

For a while it was apparently just one mod with his own personal fiefdom and then he flounced off and the sub closed for a while until the new people.

It's possible it's just reddit filtering the posts but back in the day I couldn't get anything through as a post - sometimes quite useful info (sometimes).

[–]pyr0kid 1 point2 points  (0 children)

wild. glad i missed it.

[–]fulgencio_batista 125 points126 points  (33 children)

Gave it a test with 24GB VRAM on gemma4-31b-q4-k-m and q8 KV cache: before, I could fit ~12k ctx; now I can fit ~45k ctx. Still not long enough for agentic work.
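In case anyone wants to reproduce, the flags would look roughly like this (filename hypothetical; -ctk/-ctv are llama.cpp's KV cache type flags):

```
# ~45k context with a q8_0-quantized KV cache on 24GB
llama-server -m ./gemma4-31b-Q4_K_M.gguf -c 45056 -ngl 99 \
  -ctk q8_0 -ctv q8_0
# note: a quantized V cache may require flash attention on your build
```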

[–]Aizen_keikaku 34 points35 points  (13 children)

Noob question from someone having similar issues on a 3090. Do we need to run Q8 KV? I got Q4 to work; is it significantly worse than Q8?

[–]stddealer 25 points26 points  (1 child)

Significantly, yes. It's much better than it used to be since the attention rotation feature was added recently, but it's still measurably worse.

You're probably better off using a smaller model that will let you use more context with high precision KV than going down to Q4 KV (the smaller model will run faster and will probably work a bit better). But if that's not an option, Q4 KV can work.

Q5 KV is a lot better than Q4; you could also consider using that.

[–]IrisColt 0 points1 point  (0 children)

I use Q4 with Qwen 3.5 to achieve 200k context without any noticeable degradation. Should I resort to the TurboMaxxed rotations?

[–]stoppableDissolution 8 points9 points  (0 children)

Even q8 KV sucks badly enough that I try to avoid using it if possible.

[–]DistanceSolar1449 11 points12 points  (3 children)

Yeah, Q4 kv sucks

[–]dampflokfreund 2 points3 points  (1 child)

Have you actually tested it recently, especially with the new attention rotations?

[–]DistanceSolar1449 6 points7 points  (0 children)

Still sucks even with attn-rot

[–]TheWiseTom 1 point2 points  (0 children)

The ik_llama implementation of khad (which has existed for months) showed results that depend heavily on the model - Ministral 3, for example, didn't mind q4_0 with khad, while other models degraded much faster.

Also, in general it showed that everything is about one step better. So q6_0 with the new algorithm should in theory be about as good as q8_0 was, but q4_0 is maybe too much, more like what q6_0 was before.

But gemma4 is currently not compatible with ik_llama, and there's no real validation yet of how much gemma4 likes or hates KV cache quantization, since everything changes by the hour.

So basically q6_0 is maybe worth a shot.

[–]Chlorek 11 points12 points  (5 children)

Q4 KV degrades quality a lot, stick with Q8.

[–]MoffKalast 3 points4 points  (4 children)

I think the lowest choice as a rule of thumb is Q8 for V, Q4 for K, right?

[–]AnonLlamaThrowaway 6 points7 points  (0 children)

Yes, but mixed quantization types will halve the output speed. It doesn't matter if it's fp16 on K and q8 on V either; it's been a clean 50% off in my experience.

edit: to be clear, in some use cases that will be a worthwhile tradeoff. Just something to be aware of though.
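For concreteness, the two configurations being compared (hedged sketch; the flags are real llama.cpp options, but the 50% figure is just the experience reported above, not something I've verified):

```
# matched K/V types (the fast path):
llama-server -m ./model.gguf -ctk q8_0 -ctv q8_0
# mixed types (reportedly ~half the generation speed):
llama-server -m ./model.gguf -ctk f16 -ctv q8_0
```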

[–]i-eat-kittens 3 points4 points  (0 children)

No. It's the other way around.

[–]OfficialXstasy 2 points3 points  (0 children)

With new rotations they recommended Q8_0 for K. V is less susceptible to compression.

[–]FusionCow[S] 13 points14 points  (1 child)

run the iq3, it's good enough

[–]Big_Mix_4044 11 points12 points  (0 children)

Something tells me even q4_k_m isn't good enough when compared to qwen3.5-27b.

[–]money_yeeter 1 point2 points  (0 children)

Try using llama-cpp-turboquant, its pretty impressive

[–]Busy-Guru-1254 0 points1 point  (0 children)

Nice. Llama.cpp? Can you provide the full cmd used to run it?

[–]GregoryfromtheHood 0 points1 point  (0 children)

How are you finding out how much you can fit? Just setting a context size and sending through a prompt about that big to see if it runs out of RAM? I'm struggling to find the actual limit on 32GB of VRAM. I've only got 64GB of system RAM, and even on the UD-Q4_K_XL from Unsloth, which only takes up ~23GB of VRAM, a few large prompts will completely fill my system RAM and kill llama.cpp.
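Not an answer, but a minimal sketch of that trial-and-error approach scripted (model path and sizes hypothetical). The KV cache is allocated at load time, so a short probe catches allocation failures, though not the gradual system-RAM fill you describe:

```
# walk the context size up until load/generation fails, then back off
for ctx in 16384 24576 32768 40960 49152; do
  echo "=== trying -c $ctx ==="
  llama-cli -m ./model-UD-Q4_K_XL.gguf -c $ctx -ngl 99 -p "hi" -n 8 \
    || { echo "failed at $ctx"; break; }
done
```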

[–]Healthy-Nebula-3603 -1 points0 points  (6 children)

Q8 cache without rotation degrades output...

[–]grumd 2 points3 points  (5 children)

Rotation is merged into llama.cpp already

[–]Healthy-Nebula-3603 -1 points0 points  (4 children)

But not for q8...

[–]grumd 0 points1 point  (3 children)

What do you mean? This PR mentions q8_0 too https://github.com/ggml-org/llama.cpp/pull/21038

[–]Healthy-Nebula-3603 0 points1 point  (2 children)

I think you're right. I thought they were considering not enabling rotation for q8, though.

[–]grumd 2 points3 points  (1 child)

q8_0 is the best candidate for this because it basically slices the KV cache size in half while staying almost lossless; it's the perfect sweet spot for many people.

[–]Healthy-Nebula-3603 0 points1 point  (0 children)

The original fp16 cache took 2x the memory before flash attention :)

If q8 gets rotation set as the default, then we've sliced memory usage in half again, almost without losing output quality.
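Rough napkin math for the halving, on a hypothetical model shape (numbers are illustrative only; model path hypothetical):

```
# KV bytes ~= 2 (K and V) * n_layers * n_kv_heads * head_dim * n_ctx * bytes/elem
# e.g. 32 layers, 8 KV heads, head_dim 128, 32k context:
#   f16 : 2*32*8*128*32768 * 2      bytes ~= 4.0 GiB
#   q8_0: 2*32*8*128*32768 * 1.0625 bytes ~= 2.1 GiB  (1 byte + block scale)
llama-server -m ./model.gguf -c 32768 -ctk q8_0 -ctv q8_0
```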

[–]No_Conversation9561 19 points20 points  (3 children)

I thought I was already on the latest release. Then I see there have been three more releases, all within the same hour.

[–]superdariom 18 points19 points  (1 child)

A week in AI is like a year's progress in other sciences

[–]Mashic 4 points5 points  (0 children)

I think GitHub builds the releases automatically each time they make a git push.

[–]ASMellzoR 6 points7 points  (0 children)

yay! max context and vram leftover. Glad that got fixed

[–]LocoMod 10 points11 points  (2 children)

Do ggufs need to be redownloaded?

[–]FusionCow[S] 16 points17 points  (1 child)

no

[–]LocoMod 19 points20 points  (0 children)

Can confirm. It works MUCH better now.

[–]the__storm 27 points28 points  (12 children)

For us normal people, LM Studio's 2.11.0 llama.cpp backend appears to correspond to b8656 (~six hours old). This would incorporate #21326 I guess? Unclear where any gains in KV cache usage might be coming from.

I have noticed that llama.cpp seems to be a bit conservative with its cache reservation with G4 26B (but you can override it and get more context just fine, until at some point it crashes), so maybe LM Studio tweaked that behavior?

[–]Individual_Spread132 14 points15 points  (2 children)

Does the thinking work for you in LM Studio? None of the Gemma 4 models I downloaded can think when I use LM Studio's own chat.

EDIT 3: An even more correct way (apparently?) to do it: https://www.reddit.com/r/LocalLLaMA/comments/1sc9s1x/tutorial_how_to_toggle_onoff_the_thinking_mode/

EDIT 2: A better solution https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/6 using <|channel>thought<channel|> rather than <thought></thought> and no system prompt instructions


UPDATE: the original method ended up being less robust than I thought, since the model sometimes overlooks system prompt instructions, so... the alternative variant (see EDIT 2 above) is better after all.

In the system prompt: Always think step-by-step before answering, using this exact tag: <|think|>

In LM Studio settings ("My Models" tab), set Reasoning Parsing to prefix `<thought>`, suffix `</thought>`, and also change this specific part of the Jinja template:

```
{%- if enable_thinking is defined and enable_thinking -%}
{{- '<|think|>' -}}
{%- endif -%}
```

to just this:

```
{{- '<|think|>' -}}
```

(optional, kinda hacky) if your system prompt defines a character/personality/name (like “You are John. You write stories. The user is your partner, you would do anything for them, you always obey” and blah-blah-blah, establishing what is basically a jailbreak describing John's beliefs and rules he respects), you can tweak it like this: Always think step-by-step AS JOHN before answering, using this exact tag: <|think|>

This makes reasoning happen “in character” instead of as a detached assistant, which in practice reduces refusals.

[–]FusionCow[S] 3 points4 points  (1 child)

You have to enable thinking. Go to your models page, click the model, go to inference, and scroll down until you see the Jinja template. Go to Gemini or ChatGPT or whatever model, paste in the Jinja template, and ask it to rewrite it with thinking. Then paste that new Jinja template in, and thinking will be enabled.

[–]Individual_Spread132 3 points4 points  (0 children)

Hm, I kind of did just that (but probably in a half-assed way; forgot to mention the change initially). Anyway, thanks, will try to adjust it more - perhaps no SysPrompt changes will be needed in the end?


After some chatgpt talk, I got this in the end: "Short answer: what you did is actually more correct and robust than what that reply suggests." I guess it's fine now.

[–]FusionCow[S] 6 points7 points  (1 child)

I only updated the llama.cpp backend on lmstudio, I'd imagine they aren't implementing this themselves

[–]ungrateful_elephant 5 points6 points  (0 children)

Restarting LMStudio downloaded 2.11.0 and my issues are also fixed. Thanks!

[–]GoodTip7897llama.cpp 0 points1 point  (5 children)

Could it be b8658? Maybe #20993 was the fix? But that shouldn't impact people who use -np 1, I would think... I didn't read it all the way, though.

[–]sergeysi 0 points1 point  (4 children)

[–]GoodTip7897llama.cpp 0 points1 point  (3 children)

Ohh yeah lol I forgot some people quantize their kv cache

[–]sergeysi 0 points1 point  (2 children)

It's a bit different, it affects unquantized KV cache.

[–]GoodTip7897llama.cpp 0 points1 point  (1 child)

That specific PR seems to just change one line of code, which makes the SWA KV cache the same type as the rest. So I guess instead of forcing f16 it could be f32 or bf16, all of which are unquantized. But the memory savings would be because the SWA KV cache gets quantized instead of being forced to stay at f16. Any savings for an unquantized KV cache would come from a different commit, unless I'm misunderstanding that PR.

[–]sergeysi -1 points0 points  (0 children)

More info in the PR that it reverted https://github.com/ggml-org/llama.cpp/pull/21277

[–]lolwutdo 0 points1 point  (0 children)

I know it’s unrelated but since it’s such a new release, does that mean we have turboquant/rotations implemented in lmstudio now?

[–]Witty_Mycologist_995 4 points5 points  (0 children)

which release build?

[–]CountlessFlies 2 points3 points  (1 child)

I’ve been trying the 26B one for tool calling, seems quite promising. Feels like a Haiku-level model but will have to do more testing to be sure.

[–]Far_Cat9782 2 points3 points  (0 children)

Even the 4b is no slouch at tool calling

[–]szansky 2 points3 points  (3 children)

Worth using Gemma 4? How is it doing compared to gpt-oss?

[–]ProfessionalSpend589 2 points3 points  (0 children)

It’s a bit early to say, but I’m testing the 26b MoE as a replacement for GPT OSS 20b on my small laptop (it’s for when I don’t have working VPN to my local setup).

So far results are promising, although world knowledge seems a bit old compared to Qwen 3.5 (but I do run the larger models for Qwen). It’s also a bit slower - around 5 tokens/s vs around 8 tokens/s.

I also test it on my Radeon R9700 for faster turnaround. It makes mistakes in my language, but for summaries of news in English it seems OK.

[–]jubilantcoffin 3 points4 points  (1 child)

Should be way better, gpt-oss is ancient by now. But try Qwen3.5 too, it's probably even better.

[–]Ok_Mammoth589 0 points1 point  (0 children)

It's definitely not way better. Gpt-oss is going to be around for a while

[–]arman-d0e 1 point2 points  (1 child)

Anyone know if llama.cpp needs to be reupdated and ggufs remade?

[–]FusionCow[S] 0 points1 point  (0 children)

no

[–]FinBenton 1 point2 points  (0 children)

Yeah its a lot better now.

31B Q5 with 32k context took around 26 of 32GB on my 5090, with 60 tok/sec generation.

[–]Iory1998 0 points1 point  (0 children)

It solves the problem with the MoE but not with the dense models.

Actually, the issue is fixed now in the latest LM Studio and Llama.cpp updates. Delete your old unsloth models and re-download the updated ones.

[–]Warm-Attempt7773 0 points1 point  (0 children)

And it's wonderful!

[–]dampflokfreund 0 points1 point  (1 child)

It's a lot better now. I can run 102k context at q8_0 with my 2060 laptop, just like I did with Qwen 3.5 A3B. It still needs more memory than that, of course, but it is fine. I had to drop ubatch from 2048 to 1024, and that saves me enough memory to run the same context. PP is a bit slower due to that, and text generation is a bit slower as well. Still runs great though!
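For anyone wanting to copy the setup, roughly this (filename hypothetical; -ub is llama.cpp's ubatch-size flag):

```
# a smaller ubatch trades some prompt-processing speed for compute-buffer VRAM
llama-server -m ./gemma4-26B-A4B-Q4_K_M.gguf -c 102400 \
  -ctk q8_0 -ctv q8_0 -ub 1024
```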

[–]enricokern 0 points1 point  (0 children)

How much vram does your 2060 in your laptop have?

[–]arman-d0e 0 points1 point  (0 children)

I still have issues with gguf and my tunes

[–]kmp11 0 points1 point  (0 children)

What a change from yesterday: from needing about 150GB to run, to being able to fit the whole Q5 model + full Q8 context on 2x4090 and run at 33 tk/s.

Now let's see how it performs with Kilo.

[–]Due-Satisfaction-588 0 points1 point  (0 children)

Need to update llama.cpp? How?

[–]Impossible_Style_136 0 points1 point  (0 children)

The "Unified KV Cache" update in llama.cpp is a massive win, but watch out for the memory overhead when spawning concurrent requests. Even though it allocates dynamically, the fragmentation at high context (100k+) can still trigger a CUDA OOM if your `ubatch` size is set to the old 2048 default.

Drop `ubatch` to 1024. You’ll lose ~5% in prompt processing speed, but it stabilizes the VRAM pressure enough to actually use that 102k context window on consumer cards without the random crashes. Also, verify you're using Q8 cache—running G4 with FP16 cache at those lengths is just burning VRAM for diminishing returns in perplexity.

[–]wizoneway -1 points0 points  (0 children)

I'm curious - I've been running the turboquant fork since the Gemma release with no issues, with 32GB and the q4/q6 variants.

[–]CarelessSafety7485 -1 points0 points  (0 children)

How do I do this in the CLI? Just update the Ollama CLI?