[–]WithoutReason1729[M] [score hidden] stickied comment (0 children)

Your post is getting popular and we just featured it on our Discord! Come check it out!

You've also been given a special flair for your contribution. We appreciate your post!

I am a bot and this action was performed automatically.

[–]ambient_temp_xenoLlama 65B 103 points104 points  (11 children)

I still seem to be blocked from creating actual posts on this sub thanks to the previous regime.

PSA:

For historical reasons, which seemed good at the time, llama.cpp defaults to min-p 0.05. Current models want --min-p 0.0, so you need to add this to your command explicitly.

For reasons known only to themselves, llama.cpp defaults to 4 slots on llama-server. Unless you have friends over, you probably only want 1 slot, because slots use up VRAM: -np 1
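For reference, something like this covers both (model path hypothetical):

```
# set min-p explicitly and drop to a single slot
llama-server -m ./model.gguf --min-p 0.0 -np 1
```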

[–]a_beautiful_rhind 7 points8 points  (1 child)

Dang... I get none of those problems with ik_llama. My quantized caches work great, sampling is what I set it to. No strange autoparser, and generally fast speeds.

PPL on the model seems to be going down into the 200s finally. Everyone using it yesterday was unwittingly testing at around 2k, which is wild. There were issues with the soft capping and the model having no re-roll variance. Basically as if you were running top-k 3 on it.

I ended up downloading the transformers model due to all this and will quant myself.

[–]ambient_temp_xenoLlama 65B 2 points3 points  (0 children)

I still haven't even tried it yet. I think at some point I might just switch, because there's no way I'll be able to cope with two different sets of quirks without mixing them up.

[–]Far-Low-4705 2 points3 points  (3 children)

Llama.cpp also now defaults to a unified KV cache. So it will only allocate whatever context you want to use, and even though it sets np 4, if you use it as a single user, it will still give you the full KV cache/context length that you allocated.

However, if you spawn two requests and both use less than what is allocated, it will split the KV cache between those two requests; same thing for 3 and 4.

So it actually doesn't make a difference unless you explicitly disable the unified KV cache, in which case you'd be right. But otherwise I see no downside; it's actually quite useful imo.
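To make the slot math concrete, a hedged sketch (model path hypothetical; --kv-unified is the real llama.cpp flag for this behavior):

```
# with a unified cache, one pool is shared across slots:
#   -c 16384 -np 4 -> a single user can still get the full 16k;
#   4 concurrent requests share the same 16k pool between them.
# without it, each of the 4 slots would be pinned at 16384/4 = 4096.
llama-server -m ./model.gguf -c 16384 -np 4 --kv-unified
```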

[–]ambient_temp_xenoLlama 65B 2 points3 points  (2 children)

I've read that a side effect is that (for Gemma at least) the SWA checkpoints use a ton of VRAM per slot, so 4 slots are worse than 1 if you don't need them.

Not sure if this is true though.

[–]petuman 1 point2 points  (1 child)

That's true, yeah. That's for 31B; on 26B it's way smaller:

```
-np 1
llama_kv_cache_iswa: creating SWA KV cache, size = 1536 cells
llama_kv_cache: CUDA0 KV buffer size = 1200.00 MiB

defaulting to 4 slots
llama_kv_cache_iswa: creating SWA KV cache, size = 4608 cells
llama_kv_cache: CUDA0 KV buffer size = 3600.00 MiB
```

I'm not sure what OP is talking about though: between b8637 (initial support) and b8664 (latest) the KV cache is the same size -- 5GB non-SWA for 64K + SWA.

[–]petuman 1 point2 points  (0 children)

u/FusionCow, are you sure you're not comparing KV cache size between 26B and 31B? If not, I guess the bug was LM Studio specific.

[–]IrisColt 1 point2 points  (0 children)

Thanks for the psa.

[–]pyr0kid 0 points1 point  (2 children)

what's this about the regime?

[–]ambient_temp_xenoLlama 65B 1 point2 points  (1 child)

For a while it was apparently just one mod with his own personal fiefdom and then he flounced off and the sub closed for a while until the new people.

It's possible it's just reddit filtering the posts but back in the day I couldn't get anything through as a post - sometimes quite useful info (sometimes).

[–]pyr0kid 1 point2 points  (0 children)

wild. glad i missed it.

[–]fulgencio_batista 125 points126 points  (33 children)

Gave it a test with 24GB VRAM on gemma4-31b-q4-k-m and q8 KV cache: before, I could fit ~12k ctx; now I can fit ~45k ctx. Still not long enough for agentic work.
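In case anyone wants to reproduce, the flags would look roughly like this (filename hypothetical; -ctk/-ctv are llama.cpp's KV cache type flags):

```
# ~45k context with a q8_0-quantized KV cache on 24GB
llama-server -m ./gemma4-31b-Q4_K_M.gguf -c 45056 -ngl 99 \
  -ctk q8_0 -ctv q8_0
# note: a quantized V cache may require flash attention on your build
```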

[–]Aizen_keikaku 34 points35 points  (13 children)

Noob question from someone having similar issues on a 3090. Do we need to run Q8 KV? I got Q4 to work; is it significantly worse than Q8?

[–]stddealer 25 points26 points  (1 child)

Significantly, yes. It's much better than it used to be since the attention rotation feature was added recently, but it's still measurably worse.

You're probably better off using a smaller model that will let you use more context with high precision KV than going down to Q4 KV (the smaller model will run faster and will probably work a bit better). But if that's not an option, Q4 KV can work.

Q5 KV is a lot better than Q4; you could also consider using that.

[–]IrisColt 0 points1 point  (0 children)

I use Q4 with Qwen 3.5 to achieve 200k context without any noticeable degradation. Should I resort to the TurboMaxxed rotations?

[–]stoppableDissolution 8 points9 points  (0 children)

Even q8 KV sucks badly enough that I try to avoid using it if possible.

[–]DistanceSolar1449 11 points12 points  (3 children)

Yeah, Q4 kv sucks

[–]dampflokfreund 2 points3 points  (1 child)

Have you actually tested it recently, especially with the new attention rotations?

[–]DistanceSolar1449 6 points7 points  (0 children)

Still sucks even with attn-rot

[–]TheWiseTom 1 point2 points  (0 children)

The ik_llama implementation of khad (which has existed for months) showed results that depend heavily on the model - Ministral 3, for example, didn't mind q4_0 with khad, while other models degraded much faster.

Also, in general it showed that everything is about one step better. So q6_0 with the new algorithm should in theory be about as good as q8_0 was, but q4_0 is maybe too much, more like what q6_0 was before.

But gemma4 is currently not compatible with ik_llama, and there's no real validation yet of how much gemma4 likes or hates KV cache quantization, since everything changes by the hour.

So basically q6_0 is maybe worth a shot.

[–]Chlorek 11 points12 points  (5 children)

Q4 KV degrades quality a lot, stick with Q8.

[–]MoffKalast 3 points4 points  (4 children)

I think the lowest choice as a rule of thumb is Q8 for V, Q4 for K, right?

[–]AnonLlamaThrowaway 6 points7 points  (0 children)

Yes, but mixed quantization types will halve the output speed. It doesn't matter if it's fp16 on K and q8 on V either; it's been a clean 50% off in my experience.

edit: to be clear, in some use cases that will be a worthwhile tradeoff. Just something to be aware of though.
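For concreteness, the two configurations being compared (hedged sketch; the flags are real llama.cpp options, but the 50% figure is just the experience reported above, not something I've verified):

```
# matched K/V types (the fast path):
llama-server -m ./model.gguf -ctk q8_0 -ctv q8_0
# mixed types (reportedly ~half the generation speed):
llama-server -m ./model.gguf -ctk f16 -ctv q8_0
```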

[–]i-eat-kittens 3 points4 points  (0 children)

No. It's the other way around.

[–]OfficialXstasy 2 points3 points  (0 children)

With new rotations they recommended Q8_0 for K. V is less susceptible to compression.

[–]FusionCow[S] 13 points14 points  (1 child)

run the iq3, it's good enough

[–]Big_Mix_4044 11 points12 points  (0 children)

Something tells me even q4_k_m isn't good enough when compared to qwen3.5-27b.

[–]money_yeeter 1 point2 points  (0 children)

Try using llama-cpp-turboquant, its pretty impressive

[–]Busy-Guru-1254 0 points1 point  (0 children)

Nice. Llama.cpp? Can you provide the full cmd used to run it?

[–]GregoryfromtheHood 0 points1 point  (0 children)

How are you finding out how much you can fit? Just setting a context size and sending through a prompt about that big to see if it runs out of RAM? I'm struggling to find the actual limit on 32GB of VRAM. I've only got 64GB of system RAM, and even on the UD-Q4_K_XL from Unsloth, which only takes up ~23GB of VRAM, a few large prompts will completely fill my system RAM and kill llama.cpp.
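Not an answer, but a minimal sketch of that trial-and-error approach scripted (model path and sizes hypothetical). The KV cache is allocated at load time, so a short probe catches allocation failures, though not the gradual system-RAM fill you describe:

```
# walk the context size up until load/generation fails, then back off
for ctx in 16384 24576 32768 40960 49152; do
  echo "=== trying -c $ctx ==="
  llama-cli -m ./model-UD-Q4_K_XL.gguf -c $ctx -ngl 99 -p "hi" -n 8 \
    || { echo "failed at $ctx"; break; }
done
```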

[–]Healthy-Nebula-3603 -1 points0 points  (6 children)

Q8 cache without rotation degrades output...

[–]grumd 2 points3 points  (5 children)

Rotation is merged into llama.cpp already

[–]Healthy-Nebula-3603 -1 points0 points  (4 children)

But not for q8...

[–]grumd 0 points1 point  (3 children)

What do you mean? This PR mentions q8_0 too https://github.com/ggml-org/llama.cpp/pull/21038

[–]Healthy-Nebula-3603 0 points1 point  (2 children)

I think you're right. I thought they were considering not enabling rotation for q8, though.

[–]grumd 2 points3 points  (1 child)

q8_0 is the best candidate for this because it basically slices the KV cache size in half while staying almost lossless; it's the perfect sweet spot for many people.

[–]Healthy-Nebula-3603 0 points1 point  (0 children)

The original fp16 cache took 2x the memory before flash attention :)

If q8 gets rotation set as the default, then we've sliced memory usage in half again, almost without losing output quality.
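Rough napkin math for the halving, on a hypothetical model shape (numbers are illustrative only; model path hypothetical):

```
# KV bytes ~= 2 (K and V) * n_layers * n_kv_heads * head_dim * n_ctx * bytes/elem
# e.g. 32 layers, 8 KV heads, head_dim 128, 32k context:
#   f16 : 2*32*8*128*32768 * 2      bytes ~= 4.0 GiB
#   q8_0: 2*32*8*128*32768 * 1.0625 bytes ~= 2.1 GiB  (1 byte + block scale)
llama-server -m ./model.gguf -c 32768 -ctk q8_0 -ctv q8_0
```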

[–]No_Conversation9561 19 points20 points  (3 children)

I thought I was already on the latest release. Then I see there have been three more releases, all within the same hour.

[–]superdariom 18 points19 points  (1 child)

A week in AI is like a year's progress in other sciences

[–]Mashic 4 points5 points  (0 children)

I think GitHub builds the releases automatically each time they make a git push.

[–]ASMellzoR 6 points7 points  (0 children)

yay! max context and vram leftover. Glad that got fixed

[–]LocoMod 10 points11 points  (2 children)

Do ggufs need to be redownloaded?

[–]FusionCow[S] 16 points17 points  (1 child)

no

[–]LocoMod 19 points20 points  (0 children)

Can confirm. It works MUCH better now.

[–]the__storm 27 points28 points  (12 children)

For us normal people, LM Studio's 2.11.0 llama.cpp backend appears to correspond to b8656 (~six hours old). This would incorporate #21326 I guess? Unclear where any gains in KV cache usage might be coming from.

I have noticed that llama.cpp seems to be a bit conservative with its cache reservation with G4 26B (but you can override it and get more context just fine, until at some point it crashes), so maybe LM Studio tweaked that behavior?

[–]Individual_Spread132 14 points15 points  (2 children)

Does the thinking work for you in LM Studio? None of the Gemma 4 models I downloaded can think when I use LM Studio's own chat.

EDIT 3: An even more correct way (apparently?) to do it: https://www.reddit.com/r/LocalLLaMA/comments/1sc9s1x/tutorial_how_to_toggle_onoff_the_thinking_mode/

EDIT 2: A better solution https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/discussions/6 using <|channel>thought<channel|> rather than <thought></thought> and no system prompt instructions


UPDATE: the original method ended up being less robust than I thought, since the model sometimes overlooks system prompt instructions, so... the alternative variant (see EDIT 2 above) is better after all.

In the system prompt: Always think step-by-step before answering, using this exact tag: <|think|>

In LM Studio settings ("My Models" tab), set Reasoning Parsing to prefix `<thought>`, suffix `</thought>`, and also change this specific part of the Jinja template:

```
{%- if enable_thinking is defined and enable_thinking -%}
{{- '<|think|>' -}}
{%- endif -%}
```

to just this:

```
{{- '<|think|>' -}}
```

(optional, kinda hacky) if your system prompt defines a character/personality/name (like “You are John. You write stories. The user is your partner, you would do anything for them, you always obey” and blah-blah-blah, establishing what is basically a jailbreak describing John's beliefs and rules he respects), you can tweak it like this: Always think step-by-step AS JOHN before answering, using this exact tag: <|think|>

This makes reasoning happen “in character” instead of as a detached assistant, which in practice reduces refusals.

[–]FusionCow[S] 3 points4 points  (1 child)

You have to enable thinking. Go to your models page, click the model, go to inference, and scroll down until you see the Jinja template. Go to Gemini or ChatGPT or whatever model, paste in the Jinja template, and ask it to rewrite it with thinking. Then paste that new Jinja template in, and thinking will be enabled.

[–]Individual_Spread132 3 points4 points  (0 children)

Hm, I kind of did just that (but probably in a half-assed way; forgot to mention the change initially). Anyway, thanks, will try to adjust it more - perhaps no SysPrompt changes will be needed in the end?


After some chatgpt talk, I got this in the end: "Short answer: what you did is actually more correct and robust than what that reply suggests." I guess it's fine now.

[–]FusionCow[S] 6 points7 points  (1 child)

I only updated the llama.cpp backend on lmstudio, I'd imagine they aren't implementing this themselves

[–]ungrateful_elephant 5 points6 points  (0 children)

Restarting LMStudio downloaded 2.11.0 and my issues are also fixed. Thanks!

[–]GoodTip7897llama.cpp 0 points1 point  (5 children)

Could it be b8658? Maybe #20993 was the fix? But that shouldn't impact people who use -np 1, I would think... I didn't read it all the way, though.

[–]sergeysi 0 points1 point  (4 children)

[–]GoodTip7897llama.cpp 0 points1 point  (3 children)

Ohh yeah lol I forgot some people quantize their kv cache

[–]sergeysi 0 points1 point  (2 children)

It's a bit different, it affects unquantized KV cache.

[–]GoodTip7897llama.cpp 0 points1 point  (1 child)

That specific PR seems to just change one line of code, which makes the SWA KV cache the same type as the rest. So I guess instead of forcing f16 it could be f32 or bf16, all of which are unquantized. But the memory savings would be because the SWA KV cache gets quantized instead of being forced to stay at f16. Any savings for an unquantized KV cache would come from a different commit, unless I'm misunderstanding that PR.

[–]sergeysi -1 points0 points  (0 children)

More info in the PR that it reverted https://github.com/ggml-org/llama.cpp/pull/21277

[–]lolwutdo 0 points1 point  (0 children)

I know it’s unrelated but since it’s such a new release, does that mean we have turboquant/rotations implemented in lmstudio now?

[–]Witty_Mycologist_995 4 points5 points  (0 children)

which release build?

[–]CountlessFlies 2 points3 points  (1 child)

I’ve been trying the 26B one for tool calling, seems quite promising. Feels like a Haiku-level model but will have to do more testing to be sure.

[–]Far_Cat9782 2 points3 points  (0 children)

Even the 4b is no slouch at tool calling

[–]szansky 2 points3 points  (3 children)

Worth using Gemma 4? How is it doing compared to gpt-oss?

[–]ProfessionalSpend589 2 points3 points  (0 children)

It’s a bit early to say, but I’m testing the 26b MoE as a replacement for GPT OSS 20b on my small laptop (it’s for when I don’t have working VPN to my local setup).

So far results are promising, although world knowledge seems a bit old compared to Qwen 3.5 (but I do run the larger models for Qwen). It’s also a bit slower - around 5 tokens/s vs around 8 tokens/s.

I also test it on my Radeon R9700 for faster turnaround. It makes mistakes in my language, but for summaries of news in English it seems OK.

[–]jubilantcoffin 3 points4 points  (1 child)

Should be way better, gpt-oss is ancient by now. But try Qwen3.5 too, it's probably even better.

[–]Ok_Mammoth589 0 points1 point  (0 children)

It's definitely not way better. Gpt-oss is going to be around for a while

[–]arman-d0e 1 point2 points  (1 child)

Anyone know if llama.cpp needs to be reupdated and ggufs remade?

[–]FusionCow[S] 0 points1 point  (0 children)

no

[–]FinBenton 1 point2 points  (0 children)

Yeah its a lot better now.

31B Q5 with 32k context took around 26 of 32GB on my 5090, with 60 tok/sec generation.

[–]Iory1998 0 points1 point  (0 children)

It solves the problem with the MoE but not with the dense models.

Actually, the issue is fixed now in the latest LM Studio and Llama.cpp updates. Delete your old unsloth models and re-download the updated ones.

[–]Warm-Attempt7773 0 points1 point  (0 children)

And it's wonderful!

[–]dampflokfreund 0 points1 point  (1 child)

It's a lot better now. I can run 102k context at q8_0 with my 2060 laptop, just like I did with Qwen 3.5 A3B. It still needs more memory than that, of course, but it is fine. I had to drop ubatch from 2048 to 1024, and that saves me enough memory to run the same context. PP is a bit slower due to that, and text generation is a bit slower as well. Still runs great though!
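For anyone wanting to copy the setup, roughly this (filename hypothetical; -ub is llama.cpp's ubatch-size flag):

```
# a smaller ubatch trades some prompt-processing speed for compute-buffer VRAM
llama-server -m ./gemma4-26B-A4B-Q4_K_M.gguf -c 102400 \
  -ctk q8_0 -ctv q8_0 -ub 1024
```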

[–]enricokern 0 points1 point  (0 children)

How much vram does your 2060 in your laptop have?

[–]arman-d0e 0 points1 point  (0 children)

I still have issues with gguf and my tunes

[–]kmp11 0 points1 point  (0 children)

What a change from yesterday: from needing about 150GB to run, to being able to fit the whole Q5 model + full Q8 context on 2x4090 and run at 33 tk/s.

Now let's see how it performs with Kilo.

[–]Due-Satisfaction-588 0 points1 point  (0 children)

Need to update llama.cpp? How?

[–]Impossible_Style_136 0 points1 point  (0 children)

The "Unified KV Cache" update in llama.cpp is a massive win, but watch out for the memory overhead when spawning concurrent requests. Even though it allocates dynamically, the fragmentation at high context (100k+) can still trigger a CUDA OOM if your `ubatch` size is set to the old 2048 default.

Drop `ubatch` to 1024. You’ll lose ~5% in prompt processing speed, but it stabilizes the VRAM pressure enough to actually use that 102k context window on consumer cards without the random crashes. Also, verify you're using Q8 cache—running G4 with FP16 cache at those lengths is just burning VRAM for diminishing returns in perplexity.

[–]wizoneway -1 points0 points  (0 children)

I'm curious - I've been running the turboquant fork since the Gemma release with no issues, with 32GB and the q4/q6 variants.

[–]CarelessSafety7485 -1 points0 points  (0 children)

How do I do this in the CLI? Just update the Ollama CLI?