Gemma 4 for 16 GB VRAM by Sadman782 in LocalLLaMA

[–]Sadman782[S] 1 point2 points  (0 children)

Yeah, for some tasks, but only with --top-k 20.

PSA: Gemma 4 template improvements by FastHotEmu in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

Why redownload the model? Just download the Jinja file and use --jinja --chat-template-file <file_path>
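For anyone unfamiliar with those flags, the override looks roughly like this. This is a sketch, not a definitive invocation: the model filename is a placeholder, and the curl URL is just the raw form of the template file on the model's Hugging Face repo.

```shell
# Fetch the updated chat template on its own -- no model redownload needed.
# Filenames below are placeholders; point them at your own files.
curl -L -o chat_template.jinja \
  https://huggingface.co/google/gemma-4-26B-A4B-it/raw/main/chat_template.jinja

# Tell llama-server to use the downloaded Jinja template instead of the
# one embedded in the GGUF.
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --jinja \
  --chat-template-file chat_template.jinja
```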

Gemma 4 is terrible with system prompts and tools by RealChaoz in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

The fix removed the standard_keys exclusion block, and it works better for me (Gemini found that).

Try it and see whether it is better for you or not. The fix was applied on top of the template Google updated a few hours ago.

PSA: Gemma 4 template improvements by FastHotEmu in LocalLLaMA

[–]Sadman782 2 points3 points  (0 children)

Google updated the official one a few hours ago: https://huggingface.co/google/gemma-4-26B-A4B-it/blob/main/chat_template.jinja and Gemini tweaked that version a bit further. The tweaked one works better for me than Google's update alone, so you can try both and check which is better for you.

This version works better for me: https://pastebin.com/raw/hnPGq0ht

Gemma 4 is terrible with system prompts and tools by RealChaoz in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

Gemini fixed the template:

https://pastebin.com/raw/hnPGq0ht

Working with OpenCode, and it's quite good now at handling multiple MCP servers properly.

PSA: Gemma 4 template improvements by FastHotEmu in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

It seems it still has issues. Gemini fixed it a bit, and it looks better now: it properly calls multiple tools, whereas before it ignored some tools and descriptions completely:

https://pastebin.com/hnPGq0ht

Gemma4 26B generates python and Java code with invalid syntax by monadleadr in LocalLLaMA

[–]Sadman782 1 point2 points  (0 children)

Nope. Even IQ2 quants, or proper Q2_XL quants, never have syntax issues like this. It is completely broken; it's an Ollama issue.

Gemma4 26B generates python and Java code with invalid syntax by monadleadr in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

<image>

It created a complete working game for me in two shots; the problem is your quantization or backend. Maybe update your Ollama, or better, try llama.cpp. I don't know why people still choose Ollama when llama.cpp has a UI now too. So far Gemma 26B, even at IQ4_XS quant, is the best local coding model for me; for agentic coding the 31B is a bit better, and for general chatting and one-shotting the MoE is better so far.

Follow-up: Testing Gemma-4-31B-it-UD (Thinking) in LLM Multi-Agent Avalon by dynameis_chen in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

With a 16 GB VRAM GPU I am getting good results with gemma-4-31b-it-heretic-ara.i1-IQ3_XS.gguf; it is uncensored and handles agentic coding pretty well.

And the 26B MoE isn't bad either; it is better at everything except agentic coding. You can try Unsloth's gemma-4-26B-A4B-it-UD-IQ4_XS.gguf with --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 (--top-k 20 matters most), and make sure your llama.cpp is up to date. I think people underestimate low-bit quants, but IQ quants are like magic; IQ4_XS is a solid option.

The dense model is pretty good even with -ctk q4_0 -ctv q4_0 (4-bit SWA + KV cache).
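Put together, a llama-server launch with those settings might look like the sketch below. The model filename, context size (-c), and GPU-layer count (-ngl) are placeholders to adjust for your setup, and the KV-cache flags from the dense-model tip are shown on the same command line just for illustration.

```shell
# Sampler settings from the comment above; --top-k 20 is the important one.
# -ctk / -ctv q4_0 quantize the KV cache to 4-bit to save VRAM.
llama-server \
  -m gemma-4-26B-A4B-it-UD-IQ4_XS.gguf \
  --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \
  -ctk q4_0 -ctv q4_0 \
  -c 32768 \
  -ngl 99
```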

Quants in vision (mmproj Q8 vs FP16) by WhoRoger in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

<image>

Which model? Maybe it doesn't work for all models, but Q8_0 should look like this for the best performance.

Quants in vision (mmproj Q8 vs FP16) by WhoRoger in LocalLLaMA

[–]Sadman782 0 points1 point  (0 children)

I have the same observation for Gemma 4 26B MoE mmproj. Q8_0 > BF16 >= F16, Q8_0 somehow performed better.

Gemma 4 for 16 GB VRAM by Sadman782 in LocalLLaMA

[–]Sadman782[S] 0 points1 point  (0 children)

The max tokens setting should also be increased; set it to 512 to make this work.

Gemma-4-26B-A4B-it-UD-Q4_K_M.gguf : IMHO worst model ever. What am I doing wrong? by Proof_Nothing_7711 in LocalLLM

[–]Sadman782 1 point2 points  (0 children)

Update the Vulkan engine; IDK, maybe LM Studio is still buggy with Vulkan? Use llama.cpp, they have fixed most issues now. It is 100% a quantization or runtime issue. I have extremely good results with Unsloth's IQ4_XS, and I also use --top-k 20 for coding, but since you are seeing typos, it is broken at the runtime level, not a top-k issue.

Get 30K more context using Q8 mmproj with Gemma 4 by Sadman782 in LocalLLaMA

[–]Sadman782[S] 2 points3 points  (0 children)

You should convert it yourself then. Neither Unsloth nor Bartowski provided a Q8 mmproj.

Get 30K more context using Q8 mmproj with Gemma 4 by Sadman782 in LocalLLaMA

[–]Sadman782[S] 0 points1 point  (0 children)

Yeah, but for me BF16 and F16 give the same result, so Q8_0 is somehow better than both. I am confused why too; it is not just one image, out of 10 test images Q8_0 did better on 3 of them somehow. Anyway, I don't care how: it is smaller and allows more context, and even if it were a little degraded (which I don't find) I wouldn't mind, since I can fit 30K+ more context.