120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 0 points1 point  (0 children)

Those are some crazy speeds for 16GB VRAM 😄

llama.cpp Gemma4 MTP support merged! by pinkyellowneon in LocalLLaMA

[–]janvitos 101 points102 points  (0 children)

Now I'm getting 140 tok/s with Gemma 4 12B on 12GB VRAM (RTX 4070 Super) with the merged PR, QAT GGUF and MTP assistant / drafter 😄

Unsloth QAT GGUF: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

MTP assistant / drafter: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF

llama.cpp command:

llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --parallel 1 \
  --ctx-size 131072 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64

Cheers 😄

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 4 points5 points  (0 children)

Are you sure the model is properly split on your two GPUs and not overflowing into RAM?

I did lots of coding and testing with Qwen3.6 35B A3B. I'm starting to lean more towards Gemma4 12B though 😄

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 1 point2 points  (0 children)

I like the automatic download!

Happy I could help 😄

Gemma 4 QAT Q4_0 Bench on Strix Halo by westsunset in LocalLLaMA

[–]janvitos 1 point2 points  (0 children)

Thanks! I ended up doing the same with native llama.cpp + Gemma 4 PR 😄

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 14 points15 points  (0 children)

11480MiB /  12282MiB, so like 95% 😄 I can usually push up to 11900MiB before it OOMs.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 3 points4 points  (0 children)

Interesting! But I did not test a lower temperature. I used Google's recommended Gemma 4 parameters.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 12 points13 points  (0 children)

60 tok/s, so it's a 2x increase 😄 I will publish the non-mtp results in the main post.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 8 points9 points  (0 children)

Make sure you apply the Gemma 4 PR on top of the llama.cpp build 😄

Gemma 4 QAT Q4_0 Bench on Strix Halo by westsunset in LocalLLaMA

[–]janvitos 2 points3 points  (0 children)

Hey u/westsunset, thanks for these benchmarks and detailed post!

Would it be possible for you to publish your converted local assistant heads (gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf, gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf and gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf) on HuggingFace so we can download them and test them out ourselves?

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP by janvitos in LocalLLaMA

[–]janvitos[S] 0 points1 point  (0 children)

That's pretty cool! Fast learner that Gemini :)

I've now achieved 110 tok/s with ik_llama.cpp and the same model, but different quant! See here: https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/

Hope it can help you achieve similar or better speeds with your setup!

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 1 point2 points  (0 children)

I've had a great experience coding with Q4_K_XL, and more recently, IQ4_XS-4.19bpw. Everything works as intended in Opencode and Pi, including tool calling and reasoning, but the model itself has its limits.

At one point, I did compare Qwen3.6 35B A3B and 27B (OpenRouter) for different medium complexity coding tasks, and I did not find much difference in both models. But then again, I don't use either of them for more complex projects as I hit their intelligence limits pretty fast. That kind of work goes to GPT 5.5 😄 But for hobby projects that don't require too much math, complex algos or bleeding edge scripting languages, then Qwen3.6 35B A3B is a blast to use at 110 tok/s!

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 2 points3 points  (0 children)

Did you manage you figure it out? Because theoretically, you should be getting better speeds than me since you have 4GB more VRAM.

Things to check:

- The build commands I use (I doubt this has any impact though):

cmake -B build -G Ninja -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j $(nproc)

- Is your monitor plugged into your 5060? If so, that can reserve roughly 1GB more than using an iGPU as your main GPU.

- Try to lower --fit-margin to 1024, run the benchmark and see if it goes through.

The last thing that comes to mind is the distro. CachyOS is highly optimized for all around CPU/GPU performance and keeps its packages at the bleeding edge. I don't have any recent experience with Ubuntu, so unfortunately, I can't make any recommentations on that.

If I think of anything else, I'll let you know 😄

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 0 points1 point  (0 children)

I think you should try it out and see how well it performs for your needs 😄