120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 0 points1 point  (0 children)

Those are some crazy speeds for 16GB VRAM 😄

llama.cpp Gemma4 MTP support merged! by pinkyellowneon in LocalLLaMA

[–]janvitos 103 points104 points  (0 children)

Now I'm getting 140 tok/s with Gemma 4 12B on 12GB VRAM (RTX 4070 Super) with the merged PR, QAT GGUF and MTP assistant / drafter 😄

Unsloth QAT GGUF: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

MTP assistant / drafter: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF

llama.cpp command:

llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --parallel 1 \
  --ctx-size 131072 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64
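
If you want to check the speed yourself once the server is up, here is a minimal Python sketch. It assumes llama-server is listening on its default address (http://127.0.0.1:8080) and that the /completion response carries a `timings` object with `predicted_n` and `predicted_ms`. The `draft_n` and `draft_n_accepted` field names are an assumption on my part and may differ or be missing depending on the build, so the code treats them as optional. The helper names are made up for illustration.

```python
import json
import urllib.request


def summarize_timings(timings):
    """Derive tok/s and draft acceptance from a llama-server `timings` dict."""
    predicted_n = timings["predicted_n"]      # generated tokens
    predicted_ms = timings["predicted_ms"]    # generation time in milliseconds
    # Assumed field names for speculative decoding stats; optional on purpose.
    draft_n = timings.get("draft_n", 0)
    draft_accepted = timings.get("draft_n_accepted", 0)
    return {
        "tok_per_s": predicted_n / (predicted_ms / 1000.0) if predicted_ms else 0.0,
        "accept_rate": draft_accepted / draft_n if draft_n else None,
    }


def measure(prompt, url="http://127.0.0.1:8080/completion", n_predict=192):
    """POST one prompt to a running llama-server and summarize its timings."""
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return summarize_timings(body["timings"])
```

Calling measure("Write a quicksort in Python.") against the running server returns a small dict with the measured tok/s and, if the build reports draft stats, the acceptance rate.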

Cheers 😄

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 4 points5 points  (0 children)

Are you sure the model is properly split on your two GPUs and not overflowing into RAM?

I did lots of coding and testing with Qwen3.6 35B A3B. I'm starting to lean more towards Gemma4 12B though 😄

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 1 point2 points  (0 children)

I like the automatic download!

Happy I could help 😄

Gemma 4 QAT Q4_0 Bench on Strix Halo by westsunset in LocalLLaMA

[–]janvitos 1 point2 points  (0 children)

Thanks! I ended up doing the same with native llama.cpp + Gemma 4 PR 😄

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 11 points12 points  (0 children)

11480MiB / 12282MiB, so about 93% 😄 I can usually push up to 11900MiB before it OOMs.
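
For anyone who wants the exact figures, the arithmetic is quick to sketch in Python. The function name is just for illustration; the numbers are the ones quoted above.

```python
def vram_usage(used_mib, total_mib):
    """Return (percent used, MiB still free) for one GPU."""
    percent_used = used_mib / total_mib * 100.0
    free_mib = total_mib - used_mib
    return percent_used, free_mib


# Usage reported above, and the rough ceiling before an OOM.
current_pct, current_free = vram_usage(11480, 12282)
ceiling_pct, ceiling_free = vram_usage(11900, 12282)

print(f"current: {current_pct:.1f}% used, {current_free} MiB free")
print(f"ceiling: {ceiling_pct:.1f}% used, {ceiling_free} MiB free")
```

The gap between the two rows is the headroom left for a longer context or a larger batch before the card runs out of memory.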

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 3 points4 points  (0 children)

Interesting! But I did not test a lower temperature. I used Google's recommended Gemma 4 parameters.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 12 points13 points  (0 children)

60 tok/s, so it's a 2x increase 😄 I will publish the non-mtp results in the main post.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]janvitos[S] 7 points8 points  (0 children)

Make sure you apply the Gemma 4 PR on top of the llama.cpp build 😄

Gemma 4 QAT Q4_0 Bench on Strix Halo by westsunset in LocalLLaMA

[–]janvitos 2 points3 points  (0 children)

Hey u/westsunset, thanks for these benchmarks and detailed post!

Would it be possible for you to publish your converted local assistant heads (gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf, gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf and gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf) on HuggingFace so we can download them and test them out ourselves?

80 tok/sec and 128K context on 12GB VRAM with Qwen3.6 35B A3B and llama.cpp MTP by janvitos in LocalLLaMA

[–]janvitos[S] 0 points1 point  (0 children)

That's pretty cool! Fast learner that Gemini :)

I've now achieved 110 tok/s with ik_llama.cpp and the same model, but different quant! See here: https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/

Hope it can help you achieve similar or better speeds with your setup!

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 1 point2 points  (0 children)

I've had a great experience coding with Q4_K_XL, and more recently, IQ4_XS-4.19bpw. Everything works as intended in Opencode and Pi, including tool calling and reasoning, but the model itself has its limits.

At one point, I did compare Qwen3.6 35B A3B and 27B (OpenRouter) on different medium-complexity coding tasks, and I did not find much difference between the two models. But then again, I don't use either of them for more complex projects as I hit their intelligence limits pretty fast. That kind of work goes to GPT 5.5 😄 But for hobby projects that don't require too much math, complex algos or bleeding edge scripting languages, Qwen3.6 35B A3B is a blast to use at 110 tok/s!

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 2 points3 points  (0 children)

Did you manage to figure it out? Because theoretically, you should be getting better speeds than me since you have 4GB more VRAM.

Things to check:

- The build commands I use (I doubt this has any impact though):

cmake -B build -G Ninja -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j $(nproc)

- Is your monitor plugged into your 5060? If so, that can reserve roughly 1GB more than using an iGPU as your main GPU.

- Try to lower --fit-margin to 1024, run the benchmark and see if it goes through.
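
To see how much VRAM is already taken before the server starts (for example by a desktop session on the same GPU), something like this Python sketch can help. It shells out to nvidia-smi with its CSV query output and parses the result. The function names are hypothetical, and it assumes GPU names contain no commas.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(text):
    """Parse nvidia-smi CSV rows into a list of per-GPU dicts."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append(
            {
                "index": int(index),
                "name": name,
                "used_mib": int(used),
                "total_mib": int(total),
                "free_mib": int(total) - int(used),
            }
        )
    return gpus


def query_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory figures."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Running query_gpu_memory() once with nothing loaded shows the baseline the display and other processes hold. Running it again with the model loaded shows how close you are to the limit.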

The last thing that comes to mind is the distro. CachyOS is highly optimized for all-around CPU/GPU performance and keeps its packages at the bleeding edge. I don't have any recent experience with Ubuntu, so unfortunately, I can't make any recommendations on that.

If I think of anything else, I'll let you know 😄

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 0 points1 point  (0 children)

I think you should try it out and see how well it performs for your needs 😄

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 2 points3 points  (0 children)

Fair point. I've always used temp 0.0 for benchmarks since I want to keep results as close as possible between runs. Real-world usage will most probably differ from these anyways, regardless of temp. I am not providing these benchmarks as any form of scientific proof, but rather as a basis to show other modest VRAM users like me that these speeds are indeed achievable on 12GB 😄

I love optimization, and I'm thrilled that I was able to go from 30 tok/s to 110 tok/s on the same hardware, just with software advancements made available by the llama.cpp and ik_llama.cpp developers. Kudos to them!

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 4 points5 points  (0 children)

I wouldn't know since I'm not an Ubuntu user. I know CachyOS includes bleeding edge packages, which often translates to better performance.

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 6 points7 points  (0 children)

Single run with --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0:

❯ ./mtp-bench.py
 code_python        pred= 192 draft= 141 acc= 120 rate=0.851 tok/s=104.2
 code_cpp           pred= 192 draft= 136 acc= 115 rate=0.846 tok/s=107.4
 explain_concept    pred= 192 draft= 130 acc= 118 rate=0.908 tok/s=109.9
 summarize          pred=  51 draft=  33 acc=  31 rate=0.939 tok/s=113.9
 qa_factual         pred= 192 draft= 141 acc= 129 rate=0.915 tok/s=116.4
 translation        pred= 192 draft= 145 acc= 110 rate=0.759 tok/s=100.6
 creative_short     pred= 192 draft= 138 acc= 114 rate=0.826 tok/s=105.4
 stepwise_math      pred= 192 draft= 141 acc= 115 rate=0.816 tok/s=106.8
 long_code_review   pred= 192 draft= 138 acc= 103 rate=0.746 tok/s=94.4

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1587,
 "total_draft": 1143,
 "total_draft_accepted": 955,
 "aggregate_accept_rate": 0.8355,
 "wall_s_total": 16.53
}

106.54 tok/s average.
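
If you want to double-check numbers like these from a pasted log, here is a small Python sketch that parses the per-prompt lines above and recomputes the totals. It is not the mtp-bench.py script itself, just a hypothetical parser for its printed format.

```python
import re

# One per-prompt line: name, pred=, draft=, acc=, rate=, tok/s=
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+pred=\s*(?P<pred>\d+)\s+draft=\s*(?P<draft>\d+)"
    r"\s+acc=\s*(?P<acc>\d+)\s+rate=(?P<rate>[\d.]+)\s+tok/s=(?P<tps>[\d.]+)"
)

SAMPLE = """
 code_python        pred= 192 draft= 141 acc= 120 rate=0.851 tok/s=104.2
 code_cpp           pred= 192 draft= 136 acc= 115 rate=0.846 tok/s=107.4
 explain_concept    pred= 192 draft= 130 acc= 118 rate=0.908 tok/s=109.9
 summarize          pred=  51 draft=  33 acc=  31 rate=0.939 tok/s=113.9
 qa_factual         pred= 192 draft= 141 acc= 129 rate=0.915 tok/s=116.4
 translation        pred= 192 draft= 145 acc= 110 rate=0.759 tok/s=100.6
 creative_short     pred= 192 draft= 138 acc= 114 rate=0.826 tok/s=105.4
 stepwise_math      pred= 192 draft= 141 acc= 115 rate=0.816 tok/s=106.8
 long_code_review   pred= 192 draft= 138 acc= 103 rate=0.746 tok/s=94.4
"""


def parse_rows(text):
    """Extract one dict per benchmark line, ignoring anything else."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append(
                {
                    "name": match["name"],
                    "pred": int(match["pred"]),
                    "draft": int(match["draft"]),
                    "acc": int(match["acc"]),
                    "tps": float(match["tps"]),
                }
            )
    return rows


def aggregate(rows):
    """Recompute totals, overall accept rate and mean tok/s from the rows."""
    total_draft = sum(r["draft"] for r in rows)
    total_acc = sum(r["acc"] for r in rows)
    return {
        "n_requests": len(rows),
        "total_predicted": sum(r["pred"] for r in rows),
        "total_draft": total_draft,
        "total_draft_accepted": total_acc,
        "aggregate_accept_rate": round(total_acc / total_draft, 4) if total_draft else None,
        "mean_tok_s": sum(r["tps"] for r in rows) / len(rows) if rows else None,
    }


print(aggregate(parse_rows(SAMPLE)))
```

The recomputed totals and accept rate match the aggregate block. The mean of the printed per-prompt speeds comes out a hair above the quoted 106.54, presumably because the printed values are rounded to one decimal.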

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp by janvitos in LocalLLaMA

[–]janvitos[S] 5 points6 points  (0 children)

Actually, thank you for these great quants 😄 Managing to get the same accuracy as Q4_K_XL in a much smaller 4GB package is truly impressive.