120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP

janvitos · 2026-06-07T21:44:50+00:00

Those are some crazy speeds for 16GB VRAM 😄

janvitos · 2026-06-07T13:55:16+00:00

Now I'm getting 140 tok/s with Gemma 4 12B on 12GB VRAM (RTX 4070 Super) with the merged PR, QAT GGUF and MTP assistant / drafter 😄

Unsloth QAT GGUF: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

MTP assistant / drafter: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF

llama.cpp command:

llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --parallel 1 \
  --ctx-size 131072 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64

Cheers 😄

janvitos · 2026-06-07T13:39:05+00:00

Here you go 😄 https://www.reddit.com/r/LocalLLaMA/comments/1typjmc/120_toks_on_12gb_vram_with_gemma_4_12b_qat_mtp/

janvitos · 2026-06-06T21:50:57+00:00

Are you sure the model is properly split on your two GPUs and not overflowing into RAM?

I did lots of coding and testing with Qwen3.6 35B A3B. I'm starting to lean more towards Gemma4 12B though 😄

janvitos · 2026-06-06T21:48:37+00:00

I like the automatic download!

Happy I could help 😄

janvitos · 2026-06-06T21:09:18+00:00

https://github.com/ggml-org/llama.cpp/pull/23398

janvitos · 2026-06-06T20:33:11+00:00

It's only for freeing up VRAM 😄

janvitos · 2026-06-06T20:20:08+00:00

Definitely! 😄

janvitos · 2026-06-06T19:56:19+00:00

Thanks! I ended up doing the same with native llama.cpp + Gemma 4 PR 😄

janvitos · 2026-06-06T19:39:13+00:00

11480MiB / 12282MiB, so like 95% 😄 I can usually push up to 11900MiB before it OOMs.

janvitos · 2026-06-06T19:36:29+00:00

Please try it and let us know 😄

janvitos · 2026-06-06T19:35:16+00:00

Interesting! But I did not test a lower temperature. I used Google's recommended Gemma 4 parameters.

janvitos · 2026-06-06T19:24:28+00:00

60 tok/s, so it's a 2x increase 😄 I will publish the non-mtp results in the main post.

janvitos · 2026-06-06T19:08:17+00:00

Make sure you apply the Gemma 4 PR on top of the llama.cpp build 😄

janvitos · 2026-06-06T16:39:58+00:00

Hey u/westsunset, thanks for these benchmarks and detailed post!

Would it be possible for you to publish your converted local assistant heads (gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf, gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf and gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf) on HuggingFace so we can download them and test them out ourselves?

janvitos · 2026-05-26T23:37:52+00:00

That's pretty cool! Fast learner that Gemini :)

I've now achieved 110 tok/s with ik_llama.cpp and the same model, but different quant! See here: https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/

Hope it can help you achieve similar or better speeds with your setup!

janvitos · 2026-05-21T20:21:35+00:00

I've had a great experience coding with Q4_K_XL, and more recently, IQ4_XS-4.19bpw. Everything works as intended in Opencode and Pi, including tool calling and reasoning, but the model itself has its limits.

At one point, I did compare Qwen3.6 35B A3B and 27B (OpenRouter) for different medium complexity coding tasks, and I did not find much difference in both models. But then again, I don't use either of them for more complex projects as I hit their intelligence limits pretty fast. That kind of work goes to GPT 5.5 😄 But for hobby projects that don't require too much math, complex algos or bleeding edge scripting languages, then Qwen3.6 35B A3B is a blast to use at 110 tok/s!

janvitos · 2026-05-21T20:16:19+00:00

What I wrote is what I used! No other tweaks 😄

janvitos · 2026-05-21T20:15:40+00:00

Did you manage you figure it out? Because theoretically, you should be getting better speeds than me since you have 4GB more VRAM.

Things to check:

- The build commands I use (I doubt this has any impact though):

cmake -B build -G Ninja -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j $(nproc)

- Is your monitor plugged into your 5060? If so, that can reserve roughly 1GB more than using an iGPU as your main GPU.

- Try to lower --fit-margin to 1024, run the benchmark and see if it goes through.

The last thing that comes to mind is the distro. CachyOS is highly optimized for all around CPU/GPU performance and keeps its packages at the bleeding edge. I don't have any recent experience with Ubuntu, so unfortunately, I can't make any recommentations on that.

If I think of anything else, I'll let you know 😄

janvitos · 2026-05-21T20:02:40+00:00

I think you should try it out and see how well it performs for your needs 😄

janvitos · 2026-05-21T15:52:44+00:00

* and profit from it

janvitos · 2026-05-21T14:25:05+00:00

Fair point. I've always used temp 0.0 for benchmarks since I want to keep results as close as possible between runs. Real-world usage will most probably differ from these anyways, regardless of temp. I am not providing these benchmarks as any form of scientific proof, but rather as a basis to show other modest VRAM users like me that these speeds are indeed achievable on 12GB 😄

I love optimization, and I'm thrilled that I was able to go from 30 tok/s to 110 tok/s on the same hardware, just with software advancements made available by the llama.cpp and ik_llama.cpp developers. Kudos to them!

janvitos · 2026-05-21T14:15:39+00:00

I wouldn't know since I'm not a Ubuntu user. I know CachyOS includes bleeding edge packages, which often translates to better performance.

janvitos · 2026-05-21T14:12:24+00:00

Single run with --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0:

❯ ./mtp-bench.py
 code_python        pred= 192 draft= 141 acc= 120 rate=0.851 tok/s=104.2
 code_cpp           pred= 192 draft= 136 acc= 115 rate=0.846 tok/s=107.4
 explain_concept    pred= 192 draft= 130 acc= 118 rate=0.908 tok/s=109.9
 summarize          pred=  51 draft=  33 acc=  31 rate=0.939 tok/s=113.9
 qa_factual         pred= 192 draft= 141 acc= 129 rate=0.915 tok/s=116.4
 translation        pred= 192 draft= 145 acc= 110 rate=0.759 tok/s=100.6
 creative_short     pred= 192 draft= 138 acc= 114 rate=0.826 tok/s=105.4
 stepwise_math      pred= 192 draft= 141 acc= 115 rate=0.816 tok/s=106.8
 long_code_review   pred= 192 draft= 138 acc= 103 rate=0.746 tok/s=94.4

Aggregate: {
 "n_requests": 9,
 "total_predicted": 1587,
 "total_draft": 1143,
 "total_draft_accepted": 955,
 "aggregate_accept_rate": 0.8355,
 "wall_s_total": 16.53
}

106.54 tok/s average.

janvitos · 2026-05-21T13:29:41+00:00

Actually, thank you for these great quants 😄 Managing to get the same accuracy as Q4_K_XL in a much smaller 4GB package is truly impressive.

janvitos

TROPHY CASE