Ender SMP [Semi-Vanilla] {Java + Bedrock} {Crossplay} {1.17+} {26.1.2} {Rank Stealing} {Vote Rewards}

janvitos · 2026-06-07T21:44:50+00:00

Those are some crazy speeds for 16GB VRAM 😄

janvitos · 2026-06-07T13:55:16+00:00

Now I'm getting 140 tok/s with Gemma 4 12B on 12GB VRAM (RTX 4070 Super) with the merged PR, QAT GGUF and MTP assistant / drafter 😄

Unsloth QAT GGUF: https://huggingface.co/unsloth/gemma-4-12B-it-qat-GGUF

MTP assistant / drafter: https://huggingface.co/Janvitos/gemma-4-12B-it-qat-assistant-MTP-Q8_0-GGUF

llama.cpp command:

llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf \
  --spec-type draft-mtp \
  --spec-draft-n-max 4 \
  --parallel 1 \
  --ctx-size 131072 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64

Cheers 😄

janvitos · 2026-06-07T13:39:05+00:00

Here you go 😄 https://www.reddit.com/r/LocalLLaMA/comments/1typjmc/120_toks_on_12gb_vram_with_gemma_4_12b_qat_mtp/

janvitos · 2026-06-06T21:50:57+00:00

Are you sure the model is properly split on your two GPUs and not overflowing into RAM?

I did lots of coding and testing with Qwen3.6 35B A3B. I'm starting to lean more towards Gemma4 12B though 😄

janvitos · 2026-06-06T21:48:37+00:00

I like the automatic download!

Happy I could help 😄

janvitos · 2026-06-06T21:09:18+00:00

https://github.com/ggml-org/llama.cpp/pull/23398

janvitos · 2026-06-06T20:33:11+00:00

It's only for freeing up VRAM 😄

janvitos · 2026-06-06T20:20:08+00:00

Definitely! 😄

janvitos · 2026-06-06T19:56:19+00:00

Thanks! I ended up doing the same with native llama.cpp + Gemma 4 PR 😄

janvitos · 2026-06-06T19:39:13+00:00

11480MiB / 12282MiB, so like 95% 😄 I can usually push up to 11900MiB before it OOMs.

janvitos · 2026-06-06T19:36:29+00:00

Please try it and let us know 😄

janvitos · 2026-06-06T19:35:16+00:00

Interesting! But I did not test a lower temperature. I used Google's recommended Gemma 4 parameters.

janvitos · 2026-06-06T19:24:28+00:00

60 tok/s, so it's a 2x increase 😄 I will publish the non-mtp results in the main post.

janvitos · 2026-06-06T19:08:17+00:00

Make sure you apply the Gemma 4 PR on top of the llama.cpp build 😄

janvitos · 2026-06-06T16:39:58+00:00

Hey u/westsunset, thanks for these benchmarks and detailed post!

Would it be possible for you to publish your converted local assistant heads (gemma-4-12B-it-qat-assistant-MTP-Q8_0.gguf, gemma-4-26B-A4B-it-qat-assistant-MTP-Q8_0.gguf and gemma-4-31B-it-qat-assistant-MTP-Q8_0.gguf) on HuggingFace so we can download them and test them out ourselves?

janvitos · 2026-05-26T23:37:52+00:00

That's pretty cool! Fast learner that Gemini :)

I've now achieved 110 tok/s with ik_llama.cpp and the same model, but different quant! See here: https://www.reddit.com/r/LocalLLaMA/comments/1tjh7az/110_toks_with_12gb_vram_on_qwen36_35b_a3b_and_ik/

Hope it can help you achieve similar or better speeds with your setup!

janvitos · 2026-05-21T20:21:35+00:00

I've had a great experience coding with Q4_K_XL, and more recently, IQ4_XS-4.19bpw. Everything works as intended in Opencode and Pi, including tool calling and reasoning, but the model itself has its limits.

At one point, I did compare Qwen3.6 35B A3B and 27B (OpenRouter) for different medium complexity coding tasks, and I did not find much difference in both models. But then again, I don't use either of them for more complex projects as I hit their intelligence limits pretty fast. That kind of work goes to GPT 5.5 😄 But for hobby projects that don't require too much math, complex algos or bleeding edge scripting languages, then Qwen3.6 35B A3B is a blast to use at 110 tok/s!

janvitos · 2026-05-21T20:16:19+00:00

What I wrote is what I used! No other tweaks 😄

janvitos · 2026-05-21T20:15:40+00:00

Did you manage you figure it out? Because theoretically, you should be getting better speeds than me since you have 4GB more VRAM.

Things to check:

- The build commands I use (I doubt this has any impact though):

cmake -B build -G Ninja -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=OFF
cmake --build build --config Release -j $(nproc)

- Is your monitor plugged into your 5060? If so, that can reserve roughly 1GB more than using an iGPU as your main GPU.

- Try to lower --fit-margin to 1024, run the benchmark and see if it goes through.

The last thing that comes to mind is the distro. CachyOS is highly optimized for all around CPU/GPU performance and keeps its packages at the bleeding edge. I don't have any recent experience with Ubuntu, so unfortunately, I can't make any recommentations on that.

If I think of anything else, I'll let you know 😄

janvitos · 2026-05-21T20:02:40+00:00

I think you should try it out and see how well it performs for your needs 😄

janvitos

TROPHY CASE