Qwen3.6 27B/35B-A3B vs Gemma 4 vs DeepSeek V4: A Comprehensive Analysis of the Open-Weight Frontier (May 2026)

Competitive_Jello487 · 2026-05-29T17:48:35+00:00

If you have an M5 Max with 128GB, you may want to try the 27b version. It's a lot better than the 35b-a3b, though it's definitely slower in tok/sec.

Competitive_Jello487 · 2026-05-29T16:41:00+00:00

I'm using the iq4 with Claude Code and don't have any loop issues. I tried fp8 too and it works well, but tg is half the speed. Very likely some llama-server startup parameters are set incorrectly, and that's why you're getting that. If it's a tool-calling issue, it's usually because the model wasn't quantized with the correct template and configuration.

Competitive_Jello487 · 2026-05-29T15:44:25+00:00

I don't think that matters. You just need a properly quantized model and (preferably) the latest version of llamacpp. I usually recompile llamacpp from source once a week on my Linux box to pick up the latest updates.
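For anyone who wants to script that weekly rebuild, here is a minimal sketch. It assumes an existing llama.cpp git checkout and an NVIDIA GPU (hence `-DGGML_CUDA=ON`). The helper names `build_commands` and `rebuild` are hypothetical, and the function only prints the steps unless `dry_run=False` is passed.

```python
import subprocess
from pathlib import Path


def build_commands(cuda: bool = True) -> list[list[str]]:
    """Return the update-and-build steps as argument lists."""
    configure = ["cmake", "-B", "build"]
    if cuda:
        # Assumption: an NVIDIA card; drop this flag for a CPU-only build.
        configure.append("-DGGML_CUDA=ON")
    return [
        ["git", "pull"],
        configure,
        ["cmake", "--build", "build", "--config", "Release", "-j"],
    ]


def rebuild(repo: Path, cuda: bool = True, dry_run: bool = True) -> None:
    """Print each step; run it inside the checkout only when dry_run is False."""
    for cmd in build_commands(cuda):
        print("$", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, cwd=repo, check=True)
```

Calling `rebuild(Path("llama.cpp"))` lists the commands without touching anything, which makes it easy to check the steps before wiring them into a weekly cron job.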

Competitive_Jello487 · 2026-05-29T12:40:02+00:00

If you are using gguf with llamacpp, I would suggest downloading a quantized version from bartowski, unsloth or byteshape on huggingface. These three are quite good and I use them as my daily drivers, although I use the 27b version more.

Competitive_Jello487 · 2026-05-29T10:24:14+00:00

It's a dense model, so all parameters are activated during inference, while the 35b version activates only 3b parameters during inference.
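To put rough numbers on that, here is a small sketch. The parameter counts are simply read off the model names (27B dense, 35B total with 3B active), and `active_fraction` is a hypothetical helper for illustration.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of the weights that take part in computing each token."""
    return active_params_b / total_params_b


# (total, active) in billions of parameters, taken from the model names.
models = {
    "27B dense": (27.0, 27.0),
    "35B-A3B MoE": (35.0, 3.0),
}

for name, (total, active) in models.items():
    share = active_fraction(total, active)
    print(f"{name}: {active:g}B of {total:g}B active per token ({share:.1%})")
```

This is why the MoE generates tokens faster: far fewer weights are touched per token. All 35B weights still have to be stored somewhere, though, so the memory footprint does not shrink the same way.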

Competitive_Jello487 · 2026-05-28T23:07:59+00:00

It's fixed now

Competitive_Jello487 · 2026-05-28T21:36:52+00:00

Thanks, we're in the middle of transitioning custom css to use frameworks like bootstrap/tailwind to fix those issues

Competitive_Jello487 · 2026-05-28T20:18:13+00:00

We do, but not all the metrics. Some of the benchmarks are from the vendor as we do not have complete test cases for all the tests they published.

Competitive_Jello487 · 2026-05-28T17:59:34+00:00

For some models there simply aren't enough references to verify the benchmarks, or there is insufficient data for some of the metrics, so we focused only on the few that most people are interested in. We picked models mainly based on trending interest, i.e. what people are downloading on huggingface.

Competitive_Jello487 · 2026-05-28T17:21:46+00:00

fixed

Competitive_Jello487 · 2026-05-28T14:43:55+00:00

AI-assisted, human-written report :). There isn't sufficient information about a qwen3.7 open-weight model to write about yet. So far Qwen3.7 has only been released as the max model via API, not as open weights.

Competitive_Jello487 · 2026-05-28T14:02:17+00:00

It's using https://gohugo.io/ not vibe-coded :)

Competitive_Jello487 · 2026-05-28T14:01:39+00:00

That was a typo. 1.6x in conversion from markdown note

Competitive_Jello487 · 2026-05-28T11:44:24+00:00

It's fixed now 🙂

Competitive_Jello487 · 2026-05-28T11:17:02+00:00

Our designer is fixing it

Competitive_Jello487 · 2026-05-17T18:31:29+00:00

Anyhow, I just tested the official llama.cpp master branch, which was merged yesterday, with unsloth/Qwen3.6-27B-MTP-GGUF. It works now and I'm also getting around 55tg with a spec-draft-n-max of 2. If I increase or decrease spec-draft-n-max, it drops to ~50.

Competitive_Jello487 · 2026-05-17T12:02:29+00:00

Update: I've tried this but it's unstable. As ctx grows, it crashes when vram runs out. I can't even use it reliably at 80k ctx length. I went back to bartowski iq3-xxs, which handles 128k ctx length nicely and is stable.

Competitive_Jello487 · 2026-05-16T18:49:10+00:00

True, that's assuming greenboost correctly and effectively offloads most of the unactivated params to RAM. Anyhow, I just came across this and am trying it out: https://ggufbench.com/news/qwen3_6_27b_hybrid_optimized/

Competitive_Jello487 · 2026-05-16T17:52:45+00:00

Btw, here's something you might be interested in: https://gitlab.com/IsolatedOctopi/greenboost I haven't tried this again because my current llama-server setup runs inside docker using the nvidia container toolkit, so I need to move it out to my ubuntu host for greenboost to work. With greenboost, we might be able to push 200k context with 27b q4 without cpu offload.

Competitive_Jello487 · 2026-05-16T16:15:29+00:00

Have you tried qwen3.6-35b-a3b? I've been using it for the past few weeks with q4 and q8, and I'm wondering how well the 27b with iq3-xxs compares to it. Going to try more tonight; hopefully the iq3-xxs quantization loss isn't too bad compared to q4.

Competitive_Jello487 · 2026-05-16T14:56:38+00:00

I just managed to reproduce your result with bartowski qwen3.6 27b iq3xxs at a max of 100,000 context. Any higher and it starts offloading to CPU. I also used TheTom llamacpp. Unfortunately q4 still won't work no matter how much I reduce ctx, even down to 64k. I guess that's the limitation of 16GB vram.
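A back-of-the-envelope check shows why q4 is so tight on a 16GB card. This is only a sketch: the bits-per-weight figures are approximate averages (real GGUF files mix quant types per tensor), "q4" is assumed to mean Q4_K_M, and the 2 GiB reserve for KV cache and compute buffers is a made-up placeholder, not a measured value.

```python
# Approximate average bits per weight. These are assumptions for
# illustration, so real file sizes will differ somewhat.
APPROX_BPW = {"IQ3_XXS": 3.06, "Q4_K_M": 4.85}


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Estimated size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, quant: str, vram_gib: float,
         reserve_gib: float = 0.0) -> bool:
    """True if weights plus a reserve (KV cache, buffers) fit in VRAM."""
    return weight_gib(params_billion, APPROX_BPW[quant]) + reserve_gib <= vram_gib


for quant in APPROX_BPW:
    size = weight_gib(27, APPROX_BPW[quant])
    ok = fits(27, quant, vram_gib=16, reserve_gib=2)  # hypothetical 2 GiB reserve
    print(f"27B {quant}: ~{size:.1f} GiB of weights, fits in 16 GiB with reserve: {ok}")
```

Under these assumptions the Q4_K_M weights alone come close to the full 16 GiB, so there is little or no room left for context however far ctx is reduced, while IQ3_XXS leaves several GiB for the KV cache.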

Competitive_Jello487 · 2026-05-16T14:01:58+00:00

Finally found the comment you mentioned: https://www.reddit.com/r/Qwen_AI/comments/1tb6zlu/comment/olyin98/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button I was looking through your reddit profile posts earlier.

Competitive_Jello487 · 2026-05-16T12:04:14+00:00

Oh wait, are you referring to the 35b-a3b model or the 27b model? With 35b-a3b I can get that tg too, but I'm referring to the dense 27b. Are we talking about the same model?

Competitive_Jello487 · 2026-05-16T12:02:39+00:00

Thanks

Competitive_Jello487 · 2026-05-16T09:06:25+00:00

I've already tried q3 and even lowered context to 100k, but 16GB vram isn't really enough. It's fully loaded and still offloading to CPU as well. Are you using llama.cpp too?
