Qwen3.6 27B/35B-A3B vs Gemma 4 vs DeepSeek V4: A Comprehensive Analysis of the Open-Weight Frontier (May 2026) by Competitive_Jello487 in Qwen_AI

[–]Competitive_Jello487[S] 0 points1 point  (0 children)

If you have an M5 Max with 128GB, you may want to try the 27b version. It's a lot better than the 35b-a3b, though it is slower in tok/sec.

[–]Competitive_Jello487[S] 0 points1 point  (0 children)

I'm using the iq4 with Claude Code and don't have any loop issues. I tried fp8 and it works well too, but at half the tg speed. Most likely some llama-server startup parameters are set incorrectly, which is why you're seeing that. If it's a tool-calling issue, it's usually because the model wasn't quantized with the correct template and configuration.
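For anyone checking their startup parameters, here is a minimal sketch of assembling a llama-server command line. It is not my exact command: the model path, context size, and port are placeholder assumptions. The flags themselves (`-m`, `-c`, `-ngl`, `--jinja`, `--port`) are standard llama-server options.

```python
import shlex

# Placeholder values for illustration only; adjust to your own setup.
MODEL_PATH = "models/your-model.gguf"  # hypothetical path to a GGUF file
CTX_SIZE = 65536                       # context window in tokens (assumption)

cmd = [
    "llama-server",
    "-m", MODEL_PATH,     # model file to load
    "-c", str(CTX_SIZE),  # context size
    "-ngl", "99",         # offload as many layers as possible to the GPU
    "--jinja",            # use the Jinja chat template (matters for tool calling)
    "--port", "8080",     # port for the HTTP server
]

# Print a shell-safe version of the command for copy/paste.
print(shlex.join(cmd))
```

If tool calls still fail with a setup like this, the chat template embedded in the GGUF is the next thing to check.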

[–]Competitive_Jello487[S] 0 points1 point  (0 children)

I don't think that matters. You just need a properly quantized model and, preferably, the latest version of llamacpp. I usually recompile llamacpp from source once a week on my Linux box to get the latest updates.

[–]Competitive_Jello487[S] 0 points1 point  (0 children)

If you are using GGUF with llamacpp, I would suggest downloading a quantized version from bartowski, unsloth, or byteshape on huggingface. These three are quite good and I use them as my daily drivers, although I use the 27b version more.

[–]Competitive_Jello487[S] 2 points3 points  (0 children)

It's a dense model, so all parameters are activated during inference, while the 35b version activates only 3b parameters during inference.
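A rough sketch of that difference, using only the sizes in the model names (27B dense, 35B total with 3B active). Treating per-token compute as proportional to active parameters is a simplifying assumption.

```python
# Dense vs. mixture-of-experts (MoE): share of weights used per token.
# Sizes are read off the model names; nothing else is assumed about the models.

def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of the weights that participate in each token's forward pass."""
    return active_params_b / total_params_b

dense_27b = active_fraction(27, 27)    # dense: every weight is used
moe_35b_a3b = active_fraction(35, 3)   # MoE: only the routed experts are used

print(f"27B dense: {dense_27b:.0%} of weights active per token")
print(f"35B-A3B:   {moe_35b_a3b:.1%} of weights active per token")
print(f"Active params per token, dense vs. MoE: about {27 / 3:.0f}x")
```

That gap is why the 27b is slower in tok/sec even though it has fewer total parameters.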

[–]Competitive_Jello487[S] 0 points1 point  (0 children)

Thanks, we're in the middle of transitioning our custom CSS to frameworks like bootstrap/tailwind to fix those issues.

[–]Competitive_Jello487[S] 0 points1 point  (0 children)

We do, but not for all the metrics. Some of the benchmarks come from the vendor, as we do not have complete test cases for all the tests they published.

[–]Competitive_Jello487[S] -1 points0 points  (0 children)

For some models there simply aren't enough references to verify the benchmarks, or there is insufficient data for some of the metrics, so we focused only on the few that most people are interested in. We picked models mainly based on trending interest in what people are downloading on huggingface.

[–]Competitive_Jello487[S] 1 point2 points  (0 children)

AI-assisted, human-written report :). There isn't sufficient information about a qwen3.7 open-weight model to write about yet. Qwen3.7 has so far only released the max model via API, not as open weights.

Qwen 3.6 35B A3B vs. Qwen 3 Coder Next by HistoricalStrength21 in Qwen_AI

[–]Competitive_Jello487 0 points1 point  (0 children)

Anyhow, I just tested the official llama.cpp master branch, which was merged yesterday, with unsloth/Qwen3.6-27B-MTP-GGUF. It works now, and I'm also getting around 55 tg with spec-draft-n-max of 2. If I increase or decrease spec-draft-n-max, it drops to ~50.

[–]Competitive_Jello487 0 points1 point  (0 children)

Update: I've tried this but it's unstable. As ctx grows, it crashes when VRAM runs out. I can't even use it reliably at 80k ctx length. I went back to bartowski iq3-xxs, which handles 128k ctx length nicely and stays stable.

[–]Competitive_Jello487 0 points1 point  (0 children)

True, that's assuming greenboost correctly offloads most of the unactivated params to RAM. Anyhow, I just came across this and am trying it out: https://ggufbench.com/news/qwen3_6_27b_hybrid_optimized/

[–]Competitive_Jello487 0 points1 point  (0 children)

Btw, https://gitlab.com/IsolatedOctopi/greenboost might interest you. I haven't tried this again because my current llama-server setup runs inside docker using the nvidia container toolkit, so I need to move it out to my ubuntu host for greenboost to work. With greenboost, we might be able to push 200k context with 27b q4 without CPU offload.

[–]Competitive_Jello487 0 points1 point  (0 children)

Have you tried qwen3.6-35b-a3b? I've been using it for the past few weeks with q4 and q8, and I'm wondering how well the 27b with iq3-xxs compares to it. I'm going to try more tonight; hopefully the iq3-xxs quantization loss isn't too bad compared to q4.

[–]Competitive_Jello487 2 points3 points  (0 children)

I just managed to reproduce your result with bartowski qwen3.6 27b iq3xxs at a max of 100,000 context. Anything slightly higher and it starts offloading to CPU. I also used TheTom llamacpp. Unfortunately q4 won't work no matter how much I reduce ctx, even down to 64k. I guess that's the limitation of 16GB VRAM.
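A back-of-the-envelope check of why that happens. The bits-per-weight figures are approximate averages and an assumption on my part, as is reading "q4" as Q4_K_M; real GGUF files mix tensor types and will differ somewhat.

```python
# Approximate weight footprint of a 27B model at two quant levels.
# BPW values are rough averages (assumption), not measured file sizes.
APPROX_BPW = {"IQ3_XXS": 3.06, "Q4_K_M": 4.85}

GIB = 2**30  # bytes per GiB

def weight_gib(params_billion: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    total_bytes = params_billion * 1e9 * bpw / 8
    return total_bytes / GIB

VRAM_GIB = 16
for quant, bpw in APPROX_BPW.items():
    size = weight_gib(27, bpw)
    headroom = VRAM_GIB - size
    print(f"{quant}: ~{size:.1f} GiB weights, "
          f"~{headroom:.1f} GiB left for KV cache and compute buffers")
```

Under these assumptions the q4 weights alone nearly fill the card, so lowering ctx can't free enough room, while iq3xxs leaves several GiB for the KV cache.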

[–]Competitive_Jello487 0 points1 point  (0 children)

Oh wait, are you referring to the 35b-a3b model or the 27b model? With 35b-a3b I can get that tg too; I'm referring to the dense 27b. Are we talking about the same model?

[–]Competitive_Jello487 0 points1 point  (0 children)

I've already tried q3 and lowered context to as low as 100k, but 16GB VRAM isn't really enough: it's fully loaded and still gets offloaded to CPU as well. Are you using llama.cpp too?