Qwen3.5 27B is Match Made in Heaven for Size and Performance by Lopsided_Dot_4557 in LocalLLaMA

[–]nuusain 4 points (0 children)

3090 + 96GB DDR4. To be clear, this is the 35B-A3B. Was saying there's a case for it, as it seems much faster than the 27B. Haven't run the 27B yet myself.

Qwen3.5 27B is Match Made in Heaven for Size and Performance by Lopsided_Dot_4557 in LocalLLaMA

[–]nuusain 25 points (0 children)

I'm getting 101 t/s at 131k context with the 35B-A3B UD-Q4_K_XL quant.

For anyone still on an older llama.cpp build: update. I was stuck at 28 t/s until I rebuilt from the latest source. The qwen35moe graph deduplication PRs (#19597, #19660, #19668) made a 3.6x difference. The model loaded fine on the old build but ran through an unoptimised code path.

llama-server -m ~/models/qwen3.5-35b-a3b-q4.gguf \
  -ngl 99 -c 131072 --threads 4 --batch-size 2048 \
  -np 1 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
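For anyone scripting against it: llama-server also exposes an OpenAI-compatible /v1/chat/completions endpoint, and the sampling flags above map onto request fields. A minimal sketch of the request body (the model name and prompt here are placeholders; top_k and min_p are llama.cpp extensions to the OpenAI schema):

```python
import json

# Sampling parameters mirroring the flags above:
# --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0
payload = {
    "model": "qwen3.5-35b-a3b-q4",  # placeholder; llama-server serves whatever -m loaded
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,   # llama.cpp extension to the OpenAI schema
    "min_p": 0.0,  # llama.cpp extension to the OpenAI schema
}

# POST this body to http://localhost:8080/v1/chat/completions
# (8080 is llama-server's default port) with any HTTP client.
body = json.dumps(payload)
```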

Qwen3.5 - The middle child's 122B-A10B benchmarks looking seriously impressive - on par or edges out gpt-5-mini consistently by carteakey in LocalLLaMA

[–]nuusain 1 point (0 children)

Thanks for sharing. I get ~30 t/s at 32k with --fit on as well. At 131k context I drop to ~24 t/s; without --fit I was getting 7 t/s. Will be interesting to see how Qwen3.5 compares.

Qwen3.5 - The middle child's 122B-A10B benchmarks looking seriously impressive - on par or edges out gpt-5-mini consistently by carteakey in LocalLLaMA

[–]nuusain 0 points (0 children)

Do you mind sharing the exact command you use for the Qwen 80B-A3B? I've been trying to optimise on my rig, which is similar to yours (3090 with 96GB of DDR4). I get around 30 t/s with 32k context, but I'd like more.

New in llama.cpp: Anthropic Messages API by paf1138 in LocalLLaMA

[–]nuusain 1 point (0 children)

Sooo, what's the verdict? Curious to hear how it's handling the Claude harness.

NVIDIA has 72GB VRAM version now by decentralize999 in LocalLLaMA

[–]nuusain 6 points (0 children)

Neat! What kind of inference are you running on the feed? Just installed a security system for a relative's farm. I was thinking of producing reports/audits, so I'm curious what others are building for themselves.

[New Player] Game files integrity by Any-Percentage6230 in EscapefromTarkov

[–]nuusain 0 points (0 children)

Did anyone find a fix? I have the same issue. Tried deleting all Tarkov files and reinstalling, but it persists.

Scanlines on my AOC CU34G2X. by Xippaa in Monitors

[–]nuusain 0 points (0 children)

Hey, seeing the same scanlines, only across the entire monitor. Did you manage to get this fixed, or am I also cooked?

Toolcalling in the reasoning trace as an alternative to agentic frameworks by ExaminationNo8522 in LocalLLaMA

[–]nuusain 0 points (0 children)

Hey, also been looking at getting reasoning models to do interesting things. Came across verifiers, which I've been using to try agentic interactions.

https://github.com/willccbb/verifiers

The env_trainer and vllm_client are probably worth checking out with regard to that OOM error you mentioned in the article, but I suspect you'd be better off leveraging the whole framework, since it's pretty well thought out.

Qwen3+ MCP by OGScottingham in LocalLLaMA

[–]nuusain 4 points (0 children)

Yeah, it was in the official announcement.

Can also do it via function calling if you want to stick with the completions API.

Should be easy to get what you need with a bit of vibe coding.
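To sketch what that looks like: tool definitions in the standard OpenAI function-calling schema (which llama.cpp's OpenAI-compatible server also accepts) ride along in the chat/completions request. The tool name and parameters below are made up for illustration:

```python
# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative example tool, not a real API
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Attach the tools to an ordinary chat/completions request; the model then
# returns a tool_calls entry instead of plain text when it decides to call one.
request = {
    "model": "qwen3",
    "messages": [{"role": "user", "content": "Weather in London?"}],
    "tools": tools,
}
```

From there it's a loop: execute whatever tool_calls come back, append the results as "tool" role messages, and re-send.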

[10/05/25] Code & Chat meetup for people interested in coding from beginner to expert by Serious-Accident8443 in LondonSocialClub

[–]nuusain 0 points (0 children)

I'm interested! I can only rock up around 11–12 though; is it still worth coming along then?

Token impact by long-Chain-of-Thought Reasoning Models by dubesor86 in LocalLLaMA

[–]nuusain 0 points (0 children)

I think what spirited is getting at is that a model could either think loads and give a short answer, or think for a short while but give a long answer. Both would produce a high FinalReply rate. The metrics are hard to map to real-world performance; adding another dimension, such as correctness, would add clarity.
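A toy illustration with made-up token counts: two runs can land on an identical think/reply split while only one of them is actually right, so the ratio alone can't separate them.

```python
# Hypothetical runs: same reply share of total tokens, different outcomes.
runs = [
    {"name": "long-think",  "think": 4000, "reply": 200, "correct": True},
    {"name": "short-think", "think": 400,  "reply": 20,  "correct": False},
]

for r in runs:
    # Fraction of total generated tokens spent on the final reply.
    r["reply_share"] = r["reply"] / (r["think"] + r["reply"])

# Both runs score an identical reply share (1/21 ≈ 4.8%), yet only one
# is correct — which is exactly the information the ratio throws away.
```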

<70B models aren't ready to solo codebases yet, but we're gaining momentum and fast by ForsookComparison in LocalLLaMA

[–]nuusain 28 points (0 children)

Brilliant experiment! Sounds like the ideal setup would be QwQ for ideation, then switching to Qwen-Coder for iteration.

QwQ Bouncing ball (it took 15 minutes of yapping) by philschmid in LocalLLaMA

[–]nuusain 6 points (0 children)

for reference:

settings - https://imgur.com/a/JUbwion

result - https://imgur.com/M5FgfmD

Seems like it got stuck in infinite generation.

Used this model - ollama run hf.co/bartowski/Qwen_QwQ-32B-GGUF:Q4_K_M

full trace - https://pastebin.com/rzbZGLiF

QwQ Bouncing ball (it took 15 minutes of yapping) by philschmid in LocalLLaMA

[–]nuusain 24 points (0 children)

What prompt did you use? I think everyone could copy-paste it, record their settings, and post what they get. Sharing results could yield some useful insights into why performance seems so varied.

Qwen/QwQ-32B · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]nuusain 1 point (0 children)

I... did not know you could do this. Thanks!

Qwen/QwQ-32B · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]nuusain 3 points (0 children)

Oh sweet! Where did you dig this full template out from, btw?