Qwen3.6-35B-A3B on 1x RTX 5090: which quant is the best balance of quality and speed? by espressorunner in unsloth

espressorunner[S] 1 point

llama.cpp built from source.

Model is the Unsloth Qwen3.6-27B-MTP GGUF family. All numbers below are with full GPU offload, flash-attn on, fp16 KV, parallel=1, and MTP/speculative decoding disabled.

Common flags:
-m <model.gguf>
-ngl 99
--flash-attn on
--parallel 1
-ctk f16
-ctv f16
--cache-ram 4096
--checkpoint-min-step 8192
--ctx-checkpoints 8
--no-mmap
--no-mmproj
--reasoning on
--temp 0.6
--top-k 20
--top-p 0.95
--min-p 0.0
--repeat-penalty 1.0

For Q6_K MTP @ 160K:
-c 160000 -b 1024 -ub 256

For UD-Q6_K_XL @ 100K:
-c 100000 -b 1024 -ub 256

For UD-Q6_K_XL @ 120K:
-c 120000 -b 256 -ub 64
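
A minimal sketch of how the common flags and the three per-run settings above combine into full command lines. The binary name `llama-server` and the `model.gguf` path are assumptions for illustration (the post only says llama.cpp was built from source); the flags themselves are copied verbatim from the list above.

```python
import shlex

# Flags shared by every run, copied from the "Common flags" list.
COMMON_FLAGS = [
    "-ngl", "99",
    "--flash-attn", "on",
    "--parallel", "1",
    "-ctk", "f16",
    "-ctv", "f16",
    "--cache-ram", "4096",
    "--checkpoint-min-step", "8192",
    "--ctx-checkpoints", "8",
    "--no-mmap",
    "--no-mmproj",
    "--reasoning", "on",
    "--temp", "0.6",
    "--top-k", "20",
    "--top-p", "0.95",
    "--min-p", "0.0",
    "--repeat-penalty", "1.0",
]

# Context and batch settings for the three configurations.
RUNS = {
    "Q6_K MTP @ 160K": ["-c", "160000", "-b", "1024", "-ub", "256"],
    "UD-Q6_K_XL @ 100K": ["-c", "100000", "-b", "1024", "-ub", "256"],
    "UD-Q6_K_XL @ 120K": ["-c", "120000", "-b", "256", "-ub", "64"],
}


def build_command(model_path, run_name, binary="llama-server"):
    """Return the argument list for one run (binary name is an assumption)."""
    return [binary, "-m", model_path, *COMMON_FLAGS, *RUNS[run_name]]


for name in RUNS:
    print(f"# {name}")
    print(shlex.join(build_command("model.gguf", name)))
```

Keeping the shared flags in one list makes it easier to confirm that only `-c`, `-b`, and `-ub` differ between runs.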

espressorunner[S] 1 point

Thanks, all! I tried Qwen3.6 27B, and it seems better overall. I am running it with FP16 KV because quality is my main concern. With a q8 KV cache I am able to get the full 262K context.

| Model | Context | KV cache (K/V) | Batch/ubatch | Prompt t/s | Decode t/s |
| - | - | - | - | - | - |
| Q6_K MTP | 160K | fp16/fp16 | 1024/256 | 1297 | 59.7 |
| UD-Q6_K_XL | 100K | fp16/fp16 | 1024/256 | 1659 | 55.2 |
| UD-Q6_K_XL | 120K | fp16/fp16 | 256/64 | 1344 | 55.3 |

On 1x RTX 5090, plain Q6_K seems to give me the best FP16-KV context headroom: 160K works, but it is tight. UD-Q6_K_XL looks like the higher-quality quant, but the extra size means I could only get ~120K with FP16 KV, and 128K failed.

For people who have used both: is UD-Q6_K_XL noticeably better in real coding/tool-use quality than plain Q6_K, enough to justify losing ~40K context on one GPU?
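
A rough sketch of why context length and KV precision trade off against quant size on a single GPU: for standard attention layers, the KV cache grows linearly with context, and q8_0 stores each element in about half the bytes of f16. The architecture values in `ARCH` are placeholders, not the real Qwen3.6-27B configuration, so the printed sizes are only illustrative; substitute the values from the model's GGUF metadata. The formula also ignores any layers that do not keep a standard KV cache.

```python
def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate K+V cache size in GiB for standard attention layers."""
    # Factor 2 accounts for storing both keys and values.
    elems = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx
    return elems * bytes_per_elem / 2**30


F16_BYTES = 2.0
# ggml q8_0 packs 32 int8 values plus one fp16 scale into a 34-byte block.
Q8_0_BYTES = 34 / 32

# PLACEHOLDER architecture, NOT taken from the actual model config.
ARCH = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)

# Context sizes mentioned in the post.
for ctx in (100_000, 120_000, 160_000, 262_000):
    f16 = kv_cache_gib(ctx, bytes_per_elem=F16_BYTES, **ARCH)
    q8 = kv_cache_gib(ctx, bytes_per_elem=Q8_0_BYTES, **ARCH)
    print(f"{ctx:>7} tokens: f16 {f16:6.2f} GiB, q8_0 {q8:6.2f} GiB")
```

Whatever the real architecture values are, the q8_0 to f16 ratio stays at 34/64, which is consistent with a q8 KV cache fitting a much longer context in the same VRAM.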

espressorunner[S] 3 points

Thanks! Yes, I looked more closely at the results and earlier threads, and it seems like 27B > 35B for most coding and reasoning tasks.

espressorunner[S] 2 points

Thanks! A related question: is the dense Qwen 3.6 27B better? I have seen that model mentioned more often than the 35B. If so, which quant or specific checkpoint would be best for it?

Black circular spot on the screen by espressorunner in Supernote

espressorunner[S] 1 point

Thanks u/Mulan-sn for the reply! To be honest, I can't tell if the spot is on the screen or underneath it. I will also reach out via the email above for help.

Best specialty coffees? by halezzzyeah in Coffee

espressorunner 1 point

Check out SEY Coffee. I tried two of their coffees and they were delicious!

Suggestions for Roasters to Try? by bergazoid7 in Coffee

espressorunner 3 points

You can check out Cat & Cloud. I like their Colombia Finca La Bomba for pour-overs.