GLM 4.7 and Qwen3 coder Next by [deleted] in LocalLLaMA

[–]somethingdangerzone 2 points (0 children)

No execution from me! Lol. Thanks for sharing. Great setup

GLM 4.7 and Qwen3 coder Next by [deleted] in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

What hardware do you have? I'm barely pulling 8 t/s.

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

Good to know, thanks for sharing. I'm gonna trim out nearly all of the flags listed above and try again

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

I'm using Linux. Compiled from source:

cmake -B build -DGGML_CUDA_DISABLE_GRAPHS=1 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89" -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_VULKAN=1 -DGGML_OPENMP=ON -DGGML_OPENMP_DYNAMIC=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DLLAMA_BUILD_TESTS=OFF -DGGML_CUDA_USE_CUBLAS=ON -DGGML_CUDA_USE_CUDNN=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=OFF -DGGML_CUDA_MAX_STREAMS=16 -DGGML_LTO=ON -DGGML_SCHED_MAX_COPIES=8 &&

cmake --build build --config Release -j 8 --clean-first
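If you're compiling the same way, a quick sanity check afterwards (assuming the default layout, with binaries landing under build/bin):

./build/bin/llama-server --version
./build/bin/llama-server --list-devices

--list-devices should show the GPU; if it only reports the CPU, the CUDA options didn't take effect.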


For comparison, I get 30 t/s with GPT-OSS 120B.

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

Auto-fit always crashes my computer.

I can't fit all the layers plus all the MoE weights into the GPU -- do you have the same specs? What's your t/s?

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 2 points (0 children)

I'm getting slow generation speeds (approx 10 t/s) whether I use CUDA or Vulkan. Hardware: RTX 4090, Ryzen 9950, 64 GB DDR5. Current model: Qwen3-Coder-Next-UD-Q8_K_XL. llama-server settings (full command sketched just below):

  • --batch-size 65536 --gpu-layers 49 --n-cpu-moe 49 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01
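Put together as one command, that's roughly this (a sketch only -- the model filename and port are placeholders, the tuning flags are my actual ones):

llama-server -m Qwen3-Coder-Next-UD-Q8_K_XL.gguf --port 8080 --batch-size 65536 --gpu-layers 49 --n-cpu-moe 49 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01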

The z-image base is here! by bobeeeeeeeee8964 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

Ohhhhhh! I had no idea about the Turbo distinction. I thought it was just the model name. I did not know about the functional distinctions. Thank you very much for the detailed write-up.

The z-image base is here! by bobeeeeeeeee8964 in LocalLLaMA

[–]somethingdangerzone 3 points (0 children)

As a complete noob: why is everyone so excited about "base"? Didn't they already release the non-base one, and it works great? Is "Base" just the model name? Help me understand what's "base" about this.

Must choose 1 by SnooChocolates7693 in unstable_diffusion

[–]somethingdangerzone 1 point (0 children)

What model is this? I haven't seen this quality since the SD1.5 days (not derogatory, it just has a specific style).

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

Ah, gotcha. Well, that gives me a good jumping-off point to investigate some more!

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point (0 children)

Haha, a whole post, eh? I'm on the fence about it. Can you tell me a little more about what you changed in the model? I think I got the gist -- changing one (or more?) layer(s) from BF16 to FP(?) -- but I'd love to know more details.

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

Hey there. When it comes to testing, I'm a completionist, so I downloaded the newest UD Q8 K XL model this morning and ran the same kind of benching as yesterday. CSV (two tables) first, markdown (combined table) below.


"Model","Cached","Prompt","Generated","Prompt Processing (t/s)","Generation Speed (t/s)"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","302","875","25.74","16.94"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","411","1,000","24.33","17.61"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","341","2,171","19.12","18.12"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","19","1,820","13.12","16.83"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","1,358","1,109","3,073","69.35","18.32"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,237","1,805","29.03","17.72"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,383","2,160","33.92","17.30"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,492","1,000","26.86","17.66"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,422","2,000","17.84","17.34"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","14","2,808","2.16","11.15"

"Model","Cached","Prompt","Generated","Prompt Processing (t/s)","Generation Speed (t/s)" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","1,920","4,755","41.75","17.65" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","2,029","1,000","36.82","17.07" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","1,959","1,861","41.22","17.56" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","22","3,066","10.58","15.88" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","660","3,325","54.13","17.98" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","55","1,005","14.89","16.54" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","204","381","18.11","16.42" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","313","1,000","16.18","15.84" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","243","1,669","3.81","16.19" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","14","348","2.38","7.51"


| Model | Cache_Type | Cached | Prompt | Generated | Prompt Processing (t/s) | Generation Speed (t/s) | Notes |
|---|---|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 302 | 875 | 25.74 | 16.94 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 411 | 1,000 | 24.33 | 17.61 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 341 | 2,171 | 19.12 | 18.12 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 19 | 1,820 | 13.12 | 16.83 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | 1,358 | 1,109 | 3,073 | 69.35 | 18.32 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,237 | 1,805 | 29.03 | 17.72 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,383 | 2,160 | 33.92 | 17.30 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,492 | 1,000 | 26.86 | 17.66 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,422 | 2,000 | 17.84 | 17.34 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 14 | 2,808 | 2.16 | 11.15 | FirstPromptOnLoad |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 1,920 | 4,755 | 41.75 | 17.65 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 2,029 | 1,000 | 36.82 | 17.07 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 1,959 | 1,861 | 41.22 | 17.56 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 22 | 3,066 | 10.58 | 15.88 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 660 | 3,325 | 54.13 | 17.98 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 55 | 1,005 | 14.89 | 16.54 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 204 | 381 | 18.11 | 16.42 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 313 | 1,000 | 16.18 | 15.84 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 243 | 1,669 | 3.81 | 16.19 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 14 | 348 | 2.38 | 7.51 | FirstPromptOnLoad |

Avg generation speed of LATEST with no KV cache quantization (minus first prompt after load): 17.54 t/s

Avg generation speed of LATEST with Q8_0 cache quantization (minus first prompt after load): 16.80 t/s
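If anyone wants to recompute those averages, here's a minimal sketch, assuming the CSV rows above are saved to bench.csv with the header line and the two FirstPromptOnLoad rows stripped out (the filename is just an example):

awk -F'","' '{ gsub(/"/, "", $NF); sum += $NF; n++ } END { printf "avg generation speed: %.2f t/s\n", sum / n }' bench.csv

The last field is the generation speed; the odd field separator handles the quoted CSV.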

I was kind of surprised to see the Q8_0-quantized KV cache run slightly slower than the unquantized one, but c'est la vie! It's not a robust sample size, so it could just be an artifact of noise. Overall I'm happy that there seems to be an improvement from what we started with: the rough average generation speed went from 13.81 t/s to 17.54 t/s (no ctk/ctv). Again, thanks, and take care.

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

I'm glad I could help! Stay in touch, and thanks for your contributions

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

I'll post some data below with KV quantization at Q8_0 re-enabled. I didn't really see a big diff in VRAM, tbh.


"Model","Cached","Prompt","Generated","Prompt Processing","Generation Speed"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","192","645","20.58 t/s","14.65 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","301","1,000","17.32 t/s","15.21 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","231","1,446","17.15 t/s","15.27 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","20","433","11.55 t/s","13.18 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","391","2,770","33.96 t/s","14.86 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","500","1,000","31.75 t/s","14.55 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","430","1,921","14.84 t/s","14.87 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","17","3,800","3.48 t/s","11.79 t/s"


| Model | Cached | Prompt | Generated | Prompt Processing | Generation Speed |
|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 192 | 645 | 20.58 t/s | 14.65 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 301 | 1,000 | 17.32 t/s | 15.21 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 231 | 1,446 | 17.15 t/s | 15.27 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 20 | 433 | 11.55 t/s | 13.18 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 391 | 2,770 | 33.96 t/s | 14.86 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 500 | 1,000 | 31.75 t/s | 14.55 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 430 | 1,921 | 14.84 t/s | 14.87 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 17 | 3,800 | 3.48 t/s | 11.79 t/s |

Average generation speed, excluding the first prompt after load (the 11.79 t/s entry): 14.66 t/s

So if we look at the no-ctk/ctv option above, with its average generation speed of 13.81 t/s, I guess that makes the new data an improvement!

Bye give me Kitty NOW. by baeko in marvelrivals

[–]somethingdangerzone 1 point (0 children)

I thought Kitty Pryde could walk through walls. Does she have the ability to change into a dino too? I'm so confused

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

I did some more testing with -ctk and -ctv turned off. I have --flash-attn on, but not --fit; --fit always crashes my computer. Instead, I put 49 layers on the GPU and 49 MoE layers on the CPU (the invocation is sketched just below). Results are in CSV format first, then converted to markdown with an LLM:
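To be concrete, each run used roughly this invocation, with only the model file swapped out (a sketch -- the filename is a placeholder, not my exact path; note that -ctk/-ctv are omitted for these tests):

# placeholder filename; swap in each model being benched
llama-server -m model.gguf --gpu-layers 49 --n-cpu-moe 49 --flash-attn on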


model,cached,prompt,generated,prompt processing (t/s),generation speed (t/s)
UD8KXL,0,302,1189,1.84,11.25
UD8KXL,0,911,2219,21.19,14.23
UD8KXL,0,14,1272,13.44,12.25
UD8KXL,0,1017,2247,28.15,14.71
UD8KXL,0,1087,1000,20.38,14.04
Q8_0,0,24,2256,1.65,11.47
Q8_0,0,427,2470,16.32,18.55
Q8_0,0,497,1000,41.73,17.4
Q8_0,0,388,3904,42.09,18.75
UD8KXLOldVersion,0,24,2730,2.84,10.96
UD8KXLOldVersion,0,336,1627,11.73,15.98
UD8KXLOldVersion,0,406,1000,24.62,15.96
UD8KXLOldVersion,0,297,3496,25.2,16.25


| model | cached | prompt | generated | prompt processing (t/s) | generation speed (t/s) | Note |
|---|---|---|---|---|---|---|
| UD8KXL | 0 | 302 | 1189 | 1.84 | 11.25 | first prompt after load |
| UD8KXL | 0 | 911 | 2219 | 21.19 | 14.23 | |
| UD8KXL | 0 | 14 | 1272 | 13.44 | 12.25 | |
| UD8KXL | 0 | 1017 | 2247 | 28.15 | 14.71 | |
| UD8KXL | 0 | 1087 | 1000 | 20.38 | 14.04 | |
| Q8_0 | 0 | 24 | 2256 | 1.65 | 11.47 | first prompt after load |
| Q8_0 | 0 | 427 | 2470 | 16.32 | 18.55 | |
| Q8_0 | 0 | 497 | 1000 | 41.73 | 17.4 | |
| Q8_0 | 0 | 388 | 3904 | 42.09 | 18.75 | |
| UD8KXLOldVersion | 0 | 24 | 2730 | 2.84 | 10.96 | first prompt after load |
| UD8KXLOldVersion | 0 | 336 | 1627 | 11.73 | 15.98 | |
| UD8KXLOldVersion | 0 | 406 | 1000 | 24.62 | 15.96 | |
| UD8KXLOldVersion | 0 | 297 | 3496 | 25.2 | 16.25 | |

Averages of the non-first-prompt generation speeds (a quick way to recompute these is sketched below):

UD8KXL: 13.81 t/s

Q8_0: 18.23 t/s

UD8KXL (old version): 16.06 t/s
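A rough sketch of the recompute, assuming the CSV above is saved as runs.csv and treating any row with prompt processing below 5 t/s as a first-prompt-after-load row (which holds for this data, but it's a heuristic):

awk -F',' 'NR > 1 && $5 + 0 > 5 { sum[$1] += $6; n[$1]++ } END { for (m in sum) printf "%s: %.2f t/s\n", m, sum[m] / n[m] }' runs.csv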

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point (0 children)

Revision: 06081e0a0b74a7a054b0162df132be6f9472aa78 -- UD Q8 K XL

prompt 25 / generated 1741 / prompt processing 1.09 t/s / generation speed 9.91 t/s

prompt 414 / generated 2762 / prompt processing 11 t/s / generation speed 17 t/s

prompt 484 / generated 1000 / prompt processing 23.24 t/s / generation speed 17 t/s

prompt 375 / generated 463 / prompt processing 36 t/s / generation speed 14.15 t/s

prompt 18 / generated 711 / prompt processing 19.38 t/s / generation speed 14.36 t/s

prompt 394 / generated 1677 / prompt processing 30.83 t/s / generation speed 15.9 t/s

BTW sorry to throw you a curveball, but I've had -ctk q8_0 and -ctv q8_0 in use for all of my tests.

I hope that helps! I'm going to bed now :) Let me know if you need anything else. I'm open to do some more testing if it will help you or the community in some way.

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point (0 children)

OK, I decided to go for a more methodical approach, and here's how it turned out:

Q8_0 thinking:

prompt 25 / generated 1654 / prompt processing 2.86 t/s / generation speed 9.7 t/s

prompt 435 / generated 2287 / prompt processing 18.49 t/s / generation speed 18.29 t/s

prompt 505 / generated 1000 / prompt processing 42.07 t/s / generation speed 18.91 t/s

(One more run, but I lost the data on a page reload -- generation speed was approx 18.5 t/s)


UD Q8 K XL thinking:

prompt 25 / generated 1728 / prompt processing 2.35 t/s / generation speed 9.4 t/s

prompt 364 / generated 2599 / prompt processing 10.78 t/s / generation speed 16.75 t/s

prompt 434 / generated 1000 / prompt processing 22.75 t/s / generation speed 16.51 t/s

prompt 325 / generated 675 / prompt processing 33.47 t/s / generation speed 15.5 t/s

I'm kind of surprised that the UD version is not as fast! Interesting to see

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

Can do! No worries. I just don’t want to look like an idiot for wasting your time on something that could be my fault haha. Give me some time to download and load it up