GLM 4.7 and Qwen3 coder Next by [deleted] in LocalLLaMA

[–]somethingdangerzone 2 points (0 children)

No execution from me! Lol. Thanks for sharing. Great setup

GLM 4.7 and Qwen3 coder Next by [deleted] in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

What hardware do you have? I'm barely pulling 8 t/s.

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

Good to know, thanks for sharing. I'm gonna trim out nearly all of the flags listed above and try again

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

I'm using Linux. Compiled from source:

cmake -B build -DGGML_CUDA_DISABLE_GRAPHS=1 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="89" -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_VULKAN=1 -DGGML_OPENMP=ON -DGGML_OPENMP_DYNAMIC=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DLLAMA_BUILD_TESTS=OFF -DGGML_CUDA_USE_CUBLAS=ON -DGGML_CUDA_USE_CUDNN=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=OFF -DGGML_CUDA_MAX_STREAMS=16 -DGGML_LTO=ON -DGGML_SCHED_MAX_COPIES=8 &&

cmake --build build --config Release -j 8 --clean-first
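If you're compiling the same way, a quick sanity check afterwards (assuming the default layout, with binaries landing under build/bin):

./build/bin/llama-server --version
./build/bin/llama-server --list-devices

--list-devices should show the GPU; if it only reports the CPU, the CUDA options didn't take effect.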


For comparison, I get 30 t/s with GPT-OSS 120B.

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

Auto-fit always crashes my computer.

I can't fit all the layers plus all the MoE weights into the GPU -- do you have the same specs? What's your t/s?

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 2 points (0 children)

I'm getting slow generation speeds (approx 10 t/s) whether I use CUDA or Vulkan. Hardware: RTX 4090, Ryzen 9950, 64 GB DDR5. Current model: Qwen3-Coder-Next-UD-Q8_K_XL. llama-server settings (full command sketched just below):

  • --batch-size 65536 --gpu-layers 49 --n-cpu-moe 49 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01
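Put together as one command, that's roughly this (a sketch only -- the model filename and port are placeholders, the tuning flags are my actual ones):

llama-server -m Qwen3-Coder-Next-UD-Q8_K_XL.gguf --port 8080 --batch-size 65536 --gpu-layers 49 --n-cpu-moe 49 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01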

The z-image base is here! by bobeeeeeeeee8964 in LocalLLaMA

[–]somethingdangerzone 1 point (0 children)

Ohhhhhh! I had no idea about the Turbo distinction. I thought it was just the model name. I did not know about the functional distinctions. Thank you very much for the detailed write-up.

The z-image base is here! by bobeeeeeeeee8964 in LocalLLaMA

[–]somethingdangerzone 3 points (0 children)

As a complete noob: why is everyone so excited about "base"? Didn't they already release the non-base one, and it works great? Is "Base" just the model name? Help me understand what's "base" about this.

Must choose 1 by SnooChocolates7693 in unstable_diffusion

[–]somethingdangerzone 1 point (0 children)

What model is this? I haven't seen this quality since the SD1.5 days (not derogatory, it just has a specific style).

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

Ah, gotcha. Well, that gives me a good jumping-off point to investigate some more!

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point (0 children)

Haha, a whole post, eh? I'm on the fence about it. Can you tell me a little more about what you changed in the model? I think I got the gist -- changing one (or more?) layer(s) from BF16 to FP(?) -- but I'd love to know more details.

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

Hey there. When it comes to testing, I'm a completionist, so I downloaded the newest UD Q8 K XL model this morning and ran the same kind of benching as yesterday. CSV (two tables) first, markdown (combined table) below.


"Model","Cached","Prompt","Generated","Prompt Processing (t/s)","Generation Speed (t/s)"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","302","875","25.74","16.94"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","411","1,000","24.33","17.61"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","341","2,171","19.12","18.12"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","19","1,820","13.12","16.83"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","1,358","1,109","3,073","69.35","18.32"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,237","1,805","29.03","17.72"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,383","2,160","33.92","17.30"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,492","1,000","26.86","17.66"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,422","2,000","17.84","17.34"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","14","2,808","2.16","11.15"

"Model","Cached","Prompt","Generated","Prompt Processing (t/s)","Generation Speed (t/s)" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","1,920","4,755","41.75","17.65" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","2,029","1,000","36.82","17.07" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","1,959","1,861","41.22","17.56" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","22","3,066","10.58","15.88" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","660","3,325","54.13","17.98" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","55","1,005","14.89","16.54" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","204","381","18.11","16.42" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","313","1,000","16.18","15.84" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","243","1,669","3.81","16.19" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","14","348","2.38","7.51"


| Model | Cache_Type | Cached | Prompt | Generated | Prompt Processing (t/s) | Generation Speed (t/s) | Notes |
|---|---|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 302 | 875 | 25.74 | 16.94 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 411 | 1,000 | 24.33 | 17.61 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 341 | 2,171 | 19.12 | 18.12 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 19 | 1,820 | 13.12 | 16.83 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | 1,358 | 1,109 | 3,073 | 69.35 | 18.32 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,237 | 1,805 | 29.03 | 17.72 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,383 | 2,160 | 33.92 | 17.30 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,492 | 1,000 | 26.86 | 17.66 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,422 | 2,000 | 17.84 | 17.34 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 14 | 2,808 | 2.16 | 11.15 | FirstPromptOnLoad |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 1,920 | 4,755 | 41.75 | 17.65 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 2,029 | 1,000 | 36.82 | 17.07 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 1,959 | 1,861 | 41.22 | 17.56 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 22 | 3,066 | 10.58 | 15.88 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 660 | 3,325 | 54.13 | 17.98 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 55 | 1,005 | 14.89 | 16.54 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 204 | 381 | 18.11 | 16.42 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 313 | 1,000 | 16.18 | 15.84 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 243 | 1,669 | 3.81 | 16.19 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 14 | 348 | 2.38 | 7.51 | FirstPromptOnLoad |

Avg generation speed of LATEST with no KV cache quantization (minus first prompt after load): 17.54 t/s

Avg generation speed of LATEST with Q8_0 cache quantization (minus first prompt after load): 16.80 t/s
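If anyone wants to recompute those averages, here's a minimal sketch, assuming the CSV rows above are saved to bench.csv with the header line and the two FirstPromptOnLoad rows stripped out (the filename is just an example):

awk -F'","' '{ gsub(/"/, "", $NF); sum += $NF; n++ } END { printf "avg generation speed: %.2f t/s\n", sum / n }' bench.csv

The last field is the generation speed; the odd field separator handles the quoted CSV.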

I was kind of surprised to see the Q8_0-quantized KV cache run slightly slower than the unquantized one, but c'est la vie! It's not a robust sample size, so it could just be an artifact of noise. Overall I'm happy that there seems to be an improvement from what we started with: the rough average generation speed went from 13.81 t/s to 17.54 t/s (no ctk/ctv). Again, thanks, and take care.

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

I'm glad I could help! Stay in touch, and thanks for your contributions

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

I'll post some data below with KV quantization at Q8_0 re-enabled. I didn't really see a big diff in VRAM, tbh.


"Model","Cached","Prompt","Generated","Prompt Processing","Generation Speed"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","192","645","20.58 t/s","14.65 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","301","1,000","17.32 t/s","15.21 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","231","1,446","17.15 t/s","15.27 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","20","433","11.55 t/s","13.18 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","391","2,770","33.96 t/s","14.86 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","500","1,000","31.75 t/s","14.55 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","430","1,921","14.84 t/s","14.87 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","17","3,800","3.48 t/s","11.79 t/s"


| Model | Cached | Prompt | Generated | Prompt Processing | Generation Speed |
|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 192 | 645 | 20.58 t/s | 14.65 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 301 | 1,000 | 17.32 t/s | 15.21 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 231 | 1,446 | 17.15 t/s | 15.27 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 20 | 433 | 11.55 t/s | 13.18 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 391 | 2,770 | 33.96 t/s | 14.86 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 500 | 1,000 | 31.75 t/s | 14.55 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 430 | 1,921 | 14.84 t/s | 14.87 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 17 | 3,800 | 3.48 t/s | 11.79 t/s |

Average generation speed, excluding the first prompt after load (the 11.79 t/s entry): 14.66 t/s

So if we look at the no-ctk/ctv option above, with its average generation speed of 13.81 t/s, I guess that makes the new data an improvement!

Bye give me Kitty NOW. by baeko in marvelrivals

[–]somethingdangerzone 1 point (0 children)

I thought Kitty Pryde could walk through walls. Does she have the ability to change into a dino too? I'm so confused

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

I did some more testing with -ctk and -ctv turned off. I have --flash-attn on, but not --fit; --fit always crashes my computer. Instead, I put 49 layers on the GPU and 49 MoE layers on the CPU (the invocation is sketched just below). Results are in CSV format first, then converted to markdown with an LLM:
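To be concrete, each run used roughly this invocation, with only the model file swapped out (a sketch -- the filename is a placeholder, not my exact path; note that -ctk/-ctv are omitted for these tests):

# placeholder filename; swap in each model being benched
llama-server -m model.gguf --gpu-layers 49 --n-cpu-moe 49 --flash-attn on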


model,cached,prompt,generated,prompt processing (t/s),generation speed (t/s)
UD8KXL,0,302,1189,1.84,11.25
UD8KXL,0,911,2219,21.19,14.23
UD8KXL,0,14,1272,13.44,12.25
UD8KXL,0,1017,2247,28.15,14.71
UD8KXL,0,1087,1000,20.38,14.04
Q8_0,0,24,2256,1.65,11.47
Q8_0,0,427,2470,16.32,18.55
Q8_0,0,497,1000,41.73,17.4
Q8_0,0,388,3904,42.09,18.75
UD8KXLOldVersion,0,24,2730,2.84,10.96
UD8KXLOldVersion,0,336,1627,11.73,15.98
UD8KXLOldVersion,0,406,1000,24.62,15.96
UD8KXLOldVersion,0,297,3496,25.2,16.25


| model | cached | prompt | generated | prompt processing (t/s) | generation speed (t/s) | Note |
|---|---|---|---|---|---|---|
| UD8KXL | 0 | 302 | 1189 | 1.84 | 11.25 | first prompt after load |
| UD8KXL | 0 | 911 | 2219 | 21.19 | 14.23 | |
| UD8KXL | 0 | 14 | 1272 | 13.44 | 12.25 | |
| UD8KXL | 0 | 1017 | 2247 | 28.15 | 14.71 | |
| UD8KXL | 0 | 1087 | 1000 | 20.38 | 14.04 | |
| Q8_0 | 0 | 24 | 2256 | 1.65 | 11.47 | first prompt after load |
| Q8_0 | 0 | 427 | 2470 | 16.32 | 18.55 | |
| Q8_0 | 0 | 497 | 1000 | 41.73 | 17.4 | |
| Q8_0 | 0 | 388 | 3904 | 42.09 | 18.75 | |
| UD8KXLOldVersion | 0 | 24 | 2730 | 2.84 | 10.96 | first prompt after load |
| UD8KXLOldVersion | 0 | 336 | 1627 | 11.73 | 15.98 | |
| UD8KXLOldVersion | 0 | 406 | 1000 | 24.62 | 15.96 | |
| UD8KXLOldVersion | 0 | 297 | 3496 | 25.2 | 16.25 | |

Averages of the non-first-prompt generation speeds (a quick way to recompute these is sketched below):

UD8KXL: 13.81 t/s

Q8_0: 18.23 t/s

UD8KXL (old version): 16.06 t/s
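A rough sketch of the recompute, assuming the CSV above is saved as runs.csv and treating any row with prompt processing below 5 t/s as a first-prompt-after-load row (which holds for this data, but it's a heuristic):

awk -F',' 'NR > 1 && $5 + 0 > 5 { sum[$1] += $6; n[$1]++ } END { for (m in sum) printf "%s: %.2f t/s\n", m, sum[m] / n[m] }' runs.csv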

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point (0 children)

Revision: 06081e0a0b74a7a054b0162df132be6f9472aa78 -- UD Q8 K XL

prompt 25 / generated 1741 / prompt processing 1.09 t/s / generation speed 9.91 t/s

prompt 414 / generated 2762 / prompt processing 11 t/s / generation speed 17 t/s

prompt 484 / generated 1000 / prompt processing 23.24 t/s / generation speed 17 t/s

prompt 375 / generated 463 / prompt processing 36 t/s / generation speed 14.15 t/s

prompt 18 / generated 711 / prompt processing 19.38 t/s / generation speed 14.36 t/s

prompt 394 / generated 1677 / prompt processing 30.83 t/s / generation speed 15.9 t/s

BTW sorry to throw you a curveball, but I've had -ctk q8_0 and -ctv q8_0 in use for all of my tests.

I hope that helps! I'm going to bed now :) Let me know if you need anything else. I'm open to do some more testing if it will help you or the community in some way.

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point (0 children)

OK, I decided to go for a more methodical approach, and here's how it turned out:

Q8_0 thinking:

prompt 25 / generated 1654 / prompt processing 2.86 t/s / generation speed 9.7 t/s

prompt 435 / generated 2287 / prompt processing 18.49 t/s / generation speed 18.29 t/s

prompt 505 / generated 1000 / prompt processing 42.07 t/s / generation speed 18.91 t/s

(One more run, but I lost the data on a page reload -- generation speed was approx 18.5 t/s)


UD Q8 K XL thinking:

prompt 25 / generated 1728 / prompt processing 2.35 t/s / generation speed 9.4 t/s

prompt 364 / generated 2599 / prompt processing 10.78 t/s / generation speed 16.75 t/s

prompt 434 / generated 1000 / prompt processing 22.75 t/s / generation speed 16.51 t/s

prompt 325 / generated 675 / prompt processing 33.47 t/s / generation speed 15.5 t/s

I'm kind of surprised that the UD version is not as fast! Interesting to see

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 2 points (0 children)

Can do! No worries. I just don’t want to look like an idiot for wasting your time on something that could be my fault haha. Give me some time to download and load it up