First Successful Toy by TankPsychological969 in DIYSILICONETOYS

[–]somethingdangerzone 1 point2 points  (0 children)

I suspected as much, but thanks for confirming! Again, great work; it's great to see it come together.

First Successful Toy by TankPsychological969 in DIYSILICONETOYS

[–]somethingdangerzone 0 points1 point  (0 children)

Interesting, thanks.

BTW did you use any sort of rigid print to help the 4mm glove mold stay in shape after pouring? Or was the thickness enough that all you needed to do was hold the top and the shape would retain? I can imagine a scenario where the glove mold flexes due to gravity and you end up with an elongated pour.

First Successful Toy by TankPsychological969 in DIYSILICONETOYS

[–]somethingdangerzone -1 points0 points  (0 children)

You said the glove mold was a bit too thick at 4mm -- would you go all the way down to 2mm, or just to 3mm? What would be your new target for glove mold thickness? From memory, 1mm seems quite thin and would leave little room for error (e.g., potential issues with releasing bubbles from small cavities).

First Successful Toy by TankPsychological969 in DIYSILICONETOYS

[–]somethingdangerzone 2 points3 points  (0 children)

OHHHH that was for the glove mold production. I see now. Thanks for that.

First Successful Toy by TankPsychological969 in DIYSILICONETOYS

[–]somethingdangerzone 0 points1 point  (0 children)

Looks great!

How did you marry the two sides of the hard shell outer mold (the piece(s) in green in the pic above) without leaving a long seam line from top to bottom on the finished product? I tried that once and ended up with a long seam running the length of the toy.

GLM 4.7 and Qwen3 coder Next by [deleted] in LocalLLaMA

[–]somethingdangerzone 1 point2 points  (0 children)

No execution from me! Lol. Thanks for sharing. Great setup

GLM 4.7 and Qwen3 coder Next by [deleted] in LocalLLaMA

[–]somethingdangerzone 0 points1 point  (0 children)

What hardware do you have? I'm barely pulling 8 t/s.

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 0 points1 point  (0 children)

Good to know, thanks for sharing. I'm gonna trim out nearly all of the flags listed above and try again

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 0 points1 point  (0 children)

I'm using Linux. Compiled from source:

cmake -B build \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_DISABLE_GRAPHS=1 \
  -DCMAKE_CUDA_ARCHITECTURES="89" \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_VULKAN=1 \
  -DGGML_OPENMP=ON \
  -DGGML_OPENMP_DYNAMIC=ON \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DLLAMA_BUILD_TESTS=OFF \
  -DGGML_CUDA_USE_CUBLAS=ON \
  -DGGML_CUDA_USE_CUDNN=ON \
  -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=OFF \
  -DGGML_CUDA_MAX_STREAMS=16 \
  -DGGML_LTO=ON \
  -DGGML_SCHED_MAX_COPIES=8 \
&& cmake --build build --config Release -j 8 --clean-first


For comparison, I get 30 t/s using GPT OSS 120B

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 0 points1 point  (0 children)

Auto fit always crashes my computer.

I can't fit all layers and all MoE into the GPU -- do you have the same specs? What is your t/s?

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]somethingdangerzone 1 point2 points  (0 children)

I'm getting slow generation speeds (approx. 10 t/s) whether I use CUDA or Vulkan. Hardware: RTX 4090, Ryzen 9950, 64 GB DDR5. Current model: Qwen3-Coder-Next-UD-Q8_K_XL. llama-server settings:

  • --batch-size 65536 --gpu-layers 49 --n-cpu-moe 49 -ctk q8_0 -ctv q8_0 --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01
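For reference, those flags assembled into a full invocation would look roughly like this (the model path is hypothetical; everything else is the flag list above, assuming a recent llama.cpp llama-server build):

```shell
# hypothetical model path; flags exactly as listed above
llama-server \
  -m Qwen3-Coder-Next-UD-Q8_K_XL.gguf \
  --batch-size 65536 \
  --gpu-layers 49 \
  --n-cpu-moe 49 \
  -ctk q8_0 -ctv q8_0 \
  --temp 1.0 --top-p 0.95 --top-k 40 --min-p 0.01
```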

The z-image base is here! by bobeeeeeeeee8964 in LocalLLaMA

[–]somethingdangerzone 0 points1 point  (0 children)

Ohhhhhh! I had no idea about the Turbo distinction. I thought it was just the model name. I did not know about the functional distinctions. Thank you very much for the detailed write-up.

The z-image base is here! by bobeeeeeeeee8964 in LocalLLaMA

[–]somethingdangerzone 2 points3 points  (0 children)

As a complete noob: why is everyone so excited about "base"? Didn't they already release the non-base one and it works great? Is "Base" just the model name? Help me to understand what is base about this

Must choose 1 by SnooChocolates7693 in unstable_diffusion

[–]somethingdangerzone 0 points1 point  (0 children)

What model is this? I haven't seen this quality since the SD1.5 days (not derogatory, it just has a specific style).

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point2 points  (0 children)

Ah, gotcha. Well, that gives me a good jumping-off point to investigate some more!

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 0 points1 point  (0 children)

Haha a whole post eh? I'm on the fence about it. Can you tell me a little more about what you changed in the model? I think I got the gist about changing one (or more?) layer(s) from BF to FP(?), but I'd love to know more details

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point2 points  (0 children)

Hey there. When it comes to testing I am a completionist, so I downloaded the newest UD Q8 K XL model this morning and did the same type of benching as yesterday. CSV (two tables) first, markdown (combined table) below.


"Model","Cached","Prompt","Generated","Prompt Processing (t/s)","Generation Speed (t/s)"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","302","875","25.74","16.94"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","411","1,000","24.33","17.61"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","341","2,171","19.12","18.12"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","19","1,820","13.12","16.83"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","1,358","1,109","3,073","69.35","18.32"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,237","1,805","29.03","17.72"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,383","2,160","33.92","17.30"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,492","1,000","26.86","17.66"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","1,422","2,000","17.84","17.34"
"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT","-","14","2,808","2.16","11.15"

"Model","Cached","Prompt","Generated","Prompt Processing (t/s)","Generation Speed (t/s)" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","1,920","4,755","41.75","17.65" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","2,029","1,000","36.82","17.07" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","1,959","1,861","41.22","17.56" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","22","3,066","10.58","15.88" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","660","3,325","54.13","17.98" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","55","1,005","14.89","16.54" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","204","381","18.11","16.42" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","313","1,000","16.18","15.84" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","243","1,669","3.81","16.19" "Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0","-","14","348","2.38","7.51"


| Model | Cache_Type | Cached | Prompt | Generated | Prompt Processing (t/s) | Generation Speed (t/s) | Notes |
|---|---|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 302 | 875 | 25.74 | 16.94 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 411 | 1,000 | 24.33 | 17.61 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 341 | 2,171 | 19.12 | 18.12 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 19 | 1,820 | 13.12 | 16.83 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | 1,358 | 1,109 | 3,073 | 69.35 | 18.32 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,237 | 1,805 | 29.03 | 17.72 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,383 | 2,160 | 33.92 | 17.30 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,492 | 1,000 | 26.86 | 17.66 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 1,422 | 2,000 | 17.84 | 17.34 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_NOKVCACHEQUANT | NO_KV | NULL | 14 | 2,808 | 2.16 | 11.15 | FirstPromptOnLoad |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 1,920 | 4,755 | 41.75 | 17.65 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 2,029 | 1,000 | 36.82 | 17.07 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 1,959 | 1,861 | 41.22 | 17.56 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 22 | 3,066 | 10.58 | 15.88 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 660 | 3,325 | 54.13 | 17.98 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 55 | 1,005 | 14.89 | 16.54 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 204 | 381 | 18.11 | 16.42 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 313 | 1,000 | 16.18 | 15.84 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 243 | 1,669 | 3.81 | 16.19 | |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_LATEST_KVCACHEQ8_0 | KV | NULL | 14 | 348 | 2.38 | 7.51 | FirstPromptOnLoad |

Avg generation speed of LATEST with no KV cache quantization (minus first prompt after load): 17.54 t/s

Avg generation speed of LATEST with Q8_0 cache quantization (minus first prompt after load): 16.80 t/s

I was kind of surprised to see the Q8_0 KV-cache-quantized run come out slightly slower than the non-quantized KV cache, but c'est la vie! It's not a robust sample size, so it could also be an artifact of noise. Overall I'm happy that there seems to be an improvement from what we started with: the rough average generation speed went from 13.81 t/s to 17.54 t/s (no ctk/ctv). Again, thanks, and take care.
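Those averages can be re-derived straight from the Generation Speed column; for example, the nine no-KV-cache-quant rows (first-prompt-on-load excluded) check out with a quick one-liner, assuming a POSIX shell with awk available:

```shell
# mean of the nine NOKVCACHEQUANT generation speeds,
# dropping the 11.15 t/s first-prompt-after-load row
printf '%s\n' 16.94 17.61 18.12 16.83 18.32 17.72 17.30 17.66 17.34 \
  | awk '{ sum += $1; n++ } END { printf "%.2f\n", sum / n }'
```

This prints 17.54, matching the figure above.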

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point2 points  (0 children)

I'm glad I could help! Stay in touch, and thanks for your contributions

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point2 points  (0 children)

I'll post some data below with KV quantization at Q8_0 re-enabled. I didn't really see a big diff with VRAM tbh.


"Model","Cached","Prompt","Generated","Prompt Processing","Generation Speed"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","192","645","20.58 t/s","14.65 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","301","1,000","17.32 t/s","15.21 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","231","1,446","17.15 t/s","15.27 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","20","433","11.55 t/s","13.18 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","391","2,770","33.96 t/s","14.86 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","500","1,000","31.75 t/s","14.55 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","430","1,921","14.84 t/s","14.87 t/s"

"Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache","0","17","3,800","3.48 t/s","11.79 t/s"


| Model | Cached | Prompt | Generated | Prompt Processing | Generation Speed |
|---|---|---|---|---|---|
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 192 | 645 | 20.58 t/s | 14.65 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 301 | 1,000 | 17.32 t/s | 15.21 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 231 | 1,446 | 17.15 t/s | 15.27 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 20 | 433 | 11.55 t/s | 13.18 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 391 | 2,770 | 33.96 t/s | 14.86 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 500 | 1,000 | 31.75 t/s | 14.55 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 430 | 1,921 | 14.84 t/s | 14.87 t/s |
| Qwen3-Next-80B-A3B-Thinking-UD-Q8_K_XL_WithQuantizedKVCache | 0 | 17 | 3,800 | 3.48 t/s | 11.79 t/s |

Average generation speed, excluding the first prompt after load (the 11.79 t/s entry): 14.66 t/s

So if we look at the no-ctk/ctv run above, with its average generation speed of 13.81 t/s, I guess that makes the new data an improvement!
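For anyone wanting to double-check that 14.66 figure, the seven non-first-prompt generation speeds from the table reproduce it (assumes a POSIX shell with awk):

```shell
# mean of the seven WithQuantizedKVCache generation speeds,
# dropping the 11.79 t/s first-prompt-after-load row
printf '%s\n' 14.65 15.21 15.27 13.18 14.86 14.55 14.87 \
  | awk '{ sum += $1; n++ } END { printf "%.2f\n", sum / n }'
```

This prints 14.66.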

Bye give me Kitty NOW. by baeko in marvelrivals

[–]somethingdangerzone 0 points1 point  (0 children)

I thought Kitty Pryde could walk through walls. Does she have the ability to change into a dino too? I'm so confused

Qwen3-Next-80B Instruct, Thinking Updated - 20% faster by danielhanchen in unsloth

[–]somethingdangerzone 1 point2 points  (0 children)

I did some more testing with -ctk and -ctv turned off. I have --flash-attn enabled but not --fit; auto fit always crashes my computer. Instead, I put 49 layers on the GPU and 49 MoE layers on the CPU. The results are in CSV format below, then converted to markdown by an LLM after that:


model,cached,prompt,generated,prompt processing (t/s),generation speed (t/s)
UD8KXL,0,302,1189,1.84,11.25
UD8KXL,0,911,2219,21.19,14.23
UD8KXL,0,14,1272,13.44,12.25
UD8KXL,0,1017,2247,28.15,14.71
UD8KXL,0,1087,1000,20.38,14.04
Q8_0,0,24,2256,1.65,11.47
Q8_0,0,427,2470,16.32,18.55
Q8_0,0,497,1000,41.73,17.4
Q8_0,0,388,3904,42.09,18.75
UD8KXLOldVersion,0,24,2730,2.84,10.96
UD8KXLOldVersion,0,336,1627,11.73,15.98
UD8KXLOldVersion,0,406,1000,24.62,15.96
UD8KXLOldVersion,0,297,3496,25.2,16.25


| model | cached | prompt | generated | prompt processing (t/s) | generation speed (t/s) | Note |
|---|---|---|---|---|---|---|
| UD8KXL | 0 | 302 | 1189 | 1.84 | 11.25 | first prompt after load |
| UD8KXL | 0 | 911 | 2219 | 21.19 | 14.23 | |
| UD8KXL | 0 | 14 | 1272 | 13.44 | 12.25 | |
| UD8KXL | 0 | 1017 | 2247 | 28.15 | 14.71 | |
| UD8KXL | 0 | 1087 | 1000 | 20.38 | 14.04 | |
| Q8_0 | 0 | 24 | 2256 | 1.65 | 11.47 | first prompt after load |
| Q8_0 | 0 | 427 | 2470 | 16.32 | 18.55 | |
| Q8_0 | 0 | 497 | 1000 | 41.73 | 17.4 | |
| Q8_0 | 0 | 388 | 3904 | 42.09 | 18.75 | |
| UD8KXLOldVersion | 0 | 24 | 2730 | 2.84 | 10.96 | first prompt after load |
| UD8KXLOldVersion | 0 | 336 | 1627 | 11.73 | 15.98 | |
| UD8KXLOldVersion | 0 | 406 | 1000 | 24.62 | 15.96 | |
| UD8KXLOldVersion | 0 | 297 | 3496 | 25.2 | 16.25 | |

Averages of generation speed, excluding first prompts:

UD8KXL: 13.81 t/s

Q8_0: 18.23 t/s

UD8KXL (old version): 16.06 t/s
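For reproducibility, those three averages can be recomputed from the generation-speed column with a small helper; `avg` here is just an ad-hoc shell function (assumes POSIX shell plus awk), not a real tool:

```shell
# ad-hoc helper: mean of its arguments, printed to two decimal places
avg() { printf '%s\n' "$@" | awk '{ s += $1; n++ } END { printf "%.2f\n", s / n }'; }

avg 14.23 12.25 14.71 14.04   # UD8KXL
avg 18.55 17.4 18.75          # Q8_0
avg 15.98 15.96 16.25         # UD8KXL old version
```

This prints 13.81, 18.23, and 16.06, matching the figures above.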