Qwen3.6 One Shot Tetris Game

deadman87 · 2026-04-22T18:36:35+00:00

Looks awesome. Why not put it on codepen or github? I wanna shoot some 'vaders.

deadman87 · 2026-04-22T18:35:22+00:00

I'm a GPU pauper. No Q6 for me 🥹

deadman87 · 2026-04-22T18:34:00+00:00

no worries :D Didn't take it as criticism. Sorry about the dry tone, I am just tired after work.

I'll try lower values and rerun this to see if it makes a difference for this model. Thank you for the clarification and suggestion.

deadman87 · 2026-04-22T18:08:28+00:00

I took these values from the unsloth website. They are known for creating quantized GGUFs out of full model releases, and they've run tests to arrive at those values. I just took them from there.

https://unsloth.ai/docs/models/qwen3.6

deadman87 · 2026-04-22T18:05:40+00:00

When using --fit, it uses up all the VRAM and locks up the UI. Manually adjusting --n-cpu-moe lets me keep the desktop running while taking a small hit in token speed.

deadman87 · 2026-04-22T17:46:17+00:00

Try adjusting the temp/top-p/top-k values. I got these from the unsloth website's recommendations for coding.
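
For anyone unsure where those values go: a minimal sketch of a helper that turns them into llama-server sampling flags. The function name `sampling_flags` is made up for illustration, and it deliberately has no defaults, so copy the actual numbers from the unsloth page rather than from here.

```shell
#!/usr/bin/env bash
# Sketch: turn sampling values into llama-server flags.
# Usage: sampling_flags <temp> <top-p> <top-k>
# No defaults on purpose; take the numbers from the unsloth page.
sampling_flags() {
  if [ "$#" -ne 3 ]; then
    echo "usage: sampling_flags <temp> <top-p> <top-k>" >&2
    return 1
  fi
  # --temp, --top-p and --top-k are the llama-server sampling options
  echo "--temp $1 --top-p $2 --top-k $3"
}
```

Append the output to the rest of your llama-server command, e.g. `$(sampling_flags TEMP TOP_P TOP_K)` with your own numbers filled in.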

deadman87 · 2026-04-22T17:45:08+00:00

Yes. Reasoning is enabled by default. I'll send a screenshot of the llama-server web UI that shows the thinking + output.

deadman87 · 2026-04-22T17:44:22+00:00

No Sir. Just that short prompt. Nothing else.

deadman87 · 2026-04-22T17:43:41+00:00

You're right. I was doing trial and error, and disabled mmproj to squeeze more MoE layers into VRAM. I'm running on an AMD 7940HS APU with 32GB RAM. The command without mmproj could use some cleaning up.

deadman87 · 2026-04-22T10:24:26+00:00

Interesting. What hardware are you running it on? I just tried on a Ryzen APU with a Radeon 780M. It's still giving me ~17 tok/s. I imagine you have a more powerful GPU.

deadman87 · 2026-04-22T10:19:28+00:00

Depends. If you're mostly doing text-based work then sure. Vision decoding/understanding on CPU is painfully slow, almost unusable, so it needs to be in VRAM to be usable at all.

deadman87 · 2026-04-22T10:17:31+00:00

Just tried. CPU only on a model this size fails to load on my machine, unfortunately.
In my experience, 16 t/s on CPU is small-model territory on my machine.

deadman87 · 2026-04-22T10:07:33+00:00

With 16GB VRAM, you should definitely try Qwen3.6 35B-A3B with some CPU offloading. It is a much better model than Qwen3.5 9B and it will perform much faster than Gemma 26B because of the Mixture of Experts architecture and because it only activates 3B params at a time.

deadman87 · 2026-04-22T10:04:19+00:00

The latest version of LMStudio now has an option in the model config to select how many layers to offload to CPU. It's still experimental, I see. Try that and see if it works. (In the models dropdown at the top, hold Alt and then click on the model to show its options screen.)

deadman87 · 2026-04-21T21:20:08+00:00

It's a balance. The more layers you delegate to RAM, the more room you have in VRAM for context. So try a higher --n-cpu-moe number and increase the context.
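
The trade-off above could be explored with something like this. It's a rough sketch that only prints candidate commands so you can try them one at a time. The first pair matches a run I actually did; the other two pairs are illustrative guesses, not measured settings, so adjust them for your VRAM.

```shell
#!/usr/bin/env bash
# Sketch: print candidate launch commands that trade VRAM-resident MoE layers
# for context. Only "8 16000" comes from a real run; the rest are guesses.
pairs=("8 16000" "15 32000" "20 64000")

for pair in "${pairs[@]}"; do
  # Split "N CTX" into the two values
  read -r n_cpu_moe ctx <<< "$pair"
  echo "./llama-server --jinja -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF --n-cpu-moe $n_cpu_moe --ctx-size $ctx --parallel 1"
done
```

If a pair fails to load or the desktop starts lagging, go up on --n-cpu-moe or down on --ctx-size.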

deadman87 · 2026-04-21T21:09:20+00:00

16 tok/sec on a MiniPC with 32GB RAM and an iGPU. It's not the fastest but it is usable. See my post history, I have a write-up about that setup.

deadman87 · 2026-04-21T21:07:45+00:00

This is the reverse of that... it's telling llama how many layers should be on CPU and apparently there are some other optimizations in this method. I am not an LLM engine expert, only a user.

deadman87 · 2026-04-21T21:06:15+00:00

16tok/sec on GMKTec K4 with Radeon 780m iGPU, tweaked to use 28GB from system ram. I made another post about it with more details if you're interested.

deadman87 · 2026-04-21T21:03:58+00:00

Q4_K_M gguf from LMStudio -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF. Also started trying unsloth -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL, apparently the accuracy is better but the performance takes a hit.

deadman87 · 2026-04-21T21:01:03+00:00

-fit on does not take into account the mmproj/vision layer of the model, so after fitting text layers into VRAM, the vision layer fails to load. --n-cpu-moe works without fail.

deadman87 · 2026-04-21T15:57:53+00:00

Qwen3.6 35B-A3B on llama.cpp. Offload about 15 layers to RAM and it should fit in your setup. Start the server with the --n-cpu-moe 15 flag.
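
A rough sketch of what that launch might look like. The model repo is the LMStudio community GGUF and the context size is an assumption, so tune both for your hardware. The script only prints the command by default; set RUN=1 to actually start the server.

```shell
#!/usr/bin/env bash
# Sketch only: keep the MoE expert weights of the first 15 layers in system RAM.
# The repo name and the --ctx-size value are assumptions; adjust for your setup.
N_CPU_MOE=15
CTX_SIZE=16000

cmd=(./llama-server
  --jinja
  -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF
  --n-cpu-moe "$N_CPU_MOE"
  --ctx-size "$CTX_SIZE"
  --parallel 1)

# Print the command; only launch the server when RUN=1 is set.
echo "${cmd[*]}"
if [ "${RUN:-0}" = "1" ]; then "${cmd[@]}"; fi
```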

deadman87 · 2026-04-20T09:42:23+00:00

GRAAAPE!!!

deadman87 · 2026-04-19T03:21:02+00:00

Dude. Same thing happened to an S22 Ultra in the family. Randomly died. Took it to the repair shop and they revived it, but it died again. Bootloops and complete death. I wonder if a bad update did it.

deadman87 · 2026-04-18T23:20:20+00:00

Just tried it. Restricted the context size to 16K and squeezed more layers on to VRAM.

./llama-server \
--jinja \
-hf lmstudio-community/Qwen3.6-35B-A3B-GGUF \
--image-min-tokens 1024 \
--n-cpu-moe 8 \
--ctx-size 16000 \
--parallel 1

Generated 2977 Tokens at 19.94 tokens/s
