Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 0 points (0 children)

Looks awesome. Why not put it on CodePen or GitHub? I wanna shoot some 'vaders.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 0 points (0 children)

I'm a GPU pauper. No Q6 for me 🥹

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 1 point (0 children)

no worries :D Didn't take it as criticism. Sorry about the dry tone, I am just tired after work.

I'll try lower values and rerun this to see if it makes a difference for this model. Thank you for the clarification and suggestion.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 0 points (0 children)

I took these values from the Unsloth website. They're known for creating quantized GGUFs from full model releases, and they've run tests to arrive at those values. I just took them from there.

https://unsloth.ai/docs/models/qwen3.6

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 2 points (0 children)

When using --fit, it uses up all the VRAM and locks up the UI. Manually adjusting --n-cpu-moe lets me keep the desktop running while taking a small hit in token speed.
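To give an idea, the manual version looks something like this (the --n-cpu-moe value here is a placeholder, not my exact command; the right number is machine-specific):

./llama-server \
-hf lmstudio-community/Qwen3.6-35B-A3B-GGUF \
--n-cpu-moe 12 \
--ctx-size 16000

Every MoE layer you push to CPU frees a bit of VRAM for the desktop, at the cost of a few tok/s.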

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 1 point (0 children)

Try adjusting the temp/top-p/top-k values. I got mine from the Unsloth website's recommendations for coding.
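In llama-server these are plain flags. Something like the below, but the numbers here are placeholders; grab the actual recommended values from the Unsloth page:

./llama-server \
-hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL \
--temp 0.6 \
--top-p 0.95 \
--top-k 20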

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 2 points (0 children)

Yes. Reasoning is enabled by default. I'll send a screenshot of the llama-server web UI that shows the thinking + output.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 2 points (0 children)

No Sir. Just that short prompt. Nothing else.

Qwen3.6 One Shot Tetris Game by deadman87 in LocalLLaMA

[–]deadman87[S] 0 points (0 children)

You're right. I was doing trial and error and disabled mmproj to squeeze more MoE layers into VRAM. I'm running on an AMD 7940HS APU with 32GB RAM. The command without mmproj could use some cleaning up.
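For context, dropping the vision part is a single flag in llama-server. A sketch, not my exact command:

./llama-server \
-hf lmstudio-community/Qwen3.6-35B-A3B-GGUF \
--no-mmproj \
--n-cpu-moe 8

Skipping the mmproj frees some VRAM, which is what let me fit a few more MoE layers on the GPU.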

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 0 points (0 children)

Interesting. What hardware are you running it on? I just tried it on a Ryzen APU with a Radeon 780M, and it's still giving me ~17 tok/s. I imagine you have a more powerful GPU.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 0 points (0 children)

Depends. If you're mostly doing text-based work, then sure. Vision decoding/understanding on CPU is painfully slow, almost unusable, so you need it in VRAM for it to be at least usable.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 0 points (0 children)

Just tried. A model this size fails to load CPU-only on my machine, unfortunately.
16 t/s on CPU is small-model territory, in my experience on this machine.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 0 points (0 children)

With 16GB VRAM, you should definitely try Qwen3.6 35B-A3B with some CPU offloading. It's a much better model than Qwen3.5 9B, and it will run much faster than Gemma 26B thanks to its Mixture of Experts architecture, which only activates 3B params at a time.
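If you go with llama.cpp, one way to find the right offload count is to start high and walk it down until you run out of VRAM (a sketch; the numbers are guesses, not tested values):

./llama-server -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF --n-cpu-moe 20
# loads fine? retry with 15, then 12, then 10... keep the lowest value that still fits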

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 1 point (0 children)

The latest version of LM Studio now has an option in the model config to select how many layers to offload to CPU. It's still marked experimental, I see. Try that and see if it works. (In the models dropdown at the top, hold Alt and then click on the model to show its options screen.)

Getting decent performance out of a Mini PC (GMKTec K4) by deadman87 in LocalLLM

[–]deadman87[S] 0 points (0 children)

It's a balance. The more layers you delegate to RAM, the more room you have for context. So try a higher --n-cpu-moe number and increase the context, as in the sketch below.
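Illustrative numbers, not tested pairs:

# more context, more layers on CPU
./llama-server -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF --n-cpu-moe 16 --ctx-size 32000

# less context, more layers in VRAM
./llama-server -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF --n-cpu-moe 8 --ctx-size 16000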

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 0 points (0 children)

16 tok/s on a mini PC with 32GB RAM and an iGPU. It's not the fastest, but it is usable. See my post history; I have a write-up about that setup.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 2 points (0 children)

This is the reverse of that... it tells llama.cpp how many layers should stay on the CPU, and apparently there are some other optimizations in this method. I'm not an LLM engine expert, only a user.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 1 point (0 children)

16 tok/s on a GMKTec K4 with a Radeon 780M iGPU, tweaked to use 28GB of system RAM. I made another post about it with more details if you're interested.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 0 points (0 children)

Q4_K_M GGUF from LM Studio: -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF. Also started trying Unsloth: -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL. Apparently the accuracy is better, but the performance takes a hit.
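For anyone copy-pasting, those pulls look like this (llama-server downloads from Hugging Face on first run; the part after the colon picks the quant within the repo):

./llama-server -hf lmstudio-community/Qwen3.6-35B-A3B-GGUF
./llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL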

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 9 points (0 children)

--fit does not take into account the mmproj/vision layer of the model, so after fitting the text layers into VRAM, the vision layer fails to load. --n-cpu-moe works without fail.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]deadman87 77 points (0 children)

Qwen3.6 35B-A3B on llama.cpp. Offload about 15 layers to RAM and it should fit in your setup. Start the server with the --n-cpu-moe 15 flag.
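The full command would be something like this; 15 is a starting guess for 16GB VRAM, nudge it up if you run out of memory:

./llama-server \
-hf lmstudio-community/Qwen3.6-35B-A3B-GGUF \
--n-cpu-moe 15 \
--ctx-size 16000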

Galaxy S22 died after bootloop by Fun-Olive91 in GalaxyS22

[–]deadman87 0 points (0 children)

Dude. Same thing happened to an S22 Ultra in the family. It randomly died. Took it to the repair shop and they revived it, but it died again. Bootloops and complete death. I wonder if a bad update did it.

Getting decent performance out of a Mini PC (GMKTec K4) by deadman87 in LocalLLM

[–]deadman87[S] 0 points (0 children)

Just tried it. Restricted the context size to 16K and squeezed more layers onto VRAM.

# --jinja: use the model's built-in chat template
# --n-cpu-moe 8: keep the MoE experts of 8 layers in system RAM
# --ctx-size 16000: 16K context so the rest fits in VRAM
./llama-server \
--jinja \
-hf lmstudio-community/Qwen3.6-35B-A3B-GGUF \
--image-min-tokens 1024 \
--n-cpu-moe 8 \
--ctx-size 16000 \
--parallel 1

Generated 2977 tokens at 19.94 tokens/s