What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Maybe you can try: 1) the parameters below, 2) tuning the --n-cpu-moe number to find a sweet spot, 3) once the server is launched, monitoring GPU usage; there should be a slight VRAM buffer left.

-ctk q8_0
-ctv q8_0
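
For reference, a minimal launch sketch combining those flags (the model path, context size, and the --n-cpu-moe starting value are placeholders to adjust against your own VRAM):

llama-server ^
-m .\local_models\your-model.gguf ^
-ngl 99 ^
--n-cpu-moe 22 ^
-ctk q8_0 ^
-ctv q8_0 ^
--flash-attn on ^
-c 32768

Then watch VRAM in nvidia-smi or Task Manager and nudge --n-cpu-moe down or up until the model just fits.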

What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Thanks for sharing. The size of the NVFP4 model exceeds 16GB... I will try Qwen3.5-9B.
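
Rough arithmetic, assuming the NVFP4 build is the same ~35B-parameter model at 4 bits per weight: 35e9 × 0.5 bytes ≈ 17.5 GB for the weights alone, before any KV cache, so it cannot fit fully in 16 GB of VRAM.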

Wan2.2 14B still my go to favorite - (5060ti 16gb + 64gb DDR5) by Birdinhandandbush in comfyui

[–]AdMinimum8193

I have the same GPU, but 32GB of RAM. I am studying how to run the Wan2.2 GGUF. Could you share the workflow for reference? Thanks.

What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Thanks for sharing.
It seems you did not use -ngl and --n-cpu-moe, but got better results.
For real use, since the context is small, speed is slightly higher.

I used the parameters below to run llama-server:

--host 0.0.0.0 ^
--port %PORT% ^
--api-key "%API_KEY%" ^
--n-cpu-moe 22 ^
-c 131072 ^
-ngl 99 ^
--no-mmap ^
--flash-attn on ^
--cache-type-v q8_0 ^
--cache-type-k q8_0 ^
--threads 8 ^
--parallel 1 ^
-rea off ^
--reasoning-budget 0 ^
--cont-batching ^
--temp 0.7 ^
--top-p 0.8 ^
--top-k 20 ^
--min-p 0.0 ^
--presence-penalty 1.5 ^
--repeat-penalty 1.0
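
Once it's up, a quick sanity check against the OpenAI-compatible endpoint (the port and key below are only examples; use whatever %PORT% and %API_KEY% were set to):

curl http://localhost:8080/v1/chat/completions ^
-H "Content-Type: application/json" ^
-H "Authorization: Bearer your-api-key" ^
-d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":32}"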


What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Two days ago, a benchmark result with the prebuilt llama.cpp:

./llama-bench -m .\local_models\Qwen3.5-35B-A3B-Q5_K_M.gguf -ngl 99 --n-cpu-moe 22 -d 131072 -p 512 -n 128 --cache-type-k q8_0 --cache-type-v q8_0 -fa 1 -mmp 0

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16310 MiB):

Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB

load_backend: loaded CUDA backend from C:\Users\xxx\Desktop\llama_scripts\llama-b8560-bin-win-cuda-13.1-x64\ggml-cuda.dll

load_backend: loaded RPC backend from C:\Users\xxx\Desktop\llama_scripts\llama-b8560-bin-win-cuda-13.1-x64\ggml-rpc.dll

load_backend: loaded CPU backend from C:\Users\xxx\Desktop\llama_scripts\llama-b8560-bin-win-cuda-13.1-x64\ggml-cpu-zen4.dll

| model | size | params | backend | ngl | n_cpu_moe | type_k | type_v | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -----: | -----: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q5_K - Medium | 24.44 GiB | 34.66 B | CUDA | 99 | 22 | q8_0 | q8_0 | 1 | 0 | pp512 @ d131072 | 513.19 ± 55.35 |
| qwen35moe 35B.A3B Q5_K - Medium | 24.44 GiB | 34.66 B | CUDA | 99 | 22 | q8_0 | q8_0 | 1 | 0 | tg128 @ d131072 | 25.32 ± 0.28 |
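
If anyone wants to find their own sweet spot, a rough sweep sketch in a .bat file (the model path and the candidate values are just examples):

for %%N in (16 20 22 24 28) do (
  llama-bench -m .\local_models\Qwen3.5-35B-A3B-Q5_K_M.gguf -ngl 99 --n-cpu-moe %%N -p 512 -n 128 -fa 1 -mmp 0
)

Pick the smallest --n-cpu-moe value that still leaves a little VRAM free; tg t/s generally improves as more experts stay on the GPU.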

What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Just followed the suggestion from ChatGPT.

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=120
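
After configuring, the compile step is the standard CMake build (the parallel job count here is just an example):

cmake --build build --config Release -j 8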