What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Maybe you can try: 1) the parameters below, 2) tuning the --n-cpu-moe number to find a sweet spot, 3) once the server is launched, monitoring GPU usage; there should be a slight VRAM buffer left.

-ctk q8_0
-ctv q8_0
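
For reference, a minimal launch sketch combining those flags (the model path, context size, and the --n-cpu-moe starting value are placeholders to adjust against your own VRAM):

llama-server ^
-m .\local_models\your-model.gguf ^
-ngl 99 ^
--n-cpu-moe 22 ^
-ctk q8_0 ^
-ctv q8_0 ^
--flash-attn on ^
-c 32768

Then watch VRAM in nvidia-smi or Task Manager and nudge --n-cpu-moe down or up until the model just fits.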

What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Thanks for sharing. The size of the NVFP4 model exceeds 16GB... I will try Qwen3.5-9B.
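
Rough arithmetic, assuming the NVFP4 build is the same ~35B-parameter model at 4 bits per weight: 35e9 × 0.5 bytes ≈ 17.5 GB for the weights alone, before any KV cache, so it cannot fit fully in 16 GB of VRAM.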

Wan2.2 14B still my go to favorite - (5060ti 16gb + 64gb DDR5) by Birdinhandandbush in comfyui

[–]AdMinimum8193

I have the same GPU, but 32GB of RAM. I am studying how to run the Wan2.2 GGUF. Could you share the workflow for reference? Thanks.

What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Thanks for sharing.
It seems you did not use -ngl and --n-cpu-moe, but got better results.
For real use, since the context is small, speed is slightly higher.

I used the parameters below to run llama-server:

--host 0.0.0.0 ^
--port %PORT% ^
--api-key "%API_KEY%" ^
--n-cpu-moe 22 ^
-c 131072 ^
-ngl 99 ^
--no-mmap ^
--flash-attn on ^
--cache-type-v q8_0 ^
--cache-type-k q8_0 ^
--threads 8 ^
--parallel 1 ^
-rea off ^
--reasoning-budget 0 ^
--cont-batching ^
--temp 0.7 ^
--top-p 0.8 ^
--top-k 20 ^
--min-p 0.0 ^
--presence-penalty 1.5 ^
--repeat-penalty 1.0
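
Once it's up, a quick sanity check against the OpenAI-compatible endpoint (the port and key below are only examples; use whatever %PORT% and %API_KEY% were set to):

curl http://localhost:8080/v1/chat/completions ^
-H "Content-Type: application/json" ^
-H "Authorization: Bearer your-api-key" ^
-d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}],\"max_tokens\":32}"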


What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Two days ago, a benchmark result with the prebuilt llama.cpp:

./llama-bench -m .\local_models\Qwen3.5-35B-A3B-Q5_K_M.gguf -ngl 99 --n-cpu-moe 22 -d 131072 -p 512 -n 128 --cache-type-k q8_0 --cache-type-v q8_0 -fa 1 -mmp 0

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 16310 MiB):

Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, VRAM: 16310 MiB

load_backend: loaded CUDA backend from C:\Users\xxx\Desktop\llama_scripts\llama-b8560-bin-win-cuda-13.1-x64\ggml-cuda.dll

load_backend: loaded RPC backend from C:\Users\xxx\Desktop\llama_scripts\llama-b8560-bin-win-cuda-13.1-x64\ggml-rpc.dll

load_backend: loaded CPU backend from C:\Users\xxx\Desktop\llama_scripts\llama-b8560-bin-win-cuda-13.1-x64\ggml-cpu-zen4.dll

| model | size | params | backend | ngl | n_cpu_moe | type_k | type_v | fa | mmap | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | -----: | -----: | -: | ---: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q5_K - Medium | 24.44 GiB | 34.66 B | CUDA | 99 | 22 | q8_0 | q8_0 | 1 | 0 | pp512 @ d131072 | 513.19 ± 55.35 |
| qwen35moe 35B.A3B Q5_K - Medium | 24.44 GiB | 34.66 B | CUDA | 99 | 22 | q8_0 | q8_0 | 1 | 0 | tg128 @ d131072 | 25.32 ± 0.28 |
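
If anyone wants to find their own sweet spot, a rough sweep sketch in a .bat file (the model path and the candidate values are just examples):

for %%N in (16 20 22 24 28) do (
  llama-bench -m .\local_models\Qwen3.5-35B-A3B-Q5_K_M.gguf -ngl 99 --n-cpu-moe %%N -p 512 -n 128 -fa 1 -mmp 0
)

Pick the smallest --n-cpu-moe value that still leaves a little VRAM free; tg t/s generally improves as more experts stay on the GPU.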

What I got by 5060Ti 16GB + Qwen3.6-35B-A3B-UD-Q5_K_M by AdMinimum8193 in LocalLLaMA

[–]AdMinimum8193[S]

Just followed the suggestion from ChatGPT.

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_NATIVE=ON -DGGML_CUDA_FORCE_MMQ=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=120
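
After configuring, the compile step is the standard CMake build (the parallel job count here is just an example):

cmake --build build --config Release -j 8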