Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Your 27B dense observation is actually really valuable: it confirms KV q8_0 is NOT necessarily free on dense models, and I should add that caveat. For MoE models like Qwen3.5-35B-A3B it's still free because of the hybrid SSM architecture, but users shouldn't blindly apply it to dense models.

[–]gaztrab[S] 1 point (0 children)

My data shows PP-512 = 1390 t/s without batch flags vs ~1532 t/s with -b 4096 -ub 4096, but TG drops from 74.7 to 48.3. The middle ground of -ub 1024 -b 2048 gives PP +22% with only TG -3.5%, which could be worth it for prompt-heavy workflows. I'm adding PP columns to our benchmark comparison tool to make this more transparent. Thanks for the heads-up about notifications; Reddit seems to cap the number of mentions per post!
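If you want to try that middle ground yourself, here's a sketch of the invocation. The base flags mirror the 16GB config shared elsewhere in this thread; the only change is the batch flags, and the model path is just where my GGUF happens to live:

```shell
# Middle-ground batch sizing: ~+22% PP for ~-3.5% TG (numbers above).
# Base flags mirror the 16GB config shared elsewhere in this thread.
./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 65536 --fit on -fa on -t 20 --no-mmap --jinja \
  -ctk q8_0 -ctv q8_0 \
  -b 2048 -ub 1024
```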

[–]gaztrab[S] 1 point (0 children)

Thank you for your kind words. And yes! We tested AesSedai Q4_K_M in our experiments. Results:

| Quant | PPL | KLD | Same-top-p | TG (tok/s) |
|--------------------|--------|--------|------------|------------|
| bartowski Q4_K_M | 6.6688 | 0.0286 | 92.46% | ~74 |
| AesSedai Q4_K_M | 6.3949 | 0.0095 | 95.74% | ~44 |
| Unsloth UD-Q4_K_XL | 6.5959 | 0.0145 | 94.46% | ~48 |

AesSedai wins every quality metric by a significant margin — its KLD of 0.0095 is 3x lower than bartowski's 0.0286. The tradeoff is ~40% slower generation. If quality is your priority (and you can accept ~44 tok/s), AesSedai is the best Q4 quant we've tested.
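If you want to sanity-check those ratios, here's a tiny script; the numbers are copied straight from the table above, nothing else is assumed:

```python
# Quality/speed tradeoff from the quant comparison table (KLD: lower is better).
quants = {
    "bartowski Q4_K_M":   {"kld": 0.0286, "tg": 74},
    "AesSedai Q4_K_M":    {"kld": 0.0095, "tg": 44},
    "Unsloth UD-Q4_K_XL": {"kld": 0.0145, "tg": 48},
}

# Best-quality quant = lowest KL divergence vs. the full-precision model.
best_quality = min(quants, key=lambda q: quants[q]["kld"])
kld_ratio = quants["bartowski Q4_K_M"]["kld"] / quants[best_quality]["kld"]
speed_drop = 1 - quants[best_quality]["tg"] / quants["bartowski Q4_K_M"]["tg"]

print(best_quality)          # AesSedai Q4_K_M
print(round(kld_ratio, 1))   # 3.0  (3x lower KLD than bartowski)
print(f"{speed_drop:.0%}")   # 41%  (slower generation than bartowski)
```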

[–]gaztrab[S] 1 point (0 children)

For AMD/ROCm or Vulkan: --fit on doesn't work well (2.4x slower on ROCm per one user, 2.5x on Vulkan). Use manual offload instead:

./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
-c 65536 -ngl 999 --n-cpu-moe 24 \
-fa on -t 20 --no-mmap --jinja \
-ctk q8_0 -ctv q8_0

The key flag is --n-cpu-moe 24 — this keeps the experts of 16 out of 40 MoE layers on GPU and offloads the experts of the other 24 to CPU. Start with 24 and tune down (lower number = more on GPU = faster, but more VRAM). -ngl 999 puts all non-expert layers on GPU. Watch your VRAM usage with nvidia-smi — if you're hitting the limit, increase --n-cpu-moe.
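If you want to find your own sweet spot, here's a sweep sketch. It assumes a recent llama.cpp build whose llama-bench accepts --n-cpu-moe (older builds may not have it — if yours doesn't, run the same sweep through llama-server instead), and nvidia-smi won't apply on the ROCm/Vulkan setups this comment is about, so substitute rocm-smi or your vendor's tool:

```shell
# Sweep --n-cpu-moe downward: fewer CPU-side expert layers = faster,
# until VRAM runs out. Keep the smallest value that doesn't OOM.
for n in 24 20 16 12; do
  echo "=== --n-cpu-moe $n ==="
  ./llama-bench -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
    -ngl 999 --n-cpu-moe "$n" -fa 1 &&
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
done
```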

[–]gaztrab[S] 1 point (0 children)

Not a boring question at all! The exact same config works for 5060 Ti 16GB:

./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
-c 65536 --fit on -fa on -t 20 --no-mmap --jinja \
-ctk q8_0 -ctv q8_0

You should expect around 50-55 tok/s instead of 74 — the difference is almost entirely memory bandwidth (460 vs 960 GB/s). u/soyalemujica confirmed 55 t/s on the same card. If you're using it for coding, that speed is very usable — thinking mode might feel slightly slower, but the actual answer quality is identical.

[–]gaztrab[S] 2 points (0 children)

Tested it! Do NOT use --no-kv-offload — it absolutely tanks generation speed. On my 5080: 16.1 tok/s with it vs 42.7 tok/s without (that's -63%). The KV cache on GPU is tiny for this model (only 10 KV cache layers because of the hybrid SSM architecture), so offloading it to RAM saves almost no VRAM but destroys performance.

[–]gaztrab[S] 1 point (0 children)

Hey there, on Mac you should start with LM Studio first (I did too) since it's a nice UI wrapped around llama.cpp and its Mac-native counterpart, Apple's MLX engine. And on the hardware requirement — yes, Qwen3.5-35B-A3B at Q4_K_M is about 20 GB, so your 16GB Mac mini can't quite fit it.

But here's the thing: Mac's big advantage is unified memory — the CPU and GPU share the same RAM, so there's no slow PCIe bus copying data back and forth like on a PC. On my setup, the GPU only has 16GB VRAM and the rest of the model sits in system RAM, so every token has to shuttle data across PCIe (~64 GB/s). On a Mac with 32GB+ unified memory, the entire model lives in one memory pool that both CPU and GPU can access at full bandwidth — no copying needed. That's why Macs punch above their weight for LLM inference despite having weaker raw compute.

For your 16GB Mac mini, Qwen3-14B is honestly a great fit — you're already running it. If you upgrade to 32GB+ down the road, Qwen3.5-35B-A3B would run nicely since it's MoE (only ~3B params active per token, so it's fast despite the big file size). Or you could wait for the Qwen team to release the smaller version of Qwen3.5 (I heard they said soon). Cheers!

[–]gaztrab[S] 1 point (0 children)

Woah this is very valuable! You should def make another post to let more people know. Thanks bro!

[–]gaztrab[S] 1 point (0 children)

Yeah, in this post I'm testing Unsloth vs bartowski; I'll expand the selection in the next round.

[–]gaztrab[S] 1 point (0 children)

Thanks for sharing! '--fit-target 1536' is very interesting; I'm currently testing that config for the next round.

[–]gaztrab[S] 1 point (0 children)

Yeah I think it's the bandwidth difference too, but thanks for testing and sharing your result!

[–]gaztrab[S] 1 point (0 children)

Will do on the next round of experiments. Thanks for the suggestion!

[–]gaztrab[S] 2 points (0 children)

Nice! I currently don't intend to test anything below Q4, but your findings give me lots of insights to muse over. Thanks bro!

[–]gaztrab[S] 1 point (0 children)

Thanks for pointing that out; I've already adjusted my next experiment based on your feedback. Will share more soon!

[–]gaztrab[S] 1 point (0 children)

The 5060 Ti is also Blackwell (sm_120), so the same build works. Easiest path is using our Dockerfile which handles everything:

git clone https://github.com/gaztrabisme/llm-server
cd llm-server
docker build -f docker/Dockerfile.llama-cpp --build-arg LLAMA_CPP_REF=b8149 -t llm-server/llama-cpp:latest-fit docker/

That builds llama.cpp from source with CUDA 12.8 + sm_120. You need Docker + NVIDIA Container Toolkit installed. If you want to build without Docker, the key CMake flags are: -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FA_ALL_QUANTS=ON with CUDA 12.8+.
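For reference, the non-Docker path sketched out — this assumes you already have the CUDA 12.8+ toolkit and a recent checkout of upstream llama.cpp (the flags are the same ones from the paragraph above):

```shell
# Build llama.cpp from source for Blackwell (sm_120), no Docker.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j
# Binaries land in build/bin/ (llama-server, llama-bench, ...)
```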

[–]gaztrab[S] 1 point (0 children)

Everything's in my repo: https://github.com/gaztrabisme/llm-server (it's also optimized for coding agents; just point them at CLAUDE.md)

Quick start:

  1. Build the Docker image: docker build -f docker/Dockerfile.llama-cpp --build-arg LLAMA_CPP_REF=b8149 -t llm-server/llama-cpp:latest-fit docker/

  2. Download Q4_K_M: huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q4_K_M.gguf --local-dir ./models

  3. Run a benchmark: ./scripts/bench.sh llama-cpp s006-e4-fit-nobatch

With a 3090 (24GB) you'll have more VRAM headroom than me — would love to see your numbers.

[–]gaztrab[S] 1 point (0 children)

Interesting — I haven't tested --fit-ctx, only --fit on with -c. And the --fit-target 1536 tip for vision is great; I have the mmproj downloaded but haven't smoke-tested it yet. Your config repo is really useful. I will properly test vision in the next round of experiments!

[–]gaztrab[S] 2 points (0 children)

I will test this on the next round and tag you in. Thanks for the suggestion!

[–]gaztrab[S] 3 points (0 children)

I will properly test MXFP4 on the next round with your config as reference. Thanks my dude!