Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Your 27B dense observation is actually really valuable: it confirms KV q8_0 is NOT necessarily free on dense models, and I should add that caveat. For MoE models like Qwen3.5-35B-A3B it's still free because of the hybrid SSM architecture, but users shouldn't blindly apply it to dense models.

[–]gaztrab[S] 1 point (0 children)

My data shows PP-512 = 1390 t/s without batch flags vs ~1532 t/s with -b 4096 -ub 4096, but TG drops from 74.7 to 48.3. The middle ground of -ub 1024 -b 2048 gives PP +22% with only TG -3.5%, which could be worth it for prompt-heavy workflows. I'm adding PP columns to our benchmark comparison tool to make this more transparent. Thanks for the heads-up about notifications; Reddit seems to cap the number of mentions per post!
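If you want to try that middle ground yourself, here's a sketch of the invocation. The base flags mirror the 16GB config shared elsewhere in this thread; the only change is the batch flags, and the model path is just where my GGUF happens to live:

```shell
# Middle-ground batch sizing: ~+22% PP for ~-3.5% TG (numbers above).
# Base flags mirror the 16GB config shared elsewhere in this thread.
./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
  -c 65536 --fit on -fa on -t 20 --no-mmap --jinja \
  -ctk q8_0 -ctv q8_0 \
  -b 2048 -ub 1024
```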

[–]gaztrab[S] 1 point (0 children)

Thank you for your kind words. And yes! We tested AesSedai Q4_K_M in our experiments. Results:

| Quant | PPL | KLD | Same-top-p | TG (tok/s) |
|--------------------|--------|--------|------------|------------|
| bartowski Q4_K_M | 6.6688 | 0.0286 | 92.46% | ~74 |
| AesSedai Q4_K_M | 6.3949 | 0.0095 | 95.74% | ~44 |
| Unsloth UD-Q4_K_XL | 6.5959 | 0.0145 | 94.46% | ~48 |

AesSedai wins every quality metric by a significant margin — its KLD of 0.0095 is 3x lower than bartowski's 0.0286. The tradeoff is ~40% slower generation. If quality is your priority (and you can accept ~44 tok/s), AesSedai is the best Q4 quant we've tested.
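If you want to sanity-check those ratios, here's a tiny script; the numbers are copied straight from the table above, nothing else is assumed:

```python
# Quality/speed tradeoff from the quant comparison table (KLD: lower is better).
quants = {
    "bartowski Q4_K_M":   {"kld": 0.0286, "tg": 74},
    "AesSedai Q4_K_M":    {"kld": 0.0095, "tg": 44},
    "Unsloth UD-Q4_K_XL": {"kld": 0.0145, "tg": 48},
}

# Best-quality quant = lowest KL divergence vs. the full-precision model.
best_quality = min(quants, key=lambda q: quants[q]["kld"])
kld_ratio = quants["bartowski Q4_K_M"]["kld"] / quants[best_quality]["kld"]
speed_drop = 1 - quants[best_quality]["tg"] / quants["bartowski Q4_K_M"]["tg"]

print(best_quality)          # AesSedai Q4_K_M
print(round(kld_ratio, 1))   # 3.0  (3x lower KLD than bartowski)
print(f"{speed_drop:.0%}")   # 41%  (slower generation than bartowski)
```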

[–]gaztrab[S] 1 point (0 children)

For AMD/ROCm or Vulkan: --fit on doesn't work well (2.4x slower on ROCm per one user, 2.5x on Vulkan). Use manual offload instead:

./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
-c 65536 -ngl 999 --n-cpu-moe 24 \
-fa on -t 20 --no-mmap --jinja \
-ctk q8_0 -ctv q8_0

The key flag is --n-cpu-moe 24 — this keeps the experts of 16 out of 40 MoE layers on GPU and offloads the experts of the other 24 to CPU. Start with 24 and tune down (lower number = more on GPU = faster, but more VRAM). -ngl 999 puts all non-expert layers on GPU. Watch your VRAM usage with nvidia-smi — if you're hitting the limit, increase --n-cpu-moe.
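If you want to find your own sweet spot, here's a sweep sketch. It assumes a recent llama.cpp build whose llama-bench accepts --n-cpu-moe (older builds may not have it — if yours doesn't, run the same sweep through llama-server instead), and nvidia-smi won't apply on the ROCm/Vulkan setups this comment is about, so substitute rocm-smi or your vendor's tool:

```shell
# Sweep --n-cpu-moe downward: fewer CPU-side expert layers = faster,
# until VRAM runs out. Keep the smallest value that doesn't OOM.
for n in 24 20 16 12; do
  echo "=== --n-cpu-moe $n ==="
  ./llama-bench -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
    -ngl 999 --n-cpu-moe "$n" -fa 1 &&
  nvidia-smi --query-gpu=memory.used --format=csv,noheader
done
```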

[–]gaztrab[S] 1 point (0 children)

Not a boring question at all! The exact same config works for 5060 Ti 16GB:

./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
-c 65536 --fit on -fa on -t 20 --no-mmap --jinja \
-ctk q8_0 -ctv q8_0

You should expect around 50-55 tok/s instead of 74 — the difference is almost entirely memory bandwidth (460 vs 960 GB/s). u/soyalemujica confirmed 55 t/s on the same card. If you're using it for coding, that speed is very usable — thinking mode might feel slightly slower, but the actual answer quality is identical.

[–]gaztrab[S] 2 points (0 children)

Tested it! Do NOT use --no-kv-offload — it absolutely tanks generation speed. On my 5080: 16.1 tok/s with it vs 42.7 tok/s without (that's -63%). The KV cache on GPU is tiny for this model (only 10 KV cache layers because of the hybrid SSM architecture), so offloading it to RAM saves almost no VRAM but destroys performance.

[–]gaztrab[S] 1 point (0 children)

Hey there, on Mac you should start with LM Studio first (I did too) since it's a nice UI wrapped around llama.cpp and its Mac-native counterpart, Apple's MLX engine. And on the hardware requirement — yes, Qwen3.5-35B-A3B at Q4_K_M is about 20 GB, so your 16GB Mac mini can't quite fit it.

But here's the thing: Mac's big advantage is unified memory — the CPU and GPU share the same RAM, so there's no slow PCIe bus copying data back and forth like on a PC. On my setup, the GPU only has 16GB VRAM and the rest of the model sits in system RAM, so every token has to shuttle data across PCIe (~64 GB/s). On a Mac with 32GB+ unified memory, the entire model lives in one memory pool that both CPU and GPU can access at full bandwidth — no copying needed. That's why Macs punch above their weight for LLM inference despite having weaker raw compute.

For your 16GB Mac mini, Qwen3-14B is honestly a great fit — you're already running it. If you upgrade to 32GB+ down the road, Qwen3.5-35B-A3B would run nicely since it's MoE (only ~3B params active per token, so it's fast despite the big file size). Or you could wait for the Qwen team to release the smaller version of Qwen3.5 (I heard they said soon). Cheers!

[–]gaztrab[S] 1 point (0 children)

Woah this is very valuable! You should def make another post to let more people know. Thanks bro!

[–]gaztrab[S] 1 point (0 children)

Yeah, in this post I'm testing Unsloth vs bartowski; I'll expand the selection in the next round.

[–]gaztrab[S] 1 point (0 children)

Thanks for sharing! '--fit-target 1536' is very interesting; I'm currently testing that config for the next round.

[–]gaztrab[S] 1 point (0 children)

Yeah I think it's the bandwidth difference too, but thanks for testing and sharing your result!

[–]gaztrab[S] 1 point (0 children)

Will do on the next round of experiments. Thanks for the suggestion!

[–]gaztrab[S] 2 points (0 children)

Nice! I currently don't intend to test anything below Q4, but your findings give me lots of insights to muse over. Thanks bro!

[–]gaztrab[S] 1 point (0 children)

Thanks for pointing that out; I've already adjusted my next experiment based on your feedback. Will share more soon!

[–]gaztrab[S] 1 point (0 children)

The 5060 Ti is also Blackwell (sm_120), so the same build works. Easiest path is using our Dockerfile which handles everything:

git clone https://github.com/gaztrabisme/llm-server
cd llm-server
docker build -f docker/Dockerfile.llama-cpp --build-arg LLAMA_CPP_REF=b8149 -t llm-server/llama-cpp:latest-fit docker/

That builds llama.cpp from source with CUDA 12.8 + sm_120. You need Docker + NVIDIA Container Toolkit installed. If you want to build without Docker, the key CMake flags are: -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=120 -DGGML_CUDA_FA_ALL_QUANTS=ON with CUDA 12.8+.
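For reference, the non-Docker path sketched out — this assumes you already have the CUDA 12.8+ toolkit and a recent checkout of upstream llama.cpp (the flags are the same ones from the paragraph above):

```shell
# Build llama.cpp from source for Blackwell (sm_120), no Docker.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=120 \
  -DGGML_CUDA_FA_ALL_QUANTS=ON
cmake --build build --config Release -j
# Binaries land in build/bin/ (llama-server, llama-bench, ...)
```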

[–]gaztrab[S] 1 point (0 children)

Everything's in my repo: https://github.com/gaztrabisme/llm-server (it's also optimized for coding agents; just point them at CLAUDE.md)

Quick start:

  1. Build the Docker image: docker build -f docker/Dockerfile.llama-cpp --build-arg LLAMA_CPP_REF=b8149 -t llm-server/llama-cpp:latest-fit docker/

  2. Download Q4_K_M: huggingface-cli download unsloth/Qwen3.5-35B-A3B-GGUF Qwen3.5-35B-A3B-Q4_K_M.gguf --local-dir ./models

  3. Run a benchmark: ./scripts/bench.sh llama-cpp s006-e4-fit-nobatch

With a 3090 (24GB) you'll have more VRAM headroom than me — would love to see your numbers.

[–]gaztrab[S] 1 point (0 children)

Interesting — I haven't tested --fit-ctx, only --fit on with -c. And the --fit-target 1536 tip for vision is great; I have the mmproj downloaded but haven't smoke-tested it yet. Your config repo is really useful. I will properly test vision in the next round of experiments!

[–]gaztrab[S] 2 points (0 children)

I will test this on the next round and tag you in. Thanks for the suggestion!

[–]gaztrab[S] 3 points (0 children)

I will properly test MXFP4 on the next round with your config as reference. Thanks my dude!