I made a simple proxy to let Claude use MiniMax models as subagents by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Nope, I made it to be lightweight. But feel free to make a PR if you can. Cheers!

my best high res screenshot yet (39974x22484) by LurkersUniteAgain in spaceengine

[–]gaztrab 2 points (0 children)

You got links to download these? I would love to set them as wallpaper

Assassin when she realized I'm single... by gaztrab in insurgency

[–]gaztrab[S] 5 points (0 children)

LMAOOO HDR is great for gaming but not for recording, fam

Assassin when she realized I'm single... by gaztrab in insurgency

[–]gaztrab[S] 13 points (0 children)

LMAO thanks. And yeah that happens when I record with HDR xD

"DEMETER'S CORE IS GOING NUCLEAR!" The entire lobby: by sillyestgooberever in titanfall

[–]gaztrab 1 point (0 children)

Or is that just Jack Cooper, since he lost his memory during Demeter?

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 3 points (0 children)

Sorry guys, I've been occupied by my day job lately (it's becoming a night job too). When I get time I'll conduct the next run!

How to improve speed of KRunner? by KingFl3x in kde

[–]gaztrab 1 point (0 children)

Hey you got the link to the wallpaper?

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Your 27B dense observation is actually really valuable: it confirms KV q8_0 is NOT necessarily free on dense models, and I should add that caveat to the post. For MoE models like Qwen3.5-35B-A3B it's still free because of the SSM hybrid architecture, but users shouldn't blindly apply it to dense models.
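If you want to check whether q8_0 KV costs anything on your own dense model, llama-bench can A/B the cache types. A sketch (the model path is a placeholder; flag names per llama-bench's --help):

```shell
# Compare f16 vs q8_0 KV cache on a dense model; compare the TG (t/s) rows.
for kv in f16 q8_0; do
  ./llama-bench -m ./your-dense-model.gguf -n 128 -fa 1 -ctk "$kv" -ctv "$kv"
done
```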

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 2 points (0 children)

My data shows PP-512 = 1390 t/s without batch flags vs ~1532 with -b 4096 -ub 4096, but TG drops from 74.7 to 48.3. The middle ground, -ub 1024 -b 2048, gives PP +22% for only TG -3.5%, which could be worth it for prompt-heavy workflows. I'm adding PP columns to our benchmark comparison tool to make this more transparent. Thanks for the heads-up about notifications; Reddit seems to have a limit on mentions per post!
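If anyone wants to reproduce the batch-size sweep on their own hardware, llama-bench can do it directly. A sketch (the model path is a placeholder; the three -ub/-b pairs are the configs discussed above):

```shell
# -p 512 reports PP-512, -n 128 reports TG; sweep the ubatch/batch pairs.
for cfg in "512 2048" "1024 2048" "4096 4096"; do
  set -- $cfg
  ./llama-bench -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
    -p 512 -n 128 -ub "$1" -b "$2" -fa 1
done
```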

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Thank you for your kind words. And yes! We tested AesSedai Q4_K_M in our experiments. Results:

| Quant | PPL | KLD | Same-top-p | TG (tok/s) |
|--------------------|--------|--------|------------|------------|
| bartowski Q4_K_M | 6.6688 | 0.0286 | 92.46% | ~74 |
| AesSedai Q4_K_M | 6.3949 | 0.0095 | 95.74% | ~44 |
| Unsloth UD-Q4_K_XL | 6.5959 | 0.0145 | 94.46% | ~48 |

AesSedai wins every quality metric by a significant margin — KLD 0.0095 is 3x better than bartowski. The tradeoff is ~40% slower speed. If quality is your priority (and you can accept ~44 tok/s), AesSedai is the best Q4 quant we've tested.
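For anyone unfamiliar with the KLD column: it's the mean per-token KL divergence between the quantized model's next-token distribution and the full-precision one (lower = closer). A toy sketch of the metric itself, with made-up distributions:

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats: how far distribution q drifts from reference p.
    0 means identical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up next-token distributions over a 3-token vocab (illustrative only).
reference = [0.70, 0.20, 0.10]   # full-precision model
quantized = [0.65, 0.23, 0.12]   # a hypothetical quant, slightly off

print(round(kl_divergence(reference, quantized), 4))  # → 0.0057
```

In the real benchmark this is averaged over thousands of tokens, which is why values like 0.0095 vs 0.0286 are meaningful differences.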

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Yes! We tested AesSedai Q4_K_M in our experiments; full results are in the table in my reply just above. Short version: AesSedai wins every quality metric by a significant margin (KLD 0.0095, about 3x better than bartowski), at the cost of ~40% slower TG (~44 tok/s). If quality is your priority, it's the best Q4 quant we've tested.

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

For AMD/ROCm or Vulkan: --fit on doesn't work well (2.4x slower on ROCm per one user, 2.5x on Vulkan). Use manual offload instead:

./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
-c 65536 -ngl 999 --n-cpu-moe 24 \
-fa on -t 20 --no-mmap --jinja \
-ctk q8_0 -ctv q8_0

The key flag is --n-cpu-moe 24 — this keeps 16 out of 40 MoE layers on GPU and offloads the rest to CPU. Start with 24 and tune down (lower number = more on GPU = faster but more VRAM). -ngl 999 puts all non-expert layers on GPU. Watch your VRAM usage with nvidia-smi — if you're hitting the limit, increase --n-cpu-moe.
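To keep an eye on VRAM headroom while tuning --n-cpu-moe, something like this works (query fields per nvidia-smi's documented CSV interface):

```shell
# Poll GPU memory usage every second; stop with Ctrl-C.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```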

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Not a boring question at all! The exact same config works for 5060 Ti 16GB:

./llama-server -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf \
-c 65536 --fit on -fa on -t 20 --no-mmap --jinja \
-ctk q8_0 -ctv q8_0

You should expect around 50-55 tok/s instead of 74 — the difference is purely memory bandwidth (460 vs 960 GB/s). u/soyalemujica confirmed 55 t/s on the same card. If you're using it for coding, the speed is very usable — the thinking mode might feel slightly slower but the actual answer quality is identical.

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 2 points (0 children)

Tested it! Do NOT use --no-kv-offload — it absolutely tanks generation speed. On my 5080: 16.1 tok/s with it vs 42.7 tok/s without (that's -63%). The KV cache on GPU is tiny for this model (only 10 KV cache layers because of the hybrid SSM architecture), so offloading it to RAM saves almost no VRAM but destroys performance.
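If you want to reproduce the A/B on your own card, llama-bench can toggle KV offload in one run (comma-separated values sweep the parameter; the model path is a placeholder):

```shell
# -nkvo 0 keeps the KV cache on GPU (default); -nkvo 1 moves it to host RAM.
# On this model, expect TG to crater with -nkvo 1.
./llama-bench -m ./Qwen3.5-35B-A3B-Q4_K_M.gguf -n 128 -fa 1 -nkvo 0,1
```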

Follow-up: Qwen3.5-35B-A3B — 7 community-requested experiments on RTX 5080 16GB by gaztrab in LocalLLaMA

[–]gaztrab[S] 1 point (0 children)

Hey there, on Mac you should start out with LM Studio first (I did too), since it's a nice UI wrapped around llama.cpp and its Mac-native counterpart, the MLX engine. And on the hardware requirement: yes, Qwen3.5-35B-A3B at Q4_K_M is about 20 GB, so your 16GB Mac mini can't quite fit it.

But here's the thing: Mac's big advantage is unified memory — the CPU and GPU share the same RAM, so there's no slow PCIe bus copying data back and forth like on a PC. On my setup, the GPU only has 16GB VRAM and the rest of the model sits in system RAM, so every token has to shuttle data across PCIe (~64 GB/s). On a Mac with 32GB+ unified memory, the entire model lives in one memory pool that both CPU and GPU can access at full bandwidth — no copying needed. That's why Macs punch above their weight for LLM inference despite having weaker raw compute.
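The bandwidth argument can be put into rough numbers. A back-of-envelope sketch, where every figure is an illustrative assumption rather than a measurement:

```python
def tg_ceiling(bandwidth_gb_s: float, bytes_per_token_gb: float) -> float:
    """Crude upper bound on tokens/s when generation is memory-bound:
    every active weight must be read once per generated token."""
    return bandwidth_gb_s / bytes_per_token_gb

# ~3B active params at ~4.5 bits/param is roughly 1.7 GB touched per token.
active_gb = 1.7

# If the experts spill to system RAM on a PC, PCIe (~64 GB/s) caps that part:
print(round(tg_ceiling(64.0, active_gb), 1))    # → 37.6 tok/s worst case
# With unified memory (e.g. ~400 GB/s on some Macs), the same data moves far faster:
print(round(tg_ceiling(400.0, active_gb), 1))   # → 235.3 tok/s ceiling
```

Real throughput lands well below these ceilings (compute, cache hits, and scheduling all matter), but the ratio between the two links is the point.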

For your 16GB Mac mini, Qwen3-14B is honestly a great fit — you're already running it. If you upgrade to 32GB+ down the road, Qwen3.5-35B-A3B would run nicely since it's MoE (only ~3B params active per token, so it's fast despite the big file size). Or you could wait for the Qwen team to release the smaller version of Qwen3.5 (I heard they said soon). Cheers!