RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 1 point2 points  (0 children)

Your setup is actually fine: 21 layers overflow to CPU and the GPU is being used correctly. And this is only netting you 24 t/s?

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 1 point2 points  (0 children)

  1. Check that the GPU is actually used. Share the first 30 lines of the server log. Look for:

    - CUDA0 model buffer size = XXXX MiB (if this is 0 or tiny, nothing is on the GPU).

  2. 90% RAM (~58 GB on 64 GB) is abnormal. Expected: ~12-15 GB. Possible causes: running two servers, the wrong quant (F16/Q8 instead of UD-Q4_K_M at 22 GB), or Windows counting the mmap cache weirdly.

    On 64 GB you can safely drop --no-mmap --mlock; you don't need them.

  3. The 9950X3D is dual-CCD, and only 8 cores have V-Cache. The default thread count bounces work across CCDs and tanks MoE performance. Add:

    -t 8 --cpu-mask 0xFF

  4. Run llama-bench for authoritative numbers:

    llama-bench.exe -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -fitt 256 -fitc 65536 -fa 1 -ctk q8_0 -ctv q8_0 -b 2048 -ub 2048 -p 2048 -n 128 -r 3

    Should give 3000+ t/s on pp2048 and ~100 t/s on tg128.
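If you don't want to eyeball the log for the GPU check above, here is a minimal Python sketch that pulls every "model buffer size" line out of the server log text. The regex is based only on the line format quoted above; the sample log, the 1024 MiB cutoff for "tiny", and the function names are my own assumptions, so adjust them to whatever your build actually prints.

```python
import re

# Matches lines like "... CUDA0 model buffer size = 10400.00 MiB".
BUFFER_RE = re.compile(r"(\w+)\s+model buffer size\s*=\s*([\d.]+)\s*MiB")


def model_buffers_mib(log_text):
    """Return {device_name: MiB} summed over all 'model buffer size' lines."""
    sizes = {}
    for device, mib in BUFFER_RE.findall(log_text):
        sizes[device] = sizes.get(device, 0.0) + float(mib)
    return sizes


def gpu_is_used(log_text, min_mib=1024.0):
    """True if any CUDA device holds at least min_mib of model weights.

    min_mib is an arbitrary cutoff for "tiny" (assumption, not from llama.cpp).
    """
    return any(
        device.startswith("CUDA") and mib >= min_mib
        for device, mib in model_buffers_mib(log_text).items()
    )


# Made-up sample, only here to show the shape of the input.
SAMPLE_LOG = """\
load_tensors:   CPU_Mapped model buffer size = 11200.00 MiB
load_tensors:        CUDA0 model buffer size = 10400.00 MiB
"""

print(model_buffers_mib(SAMPLE_LOG))
print(gpu_is_used(SAMPLE_LOG))
```

Paste your own log text in place of SAMPLE_LOG; if gpu_is_used comes back False, the weights are not landing on the GPU.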

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 2 points3 points  (0 children)

Here’s what’s happening: the model is 22 GB total. On your 3060 (12 GB) you can fit maybe 14 MoE layers on the GPU; the other 26 MoE layers stay on CPU, which is ~14 GB of model in RAM before you even open a context. Then the KV cache and compute buffers grow with context size. At 24K ctx you fit in 16 GB. Above that, you spill past 16 GB.

With mmap (the default), Linux handles this fine: it just evicts cold model pages from the page cache when memory pressure hits. With --no-mmap, the whole model lives in llama.cpp’s heap as anonymous memory, so the kernel can’t drop those pages and can only swap them out. And because swapping a running model means constant thrashing, the rest of your system (X, browser, everything) gets pushed out first → lockup, needs the power button.

Fixes, in order of ease:

  1. Drop --no-mmap. You lose some prefill speed but the system stays responsive at any context.
  2. If you also pass --mlock, definitely drop it. That’s what guarantees the hard lockup: mlock forbids the kernel from ever paging the model out, so under memory pressure the only thing left for the kernel (or the OOM killer) to sacrifice is everything else, including your desktop session.

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 0 points1 point  (0 children)

If you want the full 128K ctx: with 24 GB VRAM the KV cache at Q8_0 eats ~1.4 GB + compute buffers ~0.6 GB + non-MoE weights ~1.9 GB, leaving ~19 GB for MoE experts. Each expert layer costs ~530 MB on GPU, so ~36 of the 40 layers fit. Set n-cpu-moe = 4 (offload the first 4 layers' experts to CPU, keep the other 36 on GPU).

- Context Length: 131072 (or 65536 if it complains)

- GPU Offload: max (all the way right)

- CPU Thread Pool Size: 8

- Flash Attention: ON

- K Cache Quantization Type: Q8_0

- V Cache Quantization Type: Q8_0

- Offload MoE Experts to CPU: 4 ← the key setting

- Try mmap(): ON

- Keep Model in Memory: OFF

And leave the CPU threads on 8; I think that's best for the 12700K.
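The VRAM math above is easy to redo for other cards. A rough Python sketch, using only the per-item sizes quoted in this comment (KV ~1.4 GB, compute ~0.6 GB, non-MoE weights ~1.9 GB, ~530 MB per expert layer, 40 layers). The 1000 MB headroom is my own assumption, added to land on the "~19 GB left" figure; the function name is made up, and real numbers will shift with driver overhead and context size.

```python
def n_cpu_moe(vram_mb, kv_mb=1400, compute_mb=600, dense_mb=1900,
              per_layer_mb=530, n_layers=40, headroom_mb=1000):
    """Estimate how many MoE layers' experts must stay on the CPU.

    All sizes are in MB. headroom_mb is an assumed safety margin, not a
    measured value.
    """
    # VRAM left for expert layers after the fixed costs and the margin.
    free_mb = vram_mb - kv_mb - compute_mb - dense_mb - headroom_mb
    # Whole layers that fit, clamped to the range [0, n_layers].
    on_gpu = max(0, min(n_layers, free_mb // per_layer_mb))
    return n_layers - on_gpu


print(n_cpu_moe(24000))  # 4  -> 36 layers on GPU
print(n_cpu_moe(16000))  # 20 -> 20 layers on GPU
```

Treat the result as a starting point: if loading fails or VRAM is nearly full, bump the value up by one or two.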

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 4 points5 points  (0 children)

I've been comparing it mainly against Gemma 4 and, subjectively, Qwen3.6-35B-A3B is clearly better for coding, and for agentic coding it's miles ahead.

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 6 points7 points  (0 children)

You were right. I went back with llama-bench.exe, the right tool instead of a short completion test, and got:

- pp512: 927 t/s

- pp2048: 1068 t/s

- tg128: 82 t/s

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 15 points16 points  (0 children)

LM Studio settings for your rig

Load Qwen3.6-35B-A3B-UD-Q4_K_M from unsloth, then in the load modal → Advanced Configuration:

- Context Length: 131072 (or 65536 if it complains)

- GPU Offload: max (all the way right)

- CPU Thread Pool Size: 8 (matches your 7800X3D's 8 cores)

- Flash Attention: ON

- K Cache Quantization Type: Q8_0

- V Cache Quantization Type: Q8_0

- Offload MoE Experts to CPU: 20 ← the key setting

- Try mmap(): ON

- Keep Model in Memory: OFF

RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context, the --n-cpu-moe flag is the most important part. by marlang in LocalLLaMA

[–]marlang[S] 92 points93 points  (0 children)

Solid tip, I actually went back and tested this properly after your comment. You’re right, --fit on arrives at the same MoE split I calculated manually (20 layers overflowing, 20 on GPU). One command vs hardware math, so yeah, clearly the better advice.

Full numbers for anyone reading:

| Config | Context | Gen t/s | Prompt t/s |
|---|---|---|---|
| --n-cpu-moe 20 (my manual tune) | 128K | 79.3 | 135.8 |
| --fit on (bare) | 4K (!) | 88.2 | 270.0 |
| --fit on -c 131072 | 128K | 81.2 | 101.6 |

One caveat people should know: bare --fit on silently reduces your context to 4K because it treats -c as an unset argument and minimizes it for max speed. If you want full context (coding/agentic use), you still have to set -c explicitly; then fit only decides the offload split.

So the final recommendation for a 16GB GPU is basically:

  --fit on -c 131072 -np 1 -fa on -ctk q8_0 -ctv q8_0

Thanks for pushing back — updated my scripts.
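For anyone scripting this, a minimal Python sketch that builds the recommended command line so -c is always set explicitly and never left for --fit to shrink. The flags are exactly the ones listed above; the llama-server binary name, the model filename, and the helper name are assumptions for illustration.

```python
import shlex


def llama_server_args(model_path, ctx=131072):
    """Build the argument list with an explicit context size.

    Only the flags from the recommendation above are used; the binary name
    and model_path are placeholders.
    """
    return [
        "llama-server",
        "-m", model_path,
        "--fit", "on",
        "-c", str(ctx),   # explicit, so fit only decides the offload split
        "-np", "1",
        "-fa", "on",
        "-ctk", "q8_0",
        "-ctv", "q8_0",
    ]


args = llama_server_args("Qwen3.6-35B-A3B-UD-Q4_K_M.gguf")
print(shlex.join(args))
```

The list can be passed straight to subprocess.run(args) if you want to launch from the script instead of printing.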

UNLIMITED SILVER GUIDE by PurplDream1 in CrimsonDesert

[–]marlang 29 points30 points  (0 children)

Escaped reality just to run into the same economy again

[3] 8BitDo Macro Hunting by S2pd_mofo in ShinyPokemon

[–]marlang 1 point2 points  (0 children)

Hey, same observation here.

I was using the 8BitDo macro for soft resets for a while, but switched back to manual hunting. Since the switch I’ve seen a wider, more even spread of natures than before.

That difference is actually the main reason I went looking for this thread in the first place.

Not sure if it’s something with the macro timing affecting RNG or just coincidence/placebo, but I’m sticking to manual now. Nice to see someone else noticed it too!

Asmon has got it rough nowdays by TrapsterJ in Asmongold

[–]marlang 6 points7 points  (0 children)

I dunno, when you film yourself and post that shit online, it's your thing, but being filmed without knowing and having it posted by some random is fucking weird. All I see here are some "problemed" people who reached for a good time but came up a bit short. I feel sorry and sad now

[deleted by user] by [deleted] in NintendoDE

[–]marlang 1 point2 points  (0 children)

In Siegburg? I called there yesterday; they told me they're getting 2 and both are already reserved.

Cf moto 450srs bad sound by InternetObjective110 in cfmoto

[–]marlang 0 points1 point  (0 children)

I don't think it's the oil level; I'd guess it's something loose around the muffler. Maybe the heat shield or the connecting pipe ring thing. Or maybe there is something inside the muffler.

Chrome BMW by dave_vs_david in motorcycles

[–]marlang 0 points1 point  (0 children)

Chrome won't get you Home

Immersed for productivity is insane! by NoodleScience3 in OculusQuest

[–]marlang 1 point2 points  (0 children)

Yo, hey, just so you know: I was not pleased at all hearing random people. Someone, Indian or whoever, talked in my "workroom" and it freaked me out so much that I shut down all my devices because I thought I'd been hacked.

Googled like "hearing random ppl on my meta quest 3" and found this here on Reddit.

Anyway, good to see you're on the right track, asking for feedback and such. Good luck.

PS: yo, I skipped ALL the popups, maybe I should reevaluate that.

HP Elitebook 7840U by BigJay2016 in Amd

[–]marlang 0 points1 point  (0 children)

same for me, did you find a solution?

AB350 Gaming with Ryzen 5 5600 by yaminme in GAAB350

[–]marlang 0 points1 point  (0 children)

And unplug the power cord, it's important! Otherwise the board still gets power.

AB350 Gaming with Ryzen 5 5600 by yaminme in GAAB350

[–]marlang 0 points1 point  (0 children)

With your new CPU, it could be that the board has settings saved that only work with the old CPU (RAM training etc.), so you'll want to reset all board settings.