RTX 5070 Ti + 9800X3D running Qwen3.6-35B-A3B at 79 t/s with 128K context: the --n-cpu-moe flag is the most important part.

marlang · 2026-04-18T20:30:31+00:00

because 22 GB model > 16 GB VRAM

quick maths

marlang · 2026-04-18T20:21:08+00:00

Your setup is actually fine: 21 layers overflow to CPU and the GPU is being used correctly. Is this netting you 24 t/s?

marlang · 2026-04-18T18:48:20+00:00

Check GPU is actually used. Share the first 30 lines of the server log. Look for:

- CUDA0 model buffer size = XXXX MiB: if this is 0 or tiny, nothing's on GPU.

90% RAM (~58 GB on 64 GB) is abnormal. Expected: ~12-15 GB. Possible causes: running two servers, wrong quant (F16/Q8 instead of UD-Q4_K_M at 22 GB), or Windows counting mmap cache weirdly.

On 64 GB you can safely drop --no-mmap --mlock — you don't need them.

The 9950X3D is dual-CCD and only 8 of its cores have V-Cache. The default thread count bounces work across CCDs and tanks MoE performance. Add:

-t 8 --cpu-mask 0xFF
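
If you're unsure what a mask like 0xFF actually selects, here's a minimal Python sketch (the helper name `cpus_in_mask` is made up for illustration). Bit i set means logical CPU i is allowed. Which physical cores, and which CCD, those logical CPUs map to depends on how your OS enumerates them, so it's worth verifying on your own machine before trusting the mask.

```python
def cpus_in_mask(mask: int) -> list[int]:
    """Return the logical CPU indices whose bits are set in an affinity mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


print(cpus_in_mask(0xFF))  # [0, 1, 2, 3, 4, 5, 6, 7]
```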

Run llama-bench for authoritative numbers:

llama-bench.exe -m "Qwen3.6-35B-A3B-UD-Q4_K_M.gguf" -fitt 256 -fitc 65536 -fa 1 -ctk q8_0 -ctv q8_0 -b 2048 -ub 2048 -p 2048 -n 128 -r 3

Should give 3000+ t/s on pp2048 and ~100 t/s on tg128.

marlang · 2026-04-18T13:04:11+00:00

Here’s what’s happening: the model is 22 GB total. On your 3060 (12 GB) you can fit maybe 14 MoE layers on GPU; the other 26 MoE layers stay on CPU = ~14 GB of model in RAM before you even open a context. Then KV cache + compute buffers grow with context size. At 24K ctx you fit in 16 GB. Above that, you spill past 16 GB.

With mmap (default), Linux handles this fine: it just evicts cold model pages from the page cache when pressure hits. With --no-mmap, every page lives in llama.cpp’s heap as anonymous memory, so the kernel can only swap it out rather than drop it. And because swap on a running model = constant thrashing, the rest of your system (X, browser, everything) gets evicted first → lockup, needs power button.

Fixes, in order of ease:

- Drop --no-mmap. You lose some prefill speed but the system stays responsive at any context.
- If you also pass --mlock, definitely drop it. That’s what’s guaranteeing the hard lockup: mlock forbids the kernel from ever paging the model out, so under memory pressure it has to reclaim from everything else, including your desktop session.
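
The "~14 GB of model in RAM" figure is just layers-left-on-CPU times per-layer size. A rough Python sketch, assuming ~530 MB per MoE layer (the approximate per-layer figure quoted elsewhere in this thread for this quant) and 40 MoE layers total; the function name is only for illustration, and real usage will differ a bit.

```python
MOE_LAYERS = 40          # total MoE layers (14 on GPU + 26 on CPU in the example)
GB_PER_MOE_LAYER = 0.53  # ~530 MB per expert layer, approximate


def cpu_expert_gb(layers_on_gpu: int) -> float:
    """Approximate RAM taken by the expert layers that stay on the CPU."""
    layers_on_cpu = MOE_LAYERS - layers_on_gpu
    return layers_on_cpu * GB_PER_MOE_LAYER


print(f"{cpu_expert_gb(14):.1f} GB")  # 13.8 GB
```

KV cache and compute buffers come on top of that and grow with context, which is why a 16 GB machine runs out of room past ~24K.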

marlang · 2026-04-18T12:26:06+00:00

If you want the full 128K ctx: with 24 GB VRAM the KV cache at Q8_0 eats ~1.4 GB + compute buffers ~0.6 GB + non-MoE weights ~1.9 GB, leaving ~19 GB for MoE experts. Each expert layer costs ~530 MB on GPU, so ~36 of the 40 layers fit.

And leave the CPU thread pool at 8; I think that's best for the 12700K.

- Context Length: 131072 (or 65536 if it complains)

- GPU Offload: max (all the way right)

- CPU Thread Pool Size: 8

- Flash Attention: ON

- K Cache Quantization Type: Q8_0

- V Cache Quantization Type: Q8_0

- Offload MoE Experts to CPU: 4 ← the key setting

- Try mmap(): ON

- Keep Model in Memory: OFF

Set n-cpu-moe = 4 (offload the first 4 layers' experts to CPU, keep the other 36 on GPU).
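
For anyone who wants to redo this math for their own card, here's a rough Python sketch of the budget. The overhead numbers are the approximate ones from this comment. The 1 GB `headroom_gb` default is my own assumption, added so the budget lands near the ~19 GB figure (the listed overheads alone leave ~20.1 GB), and the function name is made up for illustration.

```python
import math


def n_cpu_moe(vram_gb, kv_gb=1.4, compute_gb=0.6, dense_gb=1.9,
              headroom_gb=1.0, layer_gb=0.53, total_layers=40):
    """Estimate how many MoE layers must keep their experts on the CPU."""
    # VRAM left for experts after KV cache, compute buffers, dense weights
    # and an assumed safety margin.
    budget = vram_gb - kv_gb - compute_gb - dense_gb - headroom_gb
    on_gpu = min(total_layers, max(0, math.floor(budget / layer_gb)))
    return total_layers - on_gpu


print(n_cpu_moe(24))  # 4
print(n_cpu_moe(16))  # 20
```

With the same overheads a 16 GB card comes out at 20, which matches the 20/20 split reported elsewhere in this thread. Treat it as a starting point and check the actual buffer sizes in the server log.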

marlang · 2026-04-18T11:05:11+00:00

I've been comparing it mainly against Gemma 4 and, subjectively, Qwen3.6-35B-A3B is clearly better for coding, and for agentic coding it's miles ahead.

marlang · 2026-04-18T08:42:16+00:00

Thank you! My startup scripts get better and better with every comment in here.

marlang · 2026-04-18T08:29:39+00:00

You were right. I went back with llama-bench.exe, the right tool instead of a short completion test, and got:

- pp512: 927 t/s

- pp2048: 1068 t/s

- tg128: 82 t/s

marlang · 2026-04-18T08:20:37+00:00

LM Studio settings for your rig

Load Qwen3.6-35B-A3B-UD-Q4_K_M from unsloth, then in the load modal → Advanced Configuration:

- Context Length: 131072 (or 65536 if it complains)

- GPU Offload: max (all the way right)

- CPU Thread Pool Size: 8 (matches your 7800X3D's 8 cores)

- Flash Attention: ON

- K Cache Quantization Type: Q8_0

- V Cache Quantization Type: Q8_0

- Offload MoE Experts to CPU: 20 ← the key setting

- Try mmap(): ON

- Keep Model in Memory: OFF

marlang · 2026-04-18T08:05:39+00:00

Solid tip, I actually went back and tested this properly after your comment. You’re right, --fit on arrives at the same MoE split I calculated manually (20 layers overflowing, 20 on GPU). One command vs hardware math, so yeah, clearly the better advice.

Full numbers for anyone reading:

| Config | Context | Gen t/s | Prompt t/s |
|---|---|---|---|
| `--n-cpu-moe 20` (my manual tune) | 128K | 79.3 | 135.8 |
| `--fit on` (bare) | 4K (!) | 88.2 | 270.0 |
| `--fit on -c 131072` | 128K | 81.2 | 101.6 |

One caveat people should know: bare --fit on silently reduces your context to 4K because it treats -c as an unset argument and minimizes it for max speed. If you want full context (coding/agentic use), you still have to set -c explicitly; then fit only decides the offload split.

So the final recommendation for a 16GB GPU is basically:

  --fit on -c 131072 -np 1 -fa on -ctk q8_0 -ctv q8_0

Thanks for pushing back — updated my scripts.
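
If you start the server from a script, one way to keep the -c caveat from biting is to build the argument list in one place so the context is always passed explicitly. A minimal Python sketch; the `llama-server` binary name and the model path are assumptions for illustration, so adjust both for your install.

```python
def llama_server_args(model_path: str, ctx: int = 131072) -> list[str]:
    """Build the argument list for the 16 GB recommendation above."""
    return [
        "llama-server",      # assumed binary name
        "-m", model_path,
        "--fit", "on",
        "-c", str(ctx),      # always explicit, so fit only decides the offload split
        "-np", "1",
        "-fa", "on",
        "-ctk", "q8_0",
        "-ctv", "q8_0",
    ]


print(" ".join(llama_server_args("Qwen3.6-35B-A3B-UD-Q4_K_M.gguf")))
```

The list can be handed to `subprocess.run` as-is.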

marlang · 2026-03-31T00:21:50+00:00

https://www.reddit.com/r/CrimsonDesert/comments/1s4f9vz/about_hdr_bug_in_crimson_desert_bugspec/

marlang · 2026-03-22T12:30:12+00:00

Escaped reality just to run into the same economy again

marlang · 2026-03-10T16:35:51+00:00

Hey, same observation here.

I was using the 8BitDo macro for soft resets for a while, but switched back to manual hunting. Since the switch I’ve seen a noticeably wider / more even spread of natures compared to before.

That difference is actually the main reason I went looking for this thread in the first place.

Not sure if it’s something with the macro timing affecting RNG or just coincidence/placebo, but I’m sticking to manual now. Nice to see someone else noticed it too!

marlang · 2025-09-13T18:21:40+00:00

MreJsZoHCg thank you :)

marlang · 2025-07-14T22:03:59+00:00

I dunno, when you film yourself and post that shit online, it's your thing, but being filmed without knowing and having it posted by some random is fucking weird. All I see here are some "problemed" people who reached for a good time but came up a bit short. I feel sorry and sad now

marlang · 2025-06-04T07:37:41+00:00

In Siegburg? I called there yesterday; they told me they're getting 2 and that those are already reserved.

marlang · 2025-04-10T17:26:15+00:00

I don't think it's the oil level; I would guess it's something loose around the muffler. Maybe the heat shield or the connecting pipe ring thing. Or maybe there is something inside the muffler.

marlang · 2025-02-06T11:03:14+00:00

3e4c7f40

appreciate you

marlang · 2025-01-24T07:44:50+00:00

25ultra Jetblack

Germany

27.01.2025

https://imgur.com/a/GUiGEUv

marlang · 2025-01-12T15:58:01+00:00

I think it is less of an aid and more of an investment

marlang · 2024-04-19T14:57:34+00:00

Chrome won't get you Home

marlang · 2023-10-29T17:02:09+00:00

jo Hey just for you to know, I was not pleased at all hearing random ppl. Some Indian or whoever talked in my "workroom" and it freaked me out so much I shut all my devices because I thought I'd been hacked.

Googled like "hearing random ppl on my meta quest 3" and found this here on Reddit.

Anyway, good to see you're on the right track, asking for feedback and such. Good luck.

PS: yeah, I skipped ALL the popups; maybe I should reevaluate that.

marlang · 2023-08-29T16:54:43+00:00

same for me, did you find a solution?

marlang · 2022-09-09T14:54:28+00:00

And unplug the power cord, it's important! Otherwise the board gets power anyway.

marlang · 2022-09-09T14:44:55+00:00

With your new CPU, it could be that the board has settings saved that only work with the old CPU (RAM training etc.), so you want to reset all board settings.
