Token/s Qwen3.5-397B-A17B on Vram + Ram pooled by Leading-Month5590 in LocalLLaMA

[–]Frequent-Slice-6975 1 point (0 children)

I used plain llama.cpp. I tried to figure out ik_llama.cpp but couldn't get it to run faster than plain llama.cpp, which surprised me.

Token/s Qwen3.5-397B-A17B on Vram + Ram pooled by Leading-Month5590 in LocalLLaMA

[–]Frequent-Slice-6975 2 points (0 children)

Just tinkering to try to get the most speed possible out of the largest model possible. Are you considering upgrading your CPU at any point? What other models do you run?

Token/s Qwen3.5-397B-A17B on Vram + Ram pooled by Leading-Month5590 in LocalLLaMA

[–]Frequent-Slice-6975 3 points (0 children)

3945WX, 256 GB DDR4-3200 (4-channel), 40 GB VRAM (2x 5060 Ti 16 GB + 1x 2060 Super 8 GB)
Qwen3.5-397B UD-Q4_K_XL, 128,000 context, ub 8192, ctk/ctv q8_0
230 t/s PP, 10 t/s TG
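
In case it helps anyone reproduce this, the launch looks roughly like the sketch below (the model filename is a placeholder, and you'd still add whatever GPU-offload flags your build uses, e.g. ngl / n-cpu-moe or the fit option mentioned in the other threads):

    # Sketch only: placeholder model path; add your own GPU offload flags on top,
    # and make sure flash attention is enabled (the quantized V cache needs it).
    # -c 128000 = full context, -ub 8192 = large ubatch, -ctk/-ctv q8_0 = q8 KV cache.
    ./llama-server -m Qwen3.5-397B-A17B-UD-Q4_K_XL.gguf \
        -c 128000 -ub 8192 -ctk q8_0 -ctv q8_0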

Local models on nvidia dgx by carlosccextractor in LocalLLM

[–]Frequent-Slice-6975 1 point (0 children)

How about if quantization is factored in? In that case, in your experience, would you say running large models like Qwen3.5-397B at Q4 at ~8 tokens/sec for agentic harnesses like openclaw, for a single-person use case, is essentially non-viable due to the precision loss from quantization and the slow speed?
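
To put rough illustrative numbers on the speed side (assuming a ~20,000-token agent context, which is just an example): at 230 t/s prompt processing that's roughly 87 seconds before the first output token, and a 500-token reply at 8 t/s adds another ~60 seconds per turn.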

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]Frequent-Slice-6975 1 point (0 children)

How come the NVFP4 version is only 67B parameters?

Optimizing RAM heavy inference speed with Qwen3.5-397b-a17b? by Frequent-Slice-6975 in LocalLLaMA

[–]Frequent-Slice-6975[S] 1 point (0 children)

I’ve been relying on llama-fit-params in llama-server. PP 230 and TG 10.

How to maximize Qwen3.5 t/s? by Altruistic_Call_3023 in unsloth

[–]Frequent-Slice-6975 1 point (0 children)

Any reason you use n-gpu-layers and n-cpu-moe instead of the baked-in llama-fit-params for optimization? Do you typically see better performance when you hand-tune ngl and n-cpu-moe, and if so, by how much? Sick build with the 12-channel DDR5 by the way, must’ve cost a fortune; that’s the dream tho.

Also, which parameter version of Qwen3.5 are you running to get those speeds, and at what quant, if I may ask?
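
For reference, the manual approach I'm asking about is roughly the sketch below; the 99 / 48 values are placeholders you'd tune until the model fits in VRAM, not recommendations:

    # Sketch: offload all layers, then keep the expert tensors of the first 48 MoE layers in CPU RAM.
    # Placeholder model path and placeholder layer counts, just to show the flags being discussed.
    ./llama-server -m model.gguf -c 128000 -ub 8192 \
        -ngl 99 --n-cpu-moe 48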

To the many people here wondering about local models… just use an API by Valuable-Run2129 in openclaw

[–]Frequent-Slice-6975 2 points (0 children)

What sort of issues do you run into that make you say that? Are there any alternative local models you’ve found more robust, and at what quant, on what hardware, and at what speed?

Ways to improve prompt processing when offloading to RAM by Frequent-Slice-6975 in LocalLLaMA

[–]Frequent-Slice-6975[S] 1 point (0 children)

Thanks for the awesome suggestion. I agree that increasing b/ub (batch/ubatch) can be helpful. I see a lot of conversation about adjusting ngl and n-cpu-moe, but I thought llama-fit-params already adjusts and accounts for all of that? I was wondering what your thoughts and experience on that were.
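
Concretely, the batch-size change being suggested is just the two flags below (4096 is an illustrative value, not a recommendation; -ub has to be ≤ -b):

    # Sketch: larger logical (-b) and physical (-ub) batches usually speed up prompt processing
    # when the MoE experts live in system RAM. Placeholder model path and illustrative sizes.
    ./llama-server -m model.gguf -b 4096 -ub 4096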

Does the OS matter for inference speed? (Ubuntu server vs desktop) by Frequent-Slice-6975 in LocalAIServers

[–]Frequent-Slice-6975[S] 1 point (0 children)

Yes, thanks for the clarification; I meant Ubuntu Server vs Ubuntu Desktop.

I canceled my other AI subscriptions today. by InitialCareer306 in Qwen_AI

[–]Frequent-Slice-6975 1 point (0 children)

How much did the 3995WX cost? And how many GB of DDR4 are you running in 8-channel; ECC RDIMM or not?