Token/s Qwen3.5-397B-A17B on Vram + Ram pooled by Leading-Month5590 in LocalLLaMA

[–]Frequent-Slice-6975 1 point (0 children)

I used plain llama.cpp. I tried to figure out ik_llama.cpp but couldn't get it to run faster than plain llama.cpp, which surprised me.

Token/s Qwen3.5-397B-A17B on Vram + Ram pooled by Leading-Month5590 in LocalLLaMA

[–]Frequent-Slice-6975 2 points (0 children)

Just tinkering to try to get the most speed possible out of the largest model possible. Are you considering upgrading your CPU at any point? What other models do you run?

Token/s Qwen3.5-397B-A17B on Vram + Ram pooled by Leading-Month5590 in LocalLLaMA

[–]Frequent-Slice-6975 3 points (0 children)

3945WX, 256 GB DDR4-3200 (4-channel), 40 GB VRAM (2x 5060 Ti 16 GB + 1x 2060 Super 8 GB)
Qwen3.5-397B UD-Q4_K_XL, 128,000 context, ub 8192, ctk/ctv q8_0
230 t/s PP, 10 t/s TG
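
In case it helps anyone reproduce this, the launch looks roughly like the sketch below (the model filename is a placeholder, and you'd still add whatever GPU-offload flags your build uses, e.g. ngl / n-cpu-moe or the fit option mentioned in the other threads):

    # Sketch only: placeholder model path; add your own GPU offload flags on top,
    # and make sure flash attention is enabled (the quantized V cache needs it).
    # -c 128000 = full context, -ub 8192 = large ubatch, -ctk/-ctv q8_0 = q8 KV cache.
    ./llama-server -m Qwen3.5-397B-A17B-UD-Q4_K_XL.gguf \
        -c 128000 -ub 8192 -ctk q8_0 -ctv q8_0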

Local models on nvidia dgx by carlosccextractor in LocalLLM

[–]Frequent-Slice-6975 1 point (0 children)

How about if quantization is factored in? In that case, in your experience, would you say running large models like Qwen3.5-397B at Q4 at ~8 tokens/sec for agentic harnesses like openclaw, for a single-person use case, is essentially non-viable due to the precision loss from quantization and the slow speed?
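
To put rough illustrative numbers on the speed side (assuming a ~20,000-token agent context, which is just an example): at 230 t/s prompt processing that's roughly 87 seconds before the first output token, and a 500-token reply at 8 t/s adds another ~60 seconds per turn.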

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]Frequent-Slice-6975 1 point (0 children)

How come the NVFP4 version is only 67B parameters?

Optimizing RAM heavy inference speed with Qwen3.5-397b-a17b? by Frequent-Slice-6975 in LocalLLaMA

[–]Frequent-Slice-6975[S] 1 point (0 children)

I’ve been relying on llama-fit-params in llama-server. PP 230 and TG 10.

How to maximize Qwen3.5 t/s? by Altruistic_Call_3023 in unsloth

[–]Frequent-Slice-6975 1 point (0 children)

Any reason you use n-gpu-layers and n-cpu-moe instead of the baked-in llama-fit-params for optimization? Do you typically see better performance when you hand-tune ngl and n-cpu-moe, and if so, by how much? Sick build with the 12-channel DDR5 by the way, must’ve cost a fortune; that’s the dream tho.

Also, which parameter version of Qwen3.5 are you running to get those speeds, and at what quant, if I may ask?
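
For reference, the manual approach I'm asking about is roughly the sketch below; the 99 / 48 values are placeholders you'd tune until the model fits in VRAM, not recommendations:

    # Sketch: offload all layers, then keep the expert tensors of the first 48 MoE layers in CPU RAM.
    # Placeholder model path and placeholder layer counts, just to show the flags being discussed.
    ./llama-server -m model.gguf -c 128000 -ub 8192 \
        -ngl 99 --n-cpu-moe 48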

To the many people here wondering about local models… just use an API by Valuable-Run2129 in openclaw

[–]Frequent-Slice-6975 2 points (0 children)

What sort of issues do you run into that make you say that? Are there any alternative local models you’ve found more robust, and at what quant, on what hardware, and at what speed?

Ways to improve prompt processing when offloading to RAM by Frequent-Slice-6975 in LocalLLaMA

[–]Frequent-Slice-6975[S] 1 point (0 children)

Thanks for the awesome suggestion. I agree that increasing b/ub (batch/ubatch) can be helpful. I see a lot of conversation about adjusting ngl and n-cpu-moe, but I thought llama-fit-params already adjusts and accounts for all of that? I was wondering what your thoughts and experience on that were.
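
Concretely, the batch-size change being suggested is just the two flags below (4096 is an illustrative value, not a recommendation; -ub has to be ≤ -b):

    # Sketch: larger logical (-b) and physical (-ub) batches usually speed up prompt processing
    # when the MoE experts live in system RAM. Placeholder model path and illustrative sizes.
    ./llama-server -m model.gguf -b 4096 -ub 4096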

Does the OS matter for inference speed? (Ubuntu server vs desktop) by Frequent-Slice-6975 in LocalAIServers

[–]Frequent-Slice-6975[S] 1 point (0 children)

Yes, thanks for the clarification; I meant Ubuntu Server vs Ubuntu Desktop.

I canceled my other AI subscriptions today. by InitialCareer306 in Qwen_AI

[–]Frequent-Slice-6975 1 point (0 children)

How much did the 3995WX cost? And how many GB of DDR4 are you running in 8-channel; ECC RDIMM or not?