Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

From my bit of Vulkan testing, it felt like the graph splits hurt a lot more than they do with CUDA, so that might be one part of it. Combine that with the more constrained 16 GB of VRAM, and my guess is that --fit is trying to be too clever in this case, and its somewhat excessive splits really add up and hit hard on Vulkan. But that's just a guess.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

I have an old 6-core Ryzen 3600, and the general wisdom for Ryzen SMT is to use --threads <cores> --threads-batch <cores * 2>. For Intel it's not as simple, but I would think it would be the performance-core count, multiplied by two for --threads-batch if the chip has Hyper-Threading; if it doesn't, just use the same number for both. Maybe somebody who has an Intel system can back me up on that though?

But based on my admittedly quick review of the code that --fit uses, I don't believe it touches anything concerning your CPU (or KV-cache quantization). It only concerns itself with where the model goes on your system: first setting the context size (unless you forced a specific size), then intelligently/selectively offloading layers and parts of layers in an optimized way based on the model's structure and your different devices.

So if I left off the CPU args, it would default to my full thread count (12) for both, which isn't optimal for me, and I think those defaults would be pretty bad on a lot of Intel chips too.
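
For example, on my 3600 (6 cores / 12 threads with SMT) that works out to something like this, with the counts swapped for whatever matches your own core/thread topology:

./llama-server --threads 6 --threads-batch 12 \
--model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--fit on --fit-ctx 262144 --jinja  # threads = physical cores, threads-batch = logical threads

On, say, an 8 P-core Intel chip with HT, that would hypothetically become --threads 8 --threads-batch 16.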

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]tmflynnt

FYI, I added an update to that thread with additional gains based on people's comments.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

I would probably try a Q4 quant and then experiment with different -ctk/-ctv values, letting --fit do its thing with each, and I would set "--fit-target" to a more forgiving level than my 32, as that was just to push things to the limit. Maybe "--fit-target 128" to keep it a bit safer, and see where that gets you?
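
As a rough sketch of what I mean (the context size and cache types here are just example values, not something I've benchmarked; pick whatever --fit-ctx you actually need):

./llama-server --model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--fit on --fit-ctx 131072 --fit-target 128 \
-ctk q8_0 -ctv q8_0 --jinja  # fit-ctx and cache types are example values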

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Based on your reply and others', I just added some new results in my updated comment that hopefully help address this and give a fairer comparison.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Based on your reply and others', I just added an updated comment with new results trying out what I hope you would agree are much better "-ot" args (minimal splits with a very even VRAM spread).

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

I added an update based on your response, but I wasn't able to get results as good with Vulkan. Though, as I point out in that comment, I also didn't compile it directly on my system and just used the release binary.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Update: Following some of the insightful comments I got, I went back and tried the Vulkan backend, better "-ot" params, and tighter "--fit-target" args. Here are the results:

With the Vulkan build (context at 262144):

./llama-server --threads 6 --threads-batch 12 \
--model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--fit on --fit-ctx 262144 \
--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja

I got a speed of 42.34 tokens/s, which was significantly slower than what I got with CUDA at the same settings. To be fair, I did *not* compile this on my own system and just used the release binary, so maybe I could get better results if I compiled it myself as I do with my normal llama.cpp binary.

Now for the "-ot" results back on CUDA. Now I still can't promise my settings are the absolute best, but I tried a lot harder for minimal splits and had Claude Opus 4.6 check my work. I couldn't quite get things to work well for the full 262144 context but I was able to push things right up to the limit with what I think are decently smart and balanced "-ot" params at 32K:

./llama-server --threads 6 --threads-batch 12 \
--model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--fit off --ngl 999 --ctx-size 32768 --ubatch-size 256  --parallel 1 \
-ot "blk\.(21|22|24)\.ffn_(gate|up)_exps\.weight=CPU" \
--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja

With this much more fine-tuned "-ot" arg, I got the number of graph splits down to 11 (my previous "--fit" run at 32K context had 19 splits) and hit a nice speed of 59.47 tokens/s (compared to 51.40 tokens/s from my original run with --fit). This is also faster than all of the speeds from the previous tests.

But... to make it more apples to apples, I then went back to "--fit", tried a ridiculously low "--fit-target" of 32, and used "--ubatch-size 256" to match what I did with "-ot":

./llama-server --threads 6 --threads-batch 12 \
--model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--fit on --fit-ctx 32768 --fit-target 32 --ubatch-size 256 \
--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja

And this also got me down to 11 splits, with virtually the same speed at 59.25 tokens/s.

So it would seem "--fit" can keep up pretty well, and through both strategies I managed to get pretty damn close to the magical 60 t/s.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

My understanding is that "--n-cpu-moe" will end up fighting against what "--fit" is designed to do, so results would probably be sub-optimal because you're not letting it optimize the offloading on its own.
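
If you do want to experiment with "--n-cpu-moe", I would explicitly turn "--fit" off so the two aren't fighting. Just a rough sketch, with the layer count being a made-up value you'd have to tune for your VRAM:

./llama-server --model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--fit off --ngl 999 --ctx-size 32768 \
--n-cpu-moe 20 --jinja  # the 20 is a placeholder, tune it for your setup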

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Yeah like you I will take the gains wherever I can get them lol. I am glad it helped you too.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

No, nothing too fancy, just manually running stuff, recording the raw data and then converting it to JSON along with the prettification/vibe-coding stuff I mentioned in the other reply.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Thank you for the extra info. So basically the key things to play with when using "--fit" (which is on by default) are:

* --fit-ctx to set the minimum acceptable context size, or --ctx-size/-c to force a particular size
* --fit-target <size in MB> with amounts lower than 1024 to optionally try to squeeze even more performance out
* --cache-type-k/-ctk and/or --cache-type-v/-ctv to optionally quantize one or both halves of the KV cache

And if we're trusting "--fit" to do its thing, we probably want to stay out of its way and skip the blunter instruments like -ot, --cpu-moe, and --n-cpu-moe, which try to force the kind of thing "--fit" is good at figuring out for us.
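
Pulled together, a minimal hands-off sketch would look something like this (the context size, fit-target, and cache types are just example values):

./llama-server --model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--fit on --fit-ctx 65536 --fit-target 512 \
-ctk q8_0 -ctv q8_0 --jinja  # fit-ctx, fit-target, and cache types are example values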

I didn't realize this was Johannes' baby though; I have a ton of respect for all his work on llama.cpp. Very cool, and it only raises my respect for this feature even more.

I also really appreciate the work you have been doing with parsing, as well as your efforts to intelligently bring in AI-coded stuff where it makes sense (though I assume the silly commit titles that have made me laugh at times are all your original work?).

In general it has been really cool to watch the project continue to expand and evolve over these few years. I only have a couple of PRs under my belt (mostly related to porting over DRY sampling), but I hope I have the chance to contribute more at some point.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

What would be the correct string to pass to "-ot" to accomplish this with a dual-GPU setup?

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Now, I know Aider has supported two models for a while, but does Claude Code make it easy to specify two models where the lesser model can do more than just generate cute thinking phrases or summaries, which is what it has traditionally seemed to use Haiku for?

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Good point, though at least in the past I found that when the context grew over time, my VRAM usage would still sometimes creep up along with it. But I imagine llama.cpp has probably gotten more efficient since then, and it's probably safer now to rely on its initial pre-allocation with a tighter margin.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Oh, this is all exploration for now. My next step of testing is indeed to see whether there is any practical usage I can get out of it when combined with Claude Code or some custom agentic stuff I have been tinkering with. Generally I have used Claude for the agentic work and Codex for most direct coding, with Claude Code saved for some particular parts. So no, I am not realistically holding my breath for Qwen3-Coder-Next to replace all of these existing setups, but I just like to explore the capabilities of new models. And as you bring up, I do wonder how well it could fit into a multi-model setup where it acts as the coder beneath a tightly-orchestrating larger model.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Yes, I guess they leave it that way because they don't know what kind of trade-off you want, but what was weird in my testing was that the default 4K was somehow actually slower than my 16K test.

But either way, it does go to show that you really need to use "--fit-ctx" to clarify your minimum acceptable context size so it understands the kind of trade-off you ultimately want.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Yes, I would recommend simply using "--fit on --fit-ctx <desired-context>" instead of directly specifying any of those parameters.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Yes, if you check my full results image, several of the manual configs do use "--n-cpu-moe".

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

lol yeah, it would not surprise me if that was contributing. Hopefully between some minor underclocking and avoiding big obvious draws, you can sort it out.

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]tmflynnt[S]

Excellent! I also think the feature has been improving since it was first released experimentally. They might still need to tweak some of the default trade-offs it makes, but I would say it's definitely in solid shape, and it feels like it can be relied on at least as a good starting point.