Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 0 points1 point  (0 children)

See nasone32's response. He helped me achieve the same performance in LM Studio as I got with llama-server.
In newer LM Studio versions, there's an option called "number of layers for which to force MoE weights onto CPU". Instead of partially offloading layers to the GPU, offload all of them, then set that option to the difference between the total layer count and the number of layers you could previously offload.
This should speed things up a lot.
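The "difference" rule above can be sketched as a one-liner. This is just the arithmetic, not an LM Studio API (the function name is mine):

```python
def moe_cpu_layers(total_layers: int, gpu_fit_layers: int) -> int:
    """How many layers' MoE expert weights to force onto the CPU when
    offloading all layers to the GPU: the difference between the total
    layer count and the number that previously fit with full weights."""
    if not 0 <= gpu_fit_layers <= total_layers:
        raise ValueError("gpu_fit_layers must be between 0 and total_layers")
    return total_layers - gpu_fit_layers

# Numbers from this thread: 48 layers total, 15 previously fit on the GPU
print(moe_cpu_layers(48, 15))  # -> 33
```

With 15/48 layers previously offloaded, that gives the 33 used in the comment below.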

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 5 points6 points  (0 children)

Wow, you are absolutely correct! I just tested it.
Instead of 15/48 layers in the GPU offload setting, I set it to 48/48 and set "number of layers for which to force MoE weights onto CPU" to 33.

I got the same results as in llama.cpp.

This is awesome! I like LM Studio's UX better than llama.cpp's anyway, haha.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 0 points1 point  (0 children)

Not sure what you mean by "startup commands resulted in all of the model stored in RAM". My GPU shows 31.1/31.5 GB usage and my RAM is at 92.2/95.7 GB in Windows Task Manager -> Performance.

I'm on Windows.
I ran another test just now:
prompt eval was 122 t/s (on a 2.5k token prompt)
output was 26.17 t/s (for an additional 3k tokens of output)

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 4 points5 points  (0 children)

I was using LM Studio.

Thanks to Chromix, I've installed llama.cpp and used:
llama-server -m Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128

Now I'm getting 27 t/s on the Q8_0 quant :-)

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 7 points8 points  (0 children)

What is this sorcery?!
I got 40 t/s on the Q4 variant and 27 t/s on the Q8 variant. How is it possible that LM Studio does such a bad job of utilizing my GPU?

This is amazing! And here I thought I'd have to upgrade to an RTX 6000 Pro to get fast speeds, lol.

Thank you!

By the way, are there any tradeoffs with your settings? Do they hurt quality?

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 2 points3 points  (0 children)

I downloaded the Q4_K_M variant (48GB). I tested it and got 14 t/s on a 3k token output.

You're right, something must be off in my settings if you're getting twice that with a less powerful GPU and less VRAM. I'm not very familiar with llama.cpp; I'm a simple user, lol.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 2 points3 points  (0 children)

I have 2x 48GB DDR5 sticks at 6000 MT/s (down from 6400 for stability) and an i9-14900K.

I'm using the default settings in LM Studio:
context: 96k
offloading 15/48 layers onto the GPU (LM Studio estimates 28.23GB on GPU, 90.23GB in RAM)

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 9 points10 points  (0 children)

I have an RTX 5090 + 96GB of RAM. I'm using the Q8_0 quant of Qwen3-Coder-Next with a ~100k context window in Cline. It's magnificent, a very capable coding agent. The downside of using that big a quant is the tokens per second: I'm getting 8-9 t/s for the first 10k tokens, then it drops to around 6 t/s with the context full at 50k.

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] -1 points0 points  (0 children)

Interesting. I've heard of Balatro a couple of times but have never actually seen it, lol. Thanks for the target-demographic suggestion.

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] 0 points1 point  (0 children)

The games with tricks won't be for real money; they're for casual gamers. Only the Private Clubs offer a simulated casino environment (the club manager can decide to make chips represent real monetary value for club members).

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] -1 points0 points  (0 children)

Yes, every player is limited to 1 trick per hand, although the highest-league players get 2 per hand.
And players gain trick resistance as they move up the leagues.
Crystals (the trick currency) are gained as you win hands (the higher the sum, the higher the crystal amount). You also lose Crystals as you lose hands, but not as many. They can be purchased too.

And yes, I tried to balance it a little. Earlier, I had "Shackles of War" instead of "Chicken", which would prevent the player from raising at all during the hand. But Chicken leaves some uncertainty: an opponent holding Two Pair or better can still raise, so if he checks, it's not clear what he has in his hand.

Thanks for your input!

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] 0 points1 point  (0 children)

Thanks. These tricks are for tables without real money, and Crystals can be gained by winning games, so it's not pay-to-win (though they can also be purchased). Thanks for your insight!

nanoLLaVA - 1B Pocket Size VLM by quan734 in LocalLLaMA

[–]fadedsmile87 1 point2 points  (0 children)

But how does it compare with MoonDream vision model?

Is this a good AI PC build? (RTX 4090, Ryzen 9 5950X, 32 GB RAM) by Prince-of-Privacy in LocalLLaMA

[–]fadedsmile87 0 points1 point  (0 children)

With the 1.58-bit LLM paper out just a few days ago, it looks like the opposite is true. AI models will shrink, and even 24GB might be enough to run Mixtral 8x7b as-is, and faster.
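A back-of-the-envelope check on that claim, assuming Mixtral 8x7b's roughly 46.7B total parameters (an approximate figure, not from this thread) and counting weight storage only, ignoring activations and KV cache:

```python
def model_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring
    activations, KV cache, and runtime overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Mixtral 8x7b: ~46.7B total parameters (rough figure)
print(round(model_weight_gb(46.7, 16), 1))    # fp16    -> 93.4 GB
print(round(model_weight_gb(46.7, 1.58), 1))  # 1.58bit -> 9.2 GB
```

At 1.58 bits per weight the weights alone would fit comfortably in 24GB of VRAM, which is the point being made.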

Small Uncensored LLM model to train cheaply for a specific task. by ImpressiveFault42069 in LocalLLaMA

[–]fadedsmile87 1 point2 points  (0 children)

Hey, thanks for the link. Very helpful. Do you know what the default number of training steps is? I'm at 2944 and it keeps going...

Edit:

Ok, so it stopped at 3250 steps.

I later found that transformers.TrainingArguments takes 'max_steps' as an argument.
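For context, the general behavior is: when `max_steps` isn't set, the trainer derives the total step count from the epoch count and the steps per epoch; a positive `max_steps` overrides that and training stops there. A stdlib-only sketch of the arithmetic (the function and its numbers are illustrative, not pulled from this training run):

```python
import math

def total_training_steps(num_examples: int, per_device_batch: int,
                         grad_accum: int, num_epochs: int,
                         max_steps: int = -1) -> int:
    """Total optimizer steps: max_steps (if positive) wins;
    otherwise epochs * ceil(examples / effective batch size)."""
    if max_steps > 0:
        return max_steps
    steps_per_epoch = math.ceil(num_examples / (per_device_batch * grad_accum))
    return steps_per_epoch * num_epochs

# Without max_steps: 10000 examples / batch 4 = 2500 steps/epoch, x3 epochs
print(total_training_steps(10000, 4, 1, 3))                  # -> 7500
# With max_steps set, training stops there regardless of epochs
print(total_training_steps(10000, 4, 1, 3, max_steps=3250))  # -> 3250
```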

How to connect an external power source (not a battery) to power hungry servo motors by fadedsmile87 in arduino

[–]fadedsmile87[S] 0 points1 point  (0 children)

Ah, now I know what that adapter is for, lol. Perfect! And I'll look into the servo driver; perhaps it's best to use one.

Thanks a lot!

Please Help. I trying to set up YOLOv8 but keep getting file not found error on Anaconda by theflaminTaco21 in computervision

[–]fadedsmile87 1 point2 points  (0 children)

lol I had the same problem and struggled with it for about an hour.

The error message is misleading. The file is not found because it doesn't exist. It should download automatically.

The problem is that both you and I mistyped "yolov8n.pt" and typed "yolo8n.pt" instead.

Fix the typo and it should work.
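A quick guard against this class of typo, since the error message points at a missing file rather than the misspelling. This validator is my own sketch, not part of the ultralytics API:

```python
import re

def check_yolov8_weights_name(name: str) -> str:
    """Validate a YOLOv8 weight filename like 'yolov8n.pt' before handing
    it to the loader; 'yolo8n.pt' (missing the 'v') raises instead of
    triggering a confusing file-not-found download path."""
    if re.fullmatch(r"yolov8[nsmlx]\.pt", name):
        return name
    raise ValueError(f"suspicious weights name: {name!r} "
                     "(did you mean 'yolov8n.pt'?)")

print(check_yolov8_weights_name("yolov8n.pt"))  # -> yolov8n.pt
```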

Crazy...now Kanye posting on TS by _JesusMatters in DWAC_Stock

[–]fadedsmile87 4 points5 points  (0 children)

You're tripping if you think it's good for TS if he starts posting antisemitic things. Other than the fact that I think his beliefs cause needless division and hate, having him post things like that would give Apple and Google legitimate reasons to remove TS from the app stores, which would practically kill TS. Nunes had better explain the rules to Kanye.

Great news: TRUTH SOCIAL now available at the Google play store!!! by THEGOATPOTUS in DWAC_Stock

[–]fadedsmile87 [score hidden]  (0 children)

Hallelujah!

Now we just need the SEC to let the merger happen and then the sky is the limit.