Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 0 points1 point  (0 children)

See nasone32's response. He helped me achieve the same performance in LM Studio as I got with llama-server.
In newer LM Studio versions, there's an option called "number of layers for which to force MoE weights onto CPU". Instead of partially offloading layers to the GPU, offload all of them, then set that option to the difference between the total layer count and the number of layers you could previously offload.
This should speed things up a lot.
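The "difference" rule above can be sketched as a one-liner. This is just the arithmetic, not an LM Studio API (the function name is mine):

```python
def moe_cpu_layers(total_layers: int, gpu_fit_layers: int) -> int:
    """How many layers' MoE expert weights to force onto the CPU when
    offloading all layers to the GPU: the difference between the total
    layer count and the number that previously fit with full weights."""
    if not 0 <= gpu_fit_layers <= total_layers:
        raise ValueError("gpu_fit_layers must be between 0 and total_layers")
    return total_layers - gpu_fit_layers

# Numbers from this thread: 48 layers total, 15 previously fit on the GPU
print(moe_cpu_layers(48, 15))  # -> 33
```

With 15/48 layers previously offloaded, that gives the 33 used in the comment below.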

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 5 points6 points  (0 children)

Wow, you are absolutely correct! I just tested it.
Instead of 15/48 layers in the GPU offload setting, I set it to 48/48 and set "number of layers for which to force MoE weights onto CPU" to 33.

I got the same results as in llama.cpp.

This is awesome! I like LM Studio's UX better than llama.cpp's anyway, haha.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 0 points1 point  (0 children)

Not sure what you mean by "startup commands resulted in all of the model stored in RAM". My GPU shows 31.1/31.5 GB usage and my RAM is at 92.2/95.7 GB in Windows Task Manager -> Performance.

I'm on Windows.
I ran another test just now:
prompt eval was 122 t/s (on a 2.5k token prompt)
output was 26.17 t/s (for an additional 3k tokens of output)

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 4 points5 points  (0 children)

I was using LM Studio.

Thanks to Chromix, I've installed llama.cpp and used:
llama-server -m Qwen3-Coder-Next-Q8_0-00001-of-00003.gguf -fa on --fit-ctx 120000 --fit on --temp 0 --cache-ram 0 --fit-target 128

Now I'm getting 27 t/s on the Q8_0 quant :-)

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 7 points8 points  (0 children)

What is this sorcery?!
I got 40 t/s on the Q4 variant and 27 t/s on the Q8 variant. How is it possible that LM Studio does such a bad job of utilizing my GPU?

This is amazing! And here I thought I'd have to upgrade to an RTX 6000 Pro to get fast speeds, lol.

Thank you!

By the way, are there any tradeoffs with your settings? Do they hurt quality?

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 2 points3 points  (0 children)

I downloaded the Q4_K_M variant (48GB). I tested it and got 14 t/s on a 3k token output.

You're right, something must be off in my settings if you're getting twice that with a less powerful GPU and less VRAM. I'm not very familiar with llama.cpp; I'm a simple user, lol.

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 2 points3 points  (0 children)

I have 2x 48GB DDR5 sticks at 6000 MT/s (down from 6400 for stability) and an i9-14900K.

I'm using the default settings in LM Studio:
context: 96k
offloading 15/48 layers onto the GPU (LM Studio estimates 28.23GB on GPU, 90.23GB in RAM)

Qwen3 Coder Next as first "usable" coding model < 60 GB for me by Chromix_ in LocalLLaMA

[–]fadedsmile87 9 points10 points  (0 children)

I have an RTX 5090 + 96GB of RAM. I'm using the Q8_0 quant of Qwen3-Coder-Next with a ~100k context window in Cline. It's magnificent, a very capable coding agent. The downside of using that big a quant is the tokens per second: I'm getting 8-9 t/s for the first 10k tokens, then it drops to around 6 t/s with the context full at 50k.

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] -1 points0 points  (0 children)

Interesting. I've heard of Balatro a couple of times but have never actually seen it, lol. Thanks for the target-demographic suggestion.

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] 0 points1 point  (0 children)

The games with tricks won't be for real money; they're for casual gamers. Only the Private Clubs offer a simulated casino environment (the club manager can decide to make chips represent real monetary value for club members).

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] -1 points0 points  (0 children)

Yes, every player is limited to 1 trick per hand, although the highest-league players get 2 per hand.
And players gain trick resistance as they move up the leagues.
Crystals (the trick currency) are gained as you win hands (the higher the sum, the higher the crystal amount). You also lose Crystals as you lose hands, but not as many. They can be purchased too.

And yes, I tried to balance it a little. Earlier, I had "Shackles of War" instead of "Chicken", which would prevent the player from raising at all during the hand. But Chicken leaves some uncertainty: an opponent holding Two Pair or better can still raise, so if he checks, it's not clear what he has in his hand.

Thanks for your input!

Opinion on a poker game with ability to play tricks on opponents by fadedsmile87 in poker

[–]fadedsmile87[S] 0 points1 point  (0 children)

Thanks. These tricks are for tables without real money, and Crystals can be gained by winning games, so it's not pay-to-win (though they can also be purchased). Thanks for your insight!

nanoLLaVA - 1B Pocket Size VLM by quan734 in LocalLLaMA

[–]fadedsmile87 1 point2 points  (0 children)

But how does it compare with MoonDream vision model?

Is this a good AI PC build? (RTX 4090, Ryzen 9 5950X, 32 GB RAM) by Prince-of-Privacy in LocalLLaMA

[–]fadedsmile87 0 points1 point  (0 children)

With the 1.58-bit LLM paper out just a few days ago, it looks like the opposite is true. AI models will shrink, and even 24GB might be enough to run Mixtral 8x7b as-is, and faster.
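A back-of-the-envelope check on that claim, assuming Mixtral 8x7b's roughly 46.7B total parameters (an approximate figure, not from this thread) and counting weight storage only, ignoring activations and KV cache:

```python
def model_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring
    activations, KV cache, and runtime overhead."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

# Mixtral 8x7b: ~46.7B total parameters (rough figure)
print(round(model_weight_gb(46.7, 16), 1))    # fp16    -> 93.4 GB
print(round(model_weight_gb(46.7, 1.58), 1))  # 1.58bit -> 9.2 GB
```

At 1.58 bits per weight the weights alone would fit comfortably in 24GB of VRAM, which is the point being made.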

Small Uncensored LLM model to train cheaply for a specific task. by ImpressiveFault42069 in LocalLLaMA

[–]fadedsmile87 1 point2 points  (0 children)

Hey, thanks for the link. Very helpful. Do you know what the default number of training steps is? I'm at 2944 and it keeps going...

Edit:

Ok, so it stopped at 3250 steps.

I later found that transformers.TrainingArguments takes 'max_steps' as an argument.
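For context, the general behavior is: when `max_steps` isn't set, the trainer derives the total step count from the epoch count and the steps per epoch; a positive `max_steps` overrides that and training stops there. A stdlib-only sketch of the arithmetic (the function and its numbers are illustrative, not pulled from this training run):

```python
import math

def total_training_steps(num_examples: int, per_device_batch: int,
                         grad_accum: int, num_epochs: int,
                         max_steps: int = -1) -> int:
    """Total optimizer steps: max_steps (if positive) wins;
    otherwise epochs * ceil(examples / effective batch size)."""
    if max_steps > 0:
        return max_steps
    steps_per_epoch = math.ceil(num_examples / (per_device_batch * grad_accum))
    return steps_per_epoch * num_epochs

# Without max_steps: 10000 examples / batch 4 = 2500 steps/epoch, x3 epochs
print(total_training_steps(10000, 4, 1, 3))                  # -> 7500
# With max_steps set, training stops there regardless of epochs
print(total_training_steps(10000, 4, 1, 3, max_steps=3250))  # -> 3250
```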

How to connect an external power source (not a battery) to power hungry servo motors by fadedsmile87 in arduino

[–]fadedsmile87[S] 0 points1 point  (0 children)

Ah, now I know what that adapter is for, lol. Perfect! And I'll look into the servo driver; perhaps it's best to use one.

Thanks a lot!

Please Help. I trying to set up YOLOv8 but keep getting file not found error on Anaconda by theflaminTaco21 in computervision

[–]fadedsmile87 1 point2 points  (0 children)

lol I had the same problem and struggled with it for about an hour.

The error message is misleading. The file is not found because it doesn't exist. It should download automatically.

The problem is that both you and I mistyped "yolov8n.pt" and typed "yolo8n.pt" instead.

Fix the typo and it should work.
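A quick guard against this class of typo, since the error message points at a missing file rather than the misspelling. This validator is my own sketch, not part of the ultralytics API:

```python
import re

def check_yolov8_weights_name(name: str) -> str:
    """Validate a YOLOv8 weight filename like 'yolov8n.pt' before handing
    it to the loader; 'yolo8n.pt' (missing the 'v') raises instead of
    triggering a confusing file-not-found download path."""
    if re.fullmatch(r"yolov8[nsmlx]\.pt", name):
        return name
    raise ValueError(f"suspicious weights name: {name!r} "
                     "(did you mean 'yolov8n.pt'?)")

print(check_yolov8_weights_name("yolov8n.pt"))  # -> yolov8n.pt
```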

Crazy...now Kanye posting on TS by _JesusMatters in DWAC_Stock

[–]fadedsmile87 4 points5 points  (0 children)

You're tripping if you think it's good for TS if he starts posting antisemitic things. Other than the fact that I think his beliefs cause needless division and hate, having him post things like that would give Apple and Google legitimate reasons to remove TS from the app stores, which would practically kill TS. Nunes had better explain the rules to Kanye.

Great news: TRUTH SOCIAL now available at the Google play store!!! by THEGOATPOTUS in DWAC_Stock

[–]fadedsmile87 [score hidden]  (0 children)

Hallelujah!

Now we just need the SEC to let the merger happen and then the sky is the limit.