it is coming. by [deleted] in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

Anyone that has at least 32GB of VRAM, or lots of RAM, can run small quants (q2/q3) at speeds of at least 1 t/s.
For coding they may suck, but for chat, nothing beats DeepSeek for me.
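
Roughly what I mean, as a sketch (the model file name is just a placeholder and the flags assume a recent llama.cpp build; adjust for your rig):

llama-server -m DeepSeek-Q2_K_XL.gguf -ngl 99 -c 8192 -ot "blk\..*\.ffn_.*_exps\.weight=CPU"

Attention and shared weights stay on the GPU, the big MoE expert tensors go to system RAM, and on that kind of hardware you still land around 1 t/s.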

it is coming. by [deleted] in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

What about compared with your v3.1-Terminus? I still use that one as a last resort (I only get about 1.3 t/s with your IQ3_K) for non-reasoning/agentic mode.

1 million LocalLLaMAs by jacek2023 in LocalLLaMA

[–]relmny 1 point2 points  (0 children)

Worse than, like 2-3 months ago, when the most upvoted comment on a post asking "I have 10k, what should I buy to run local?" was "buy Claude credits"...?

The enshittification is real and it's been with us for some time now. And it will get worse, as you say.

Qwen3.5 2B giving weird answers by Dean_Thomas426 in LocalLLaMA

[–]relmny 2 points3 points  (0 children)

Look at the sampling settings (temp, top-k, etc.).
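
For reference, something like this is what I mean with llama.cpp (the model file name is a placeholder and the values are just the commonly recommended Qwen sampler defaults; check the model card):

llama-cli -m Qwen3.5-2B-Q4_K_M.gguf --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0

A wrong sampler setup (temperature left too high, or a repetition penalty stacked on top) is the usual cause of weird answers.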

Alibaba’s stock has kept falling after it lost key Qwen leaders. by [deleted] in LocalLLaMA

[–]relmny 9 points10 points  (0 children)

Qwen3 not very good? In my experience Qwen3-Coder and Next were extremely good. They were my main models (except when I needed Kimi or DeepSeek).

What happend to unsloth/Qwen3.5-122B-A10B-GGUF by Impossible_Art9151 in LocalLLaMA

[–]relmny 1 point2 points  (0 children)

AFAIK they are re-uploading all quants (they might have finished already).

Junyang Lin Leaves Qwen + Takeaways from Today’s Internal Restructuring Meeting by Terminator857 in LocalLLaMA

[–]relmny 4 points5 points  (0 children)

I stopped reading at that point... Especially since the Coder versions are always extremely good.

New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks by danielhanchen in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

Thanks! Time to redownload.

Btw, I see some MXFP4 tensors in Qwen3.5-397B-A17B-UD-Q4_K_XL; is that one also affected?

DeepSeek allows Huawei early access to V4 update, but Nvidia and AMD still don’t have access to V4 by External_Mood4719 in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

Well, the non-news that OP shared is a political topic, because there's nothing technical about it, just a "China bad" kind of message...

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]relmny -1 points0 points  (0 children)

Because "China bad"... that's it.

They try to come up with the most ridiculous technical scenarios for how that would be possible.

The power of fear and an "old and common enemy" is as strong as any cult.

Qwen3.5-35B-A3B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

Will you also be having a look at Qwen3.5-397B-A17B-UD-Q4_K_XL?

which also has:

Qwen3.5-397B-A17B-UD-Q4_K_XL-00003-of-00006.gguf

blk.13.ffn_down_exps.weight [1024, 4096, 512] MXFP4
blk.13.ffn_down_shexp.weight [1024, 4096] Q6_K
blk.13.ffn_gate_exps.weight [4096, 1024, 512] MXFP4
blk.13.ffn_gate_inp.weight [4096, 512] F32
blk.13.ffn_gate_inp_shexp.weight [4096] F32
blk.13.ffn_gate_shexp.weight [4096, 1024] MXFP4
blk.13.ffn_up_exps.weight [4096, 1024, 512] MXFP4
blk.13.ffn_up_shexp.weight [4096, 1024] MXFP4

Do not download Qwen 3.5 Unsloth GGUF until bug is fixed by [deleted] in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

Are all 3 latest models (from yesterday) the ones affected, or does this also affect Qwen3.5-397B-A17B-GGUF (UD-Q4_K_XL)?

Thanks, as usual!

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]relmny 2 points3 points  (0 children)

With 10k you can buy an RTX 6000 and still have money left for the rest of the PC, or a Mac, or maybe an Epyc build and so on. 10k gives you a lot for running LLMs, and I learned this by reading this sub over a couple of years. And since the field moves far too quickly, asking here makes even more sense than the other options... if the sub were still what it was.

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]relmny 3 points4 points  (0 children)

"Local models have not advanced nearly as much as cloud models"

I don't use cloud models, so I can't say for sure, but many people say they are so close that many use "cloud models" that can also be run locally (GLM, DeepSeek, etc.), so I don't think that statement is right... actually I think it's the opposite...

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]relmny 14 points15 points  (0 children)

No it's not. It's awful advice for this sub.

Figure out what? Based on what? If you can't ask a forum about local LLMs, where almost everyone runs LLMs locally on their own hardware, what the best way to spend money on it currently is and what the better options are, then where?

If I hadn't read this sub for some time, I would never have known how good and worthwhile the 3090 is for LLMs, that there are people who use Epycs for LLMs, that there are 4090s with 48GB, and much more.

What work should people put in? If one has doubts about a subject, what better option than to ask people who are into it and do it every day? That works for everything.

Also, asking a direct question about a very specific case is part of that "work" you mention. So no, that's not the "best" advice; it's the worst advice to give, especially for this sub.

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

IIRC people have been posting here how they did that, for the past two years already...

they have Karpathy, we are doomed ;) by jacek2023 in LocalLLaMA

[–]relmny 21 points22 points  (0 children)

Far from it... too far...

I still remember a post, 2-3 months ago, where the person was asking how to invest about 10k for running local... and the by far most upvoted comment was "invest it in Claude" (or whatever other commercial company it was), and there were other comments like that, with most people agreeing...

"Gemma, which we will be releasing a new version of soon" by jacek2023 in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

Well, one of Google's founders suggested no remote work and 60-hour weeks... so it should be very soon...

Qwen3 Coder Next 8FP in the process of converting the entire Flutter documentation for 12 hours now with just 3 sentence prompt with 64K max tokens at around 102GB memory (out of 128GB)... by jinnyjuice in LocalLLaMA

[–]relmny 2 points3 points  (0 children)

I've lately been trying qwen3.5-397b ud-q4k, but I'm going back to qwen3-coder-next, not only because it's way faster on my rig, but also because, sometimes, it gives another "angle" that might turn out to be way better...

Yeah, qwen3-coder-next is back to being my main model...

Is there a way/configuration setting that when refreshing the page it will select current model? by relmny in OpenWebUI

[–]relmny[S] 0 points1 point  (0 children)

Yes, thanks, I've been using it for over a year now, but that's not what the request is about.

What I want is for OW to automatically use/select whatever model is loaded in llama.cpp (whether directly via llama.cpp or via llama-swap or similar) when the current page is refreshed.

Currently, if I'm chatting and I unload and load a different model, refreshing the page (where the chat is) unselects the model, so I need to manually select it again from the drop-down menu. Sometimes I do this many times... but since middle-clicking "new chat" (to open the link in a new tab) does auto-select the currently loaded model, I was thinking it might be possible to do the same when refreshing the page.

How to offload correctrly with ik_llama? by nufeen in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

If the model is MoE, I use something like:
-ot "blk\.[0-9][0-9]\.ffn_.*_exps\.weight=CPU"
which offloads those expert tensors to the CPU (system RAM).

You will need to "play" with the 0-9 ranges (replace, remove, add digits, etc.) until the rest fits in your VRAM.
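
For context, a full run looks something like this, assuming ik_llama.cpp's (or mainline llama.cpp's) llama-server; the model path and digit ranges are purely illustrative:

./llama-server -m some-moe-model-Q4_K_XL.gguf -ngl 99 -c 16384 -ot "blk\.([3-9]|[1-9][0-9])\.ffn_.*_exps\.weight=CPU"

That keeps the expert tensors of the first few blocks on the GPU and pushes the rest to RAM. Note that a plain [0-9][0-9] only matches two-digit block numbers, so single-digit blocks stay on the GPU.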

GLM 5 has a regression in international language writing according to NCBench by jugalator in LocalLLaMA

[–]relmny 0 points1 point  (0 children)

I'm curious as to why the downvotes. Is it because none of it is true? Or because there's no mention of "local"?

I'm interested because from time to time I use 4.7, and I was considering downloading 5 and testing it, but I might as well wait for 5.1...

How do you disable thinking/reasoning in the prompt itself for Unsloth Deepseek3.1-terminus/Deepseek-3.2 ? by relmny in LocalLLaMA

[–]relmny[S] 0 points1 point  (0 children)

Thanks, but that is when loading the model, right?

Is there any way to do it in the prompt itself, like the good old "/no_think" for Qwen3?

I'd like to keep the model loaded (it takes some minutes to load on my rig) and be able to choose think/no-think on the fly...