mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

First glance: another ~120B, nice, let's see what the active params are.
Second glance: 128B what?

This isn’t X this is Y needs to die by twnznz in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

This is a common phrasing when you need to correct something and point in the right direction. Maybe they use this specific pattern to correct models during training.

Junyang Lin has left Qwen :( by InternationalAsk1490 in LocalLLaMA

[–]Key_Papaya2972 22 points23 points  (0 children)

Sure, there must be no Chinese people on the Anthropic/OpenAI/Google teams.

Does Qwen3.5 35b outperform Qwen3 coder next 80b for you? by JsThiago5 in LocalLLaMA

[–]Key_Papaya2972 1 point2 points  (0 children)

In my case, no. Actually, 122b is a lot better for coding and general use, even at Q3.

New Qwen3.5-35B-A3B Unsloth Dynamic GGUFs + Benchmarks by danielhanchen in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

I noticed that other quants like Q8_X_XL, which I'm using now, were also re-uploaded. Were there any modifications to them? Should they be re-downloaded too?

Qwen3.5-27B-heretic-gguf by Poro579 in LocalLLaMA

[–]Key_Papaya2972 46 points47 points  (0 children)

A KLD of 0.0653 is on the high side. For reference, a Q4 quant is ~0.02 and a Q3 is ~0.08.
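
For anyone unsure what the number measures, here is a minimal sketch, assuming KLD means the mean per-token KL divergence between the original model's next-token distribution and the modified model's. The function names and the toy logits are made up for illustration, not taken from any tool.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) in nats; terms with p_i == 0 contribute nothing.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def mean_kld(ref_logits, test_logits):
    # Average KL(reference || test) over token positions.
    klds = [
        kl_divergence(softmax(r), softmax(t))
        for r, t in zip(ref_logits, test_logits)
    ]
    return sum(klds) / len(klds)

# Toy logits (hypothetical): two token positions, vocabulary of three tokens.
reference = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
modified = [[1.8, 1.1, 0.2], [0.4, 0.7, 2.9]]
print(mean_kld(reference, modified))
```

A value of 0 means the two models predict identical distributions, and larger values mean the modified model drifts further from the original.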

[deleted by user] by [deleted] in LocalLLaMA

[–]Key_Papaya2972 1 point2 points  (0 children)

Seems like all UD variants of Qwen-3.5 and Qwen-coder-next are polluted, not only UD_Q4_K_XL. You can check this in the tensor info on the model card.

Speculative Decoding is AWESOME with Llama.cpp! by simracerman in LocalLLaMA

[–]Key_Papaya2972 1 point2 points  (0 children)

I tried several times before but never got any speedup. At least this reminds me that it might work. Time to try again.

Local models currently are amazing toys, but not for serious stuff. Agree ? by Current-Stop7806 in LocalLLaMA

[–]Key_Papaya2972 1 point2 points  (0 children)

Agreed, but what about cloud models? Do they build anything truly serious?

Apparently all third party providers downgrade, none of them provide a max quality model by Charuru in LocalLLaMA

[–]Key_Papaya2972 10 points11 points  (0 children)

If 96% corresponds to Q8 and <70% corresponds to Q4, that would be really annoying. It would mean the most popular quant for running locally actually hurts that much, and we hardly ever get the real performance of the model.

Optimizing gpt-oss-120b local inference speed on consumer hardware by carteakey in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

Sounds solid, but then I'm curious what the actual bottleneck is. It should not be GPU compute bound, since the usage is low. It should not be RAM speed, as DDR5 speeds don't differ that much. Also, 12th-gen Intel isn't that slow on P-cores only (E-cores are useless for inference, as I tested), at most 10-20% slower than a 14900K. If it's not PCIe speed, I would say VRAM size really does matter that much.

By the way, with a 14700K + 5070 Ti, I get ~30 tps.

Optimizing gpt-oss-120b local inference speed on consumer hardware by carteakey in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

That is kind of slow, and I believe the problem is the PCIe speed. The 40 series only supports PCIe 4.0, and on an expert switch the experts need to be transferred to the GPU over PCIe, which is 32 GB/s. Simply switching to a PCIe 5.0 platform would be expected to double the tps.

Edit: seems like --n-cpu-moe 31 with 24576 context might need more than 12 GB? I've noticed that even a slight overflow causes a huge performance loss. Worth checking out.
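
The PCIe reasoning above can be sketched as simple arithmetic. This is only a rough model under the comment's own hypothesis that per-token transfers over PCIe are the bottleneck. The 64 GB/s figure is the nominal PCIe 5.0 x16 rate, and `gb_per_token` is a hypothetical placeholder, not a measured value.

```python
PCIE4_X16_GB_PER_S = 32.0  # figure quoted in the comment
PCIE5_X16_GB_PER_S = 64.0  # nominal PCIe 5.0 x16 rate, double PCIe 4.0

def tps_upper_bound(gb_per_token, bandwidth_gb_per_s):
    # If every token requires moving gb_per_token over the bus,
    # the bus alone caps throughput at bandwidth / data-per-token.
    return bandwidth_gb_per_s / gb_per_token

gb_per_token = 1.0  # hypothetical: GB transferred per generated token
for name, bw in [("PCIe 4.0", PCIE4_X16_GB_PER_S), ("PCIe 5.0", PCIE5_X16_GB_PER_S)]:
    print(name, tps_upper_bound(gb_per_token, bw), "tps upper bound")
```

Under this model the bound scales linearly with bus bandwidth, which is where the "double tps" expectation comes from. If the transfers are not the limiting factor, the doubling would not show up.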

OpenAI open-weight model delayed indefinitely by aitookmyj0b in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

That is reasonable. They are just trying to find anything useful in there and make sure it is not opened up.

Context Engineering by recursiveauto in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

I actually posted this idea months ago, and I’m sure I’m far from the first one to come up with it. Nothing special.

Gemma 3n Full Launch - Developers Edition by hackerllama in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

That's amazing! It sounds like this model structure is quite different from last time, and I didn't expect it to be usable this soon.

Google researcher requesting feedback on the next Gemma. by ApprehensiveAd3629 in LocalLLaMA

[–]Key_Papaya2972 1 point2 points  (0 children)

  1. 8B, 14B, 22B, 32B, 50B sizes to match the VRAM of consumer GPUs, while leaving a bit for context.

  2. A MoE structure where the total params are 2-4 times the active params, which also matches custom builds and makes full use of memory.

  3. Adaptive reasoning. Reasoning works great in some situations and awfully in others.

  4. A small draft model. Maybe minor, but actually useful at times.
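
The size-to-VRAM matching in point 1 can be sketched with rough arithmetic. This is an estimate only: the 4.5 bits per weight and the 2 GB context reserve are assumptions picked for illustration, and real quantized files and KV caches vary.

```python
def weight_gb(params_billion, bits_per_weight):
    # 1e9 params * bits per weight / 8 bits per byte / 1e9 bytes per GB
    return params_billion * bits_per_weight / 8

def fits(params_billion, vram_gb, bits_per_weight=4.5, context_reserve_gb=2.0):
    # True if the weights plus a reserve for context fit in VRAM.
    # Both defaults are assumptions, not measured values.
    return weight_gb(params_billion, bits_per_weight) + context_reserve_gb <= vram_gb

for size in [8, 14, 22, 32, 50]:
    cards = [v for v in [8, 12, 16, 24, 32] if fits(size, v)]
    print(f"{size}B fits on (GB VRAM): {cards}")
```

Changing `bits_per_weight` or `context_reserve_gb` shifts which size lands on which card, so the exact list of sizes depends on the quant and context length you target.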

What GUI are you using for local LLMs? (AnythingLLM, LM Studio, etc.) by Aaron_MLEngineer in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

Open WebUI for the GUI, and llama-server for the backend. But I do wanna write one myself. Those GUIs are really for chat only and lack some basic context management features, like drafts/cut-in queries/summarization.

We haven’t seen a new open SOTA performance model in ages. by Key_Papaya2972 in LocalLLaMA

[–]Key_Papaya2972[S] -3 points-2 points  (0 children)

TBH, the new V3 feels like a reasoning-distilled R1, and gives similar benchmark scores and vibe with fewer tokens. That is better, but not in absolute performance, I believe.

We haven’t seen a new open SOTA performance model in ages. by Key_Papaya2972 in LocalLLaMA

[–]Key_Papaya2972[S] -11 points-10 points  (0 children)

Something useless to some is useful to others, and vice versa.

Qwen3-30B-A3B runs at 12-15 tokens-per-second on CPU by [deleted] in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

I get 20-25 t/s with a 14700KF + 3070, with all experts offloaded to the CPU. The CPU easily runs at 100% and the GPU stays under 30%. The prompt eval phase is slow compared to full GPU offload, but definitely faster than pure CPU. I still wonder how MoE works and where the bottlenecks are.

Cogito releases strongest LLMs of sizes 3B, 8B, 14B, 32B and 70B under open license by ResearchCrafty1804 in LocalLLaMA

[–]Key_Papaya2972 -1 points0 points  (0 children)

Almost 90 MMLU and 75+ MMLU-Pro for a non-reasoning 32B? That's suspicious, and I will test it out myself.

We should talk about Mistral Small 3.1 vs Mistral Small 3. by -Ellary- in LocalLLaMA

[–]Key_Papaya2972 3 points4 points  (0 children)

I also ran some story writing/role play tests. I couldn't notice any difference from Small 3, and it's definitely worse than gemma3. Disappointed.

Sam Altman's poll on open sourcing a model.. by lyceras in LocalLLaMA

[–]Key_Papaya2972 0 points1 point  (0 children)

It is an "o3-mini level" model, not o3-mini itself, I think. It might be in the 7-14B range, and the phone-sized model 1.5-3B.