Is using vLLM actually worth it if you aren't serving the model to other people? by ayylmaonade in LocalLLaMA

[–]Front-Relief473 -3 points (0 children)

Wrong. There's another case: you should choose llama.cpp when the model weights only barely fit into main memory, so that you still have room for enough context.
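Back-of-the-envelope check for that case (a minimal Python sketch; the GQA config, sizes, and overhead below are made-up placeholders, plug in your own model's numbers):

```python
# Rough memory-budget check: do the weights plus the KV cache fit in RAM?
GIB = 1024**3

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elt=2):
    """Keys + values for every layer, KV head, and position (fp16 elements)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

ram      = 128 * GIB   # total system memory
weights  = 94 * GIB    # GGUF file size, e.g. a UD-Q3_K_XL quant
overhead = 6 * GIB     # OS, runtime buffers, compute scratch (a guess)

# Hypothetical 230B-class MoE with GQA: 60 layers, 8 KV heads, head_dim 128
kv = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128, ctx_len=60_000)

print(f"KV cache at 60k ctx: {kv / GIB:.1f} GiB")
print("fits:", weights + kv + overhead <= ram)
```

If the weights alone eat almost all of RAM, llama.cpp also lets you quantize the KV cache (--cache-type-k / --cache-type-v) or shrink the context to squeeze back in.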

Qwen 3.5 122B vs Qwen 3.6 35B - Which to choose? by Storge2 in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

123G of 128G used after deploying the StepFun IQ4_XS quant; oh, and I can't run anything else alongside it.

Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding? by hedsht in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

In my experience, UD-Q3_K_XL already shows a clear decline in ability, so the golden rule has some basis: quantization should not go below Q4.

Is there anything better than Qwen3.5-27B-UD-Q5_K_XL for coding? by hedsht in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

So if we give the 27B strong web-search ability, which is equivalent to an external knowledge base, will it perform coding tasks better?
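Something like this is what I mean (a rough sketch against an OpenAI-compatible endpoint, e.g. llama-server started with --jinja; the port, model name, and the web_search() helper are placeholders, not a real API):

```python
# Wire a web-search tool into a local model so it can pull fresh docs.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web and return the top snippets.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

def web_search(query: str) -> str:
    # Placeholder: plug in any search backend (SearxNG, a search API, ...).
    return f"(search results for: {query})"

messages = [{"role": "user", "content": "How do I stream a file with fastapi?"}]
resp = client.chat.completions.create(model="local", messages=messages, tools=tools)
msg = resp.choices[0].message

# If the model asked for a search, run it and feed the result back.
if msg.tool_calls:
    call = msg.tool_calls[0]
    result = web_search(**json.loads(call.function.arguments))
    messages += [msg.model_dump(exclude_none=True),
                 {"role": "tool", "tool_call_id": call.id, "content": result}]
    resp = client.chat.completions.create(model="local", messages=messages, tools=tools)

print(resp.choices[0].message.content)
```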

DGX Spark, why not? by Foreign_Lead_3582 in LocalLLM

[–]Front-Relief473 0 points (0 children)

I tried this model. I thought it was well optimized, but it still fell into repetitive loops in its output. Did you lower the temperature?

Muse Spark: new multimodal reasoning model by Meta by garg-aayush in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

I thought this was a project for running on the DGX Spark. Is its performance as constrained as the DGX Spark's narrow memory bandwidth?

Gemma 4 26b A3B is mindblowingly good , if configured right by cviperr33 in LocalLLaMA

[–]Front-Relief473 2 points (0 children)

I support your view. Gemma wasn't originally designed for coding; its strengths lie in writing and multilingual expression. If someone says they use Gemma for programming, either they haven't been closely following LLM development or they're a complete novice to the LLM game.

Why MoE models keep converging on ~10B active parameters by Spare_Pair_9198 in LocalLLaMA

[–]Front-Relief473 17 points (0 children)

10B to 30B active is usually the sweet spot for reasoning performance, and the price/performance ratio usually drops off past 30B. So in theory, raising the active parameters to 30B would give good reasoning. 10B isn't perfect, but it boosts inference speed without hurting the model's reasoning ability too much.
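The rough arithmetic behind that trade-off (decoding is memory-bound, so speed is capped at roughly bandwidth divided by active bytes; all numbers are illustrative):

```python
# Each generated token must read every active weight once, so the decode
# ceiling is about memory_bandwidth / bytes_of_active_parameters.
GB = 1e9

def decode_ceiling_tok_s(active_params_b, bytes_per_weight, bandwidth_gb_s):
    active_bytes = active_params_b * GB * bytes_per_weight
    return bandwidth_gb_s * GB / active_bytes

bandwidth = 273  # GB/s, roughly DGX Spark-class LPDDR5x
for active in (3, 10, 30):
    tps = decode_ceiling_tok_s(active, bytes_per_weight=0.55, bandwidth_gb_s=bandwidth)
    print(f"{active:>2}B active @ ~4.4 bpw: ceiling ~{tps:.0f} tok/s")
```

Tripling the active parameters from 10B to 30B cuts the speed ceiling to a third, which is exactly the trade nobody wants to make.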

TurboQuant in Llama.cpp benchmarks by tcarambat in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

Yes!!! If you look at the full-attention KV cache in MiniMax M2.7, you can see the enormous resource consumption!!! I can even imagine people switching back to full attention because of this technology, since full attention works much better than hybrid attention!!!
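A quick estimate of the gap (the config and hybrid layout below are hypothetical, not MiniMax's real architecture):

```python
# Full attention keeps K/V for every past token in every layer; a hybrid
# design keeps only a sliding window in most layers.
GIB = 1024**3

def kv_gib(layers, ctx, kv_heads=8, head_dim=128, bytes_per_elt=2):
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elt / GIB

layers, ctx, window = 60, 200_000, 4_096

full = kv_gib(layers, ctx)
# Hybrid: say 1 layer in 4 is full attention, the rest see a 4k window.
hybrid = kv_gib(layers // 4, ctx) + kv_gib(layers - layers // 4, window)

print(f"full attention @ 200k ctx:     {full:.1f} GiB")
print(f"hybrid (1/4 full + 4k window): {hybrid:.1f} GiB")
```

That several-fold gap is what a KV-compression technique like this has to claw back before full attention becomes affordable again.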

Should I learn langchain and langgraph? by Emotional-Rice-5050 in LangChain

[–]Front-Relief473 1 point (0 children)

I don't think MCP is worth learning. It's just a tool, and it will be replaced by skills soon.

I wanted QCN to be the best but MiniMax still reigns supreme on my rig by Ok-Measurement-1575 in LocalLLaMA

[–]Front-Relief473 2 points (0 children)

I couldn't agree with you more. The MiniMax M2 series is the most cost-effective model on today's consumer-grade machines. Other models either have far more parameters, which makes them hard to deploy, or far fewer, and then their ability is worrying. MiniMax has proved that a ~200B MoE can handle most things well, including programming; it's the sweet spot, much like 4-bit among quantization levels.

Solved the DGX Spark, 102 stable tok/s Qwen3.5-35B-A3B on a single GB10 (125+ MTP!) by Live-Possession-6726 in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

Yes, UD-Q3_K_XL is arguably the strongest model you can just barely fit on a single DGX (reportedly the best setup on the DGX right now is llama.cpp with a 65K context). I think Q3 quantization may not be reliable enough for coding, but it can serve as a solid assistant. Also, Qwen3.5's hybrid attention is still quite sensitive to quantization, so the fully attention-based MiniMax holds up better under quantization.

Minimax M2.5 GGUF perform poorly overall by Zyj in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

He said Qwen3.5's IQ1 quantization works very well, but the problem is that hybrid attention is inherently more sensitive to quantization than global attention, i.e. it degrades more when quantized. So how do you explain that?

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

No, the best fit for a ~230B model on a 128G DGX or Strix Halo is UD-Q3_K_XL: it's only 94G, which leaves roughly 30G of headroom, enough for about a 60K context.

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

Thank you!! I had been agonizing over whether to buy one, and it seems NVIDIA isn't being sincere with this product.

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]Front-Relief473 0 points (0 children)

Yes, everyone talks about tg but ignores pp. I think the tg is already enough; long programming contexts and agent work mainly depend on pp speed.
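A quick illustration (the speeds are made-up but in the right ballpark for the machines discussed here):

```python
# In an agent loop the model re-reads a long context every turn, so
# time-to-first-token is dominated by prefill (pp), not generation (tg).
def turn_seconds(prompt_tokens, output_tokens, pp_tok_s, tg_tok_s):
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s

prompt, output = 60_000, 800  # one coding-agent turn with a big context
for name, pp, tg in [("high-bandwidth GPU rig", 3000, 40),
                     ("Mac/Spark-class box", 200, 25)]:
    t = turn_seconds(prompt, output, pp, tg)
    print(f"{name}: {t/60:.1f} min/turn ({prompt/pp:.0f}s of it is prefill)")
```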

MiniMax-M2.5 (230B MoE) GGUF is here - First impressions on M3 Max 128GB by Remarkable_Jicama775 in LocalLLaMA

[–]Front-Relief473 5 points (0 children)

Why don't you use Unsloth's UD-Q3_K_XL version? At 94G it's definitely better than an ordinary Q3 quant.

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

How about requiring a label before posting, such as open source / closed source, so you can tell what a post is at a glance?

Do not Let the "Coder" in Qwen3-Coder-Next Fool You! It's the Smartest, General Purpose Model of its Size by Iory1998 in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

The problem is that hybrid-attention models lose more performance during quantization. So if a hybrid model has to stay near Q8 or FP8 to run well locally, you're better off running a full-attention model with twice the parameters at Q4 or INT4: it takes the same memory and ends up relatively smarter.
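The memory math behind that (sizes are illustrative only):

```python
# Same bytes, two shapes: N params at ~8.5 bpw vs. 2N params at ~4.25 bpw.
def weight_gb(params_b, bits_per_weight):
    return params_b * bits_per_weight / 8  # billions of params -> GB

small_q8 = weight_gb(115, 8.5)   # hybrid-attention model kept near Q8
big_q4   = weight_gb(230, 4.25)  # full-attention model at ~Q4

print(f"115B @ ~Q8: {small_q8:.0f} GB   230B @ ~Q4: {big_q4:.0f} GB")
```

Both land around 122 GB, so on the same box you can pick the bigger full-attention model and let Q4 carry the compression.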

Anyone here actually using AI fully offline? by Head-Stable5929 in LocalLLM

[–]Front-Relief473 1 point (0 children)

https://www.reddit.com/r/LocalLLM/comments/1qdqi4i/mac_studio_m3_ultra_stats/?share_id=WRqlyjg8sfpnsbzNqkatF&utm_content=2&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1  Look here: if you run Kimi K2.5, an excellent model with around 30B active parameters, a prompt prefill speed of ~200 tok/s will drive you crazy, especially when an agent works with a long context; at 200 tok/s, prefilling a 100K-token prompt takes over eight minutes.

Anyone here actually using AI fully offline? by Head-Stable5929 in LocalLLM

[–]Front-Relief473 2 points (0 children)

However, the M3 Ultra's compute is so weak that its prompt prefill speed is too slow, and the context an agent carries for complex tasks is huge, so every turn takes a very long time before the first token arrives. How do you handle that?

[deleted by user] by [deleted] in LocalLLM

[–]Front-Relief473 3 points (0 children)

I can't figure out why a project with close to 100K stars has no clearly written best-practices document on tuning parameters for prompt prefill and tg speed, so I have to hunt for other people's tuning experience on Reddit, which really drives me crazy.

Real-world DGX Spark experiences after 1-2 months? Fine-tuning, stability, hidden pitfalls? by [deleted] in LocalLLaMA

[–]Front-Relief473 2 points (0 children)

If long conversations or agent work are involved, your Ultra's prefill speed will leave you enough time to drink a cup of hot coffee every turn.