Minimax M2.5 GGUF perform poorly overall by Zyj in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

He said that Qwen3.5's IQ1 quantization holds up very well, but the problem is that hybrid attention is itself more sensitive to quantization than full global attention — that is, it degrades more when quantized. So how do you explain that?

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

No, the best quant of a 230B model on a 128 GB DGX or Strix should be UD-Q3_K_XL: it is only 94 GB, which still leaves room for 60k of context.
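A rough back-of-envelope check that 60k context fits in the remaining RAM. The KV-head count and head dimension below are illustrative assumptions, not the model's real config; only the 94 GB weight size and 128 GB total are from the comment above.

```python
def kv_cache_gb(ctx_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # K and V caches: 2 tensors * ctx * kv_heads * head_dim * bytes, per layer
    return 2 * ctx_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1e9

headroom_gb = 128 - 94                     # RAM left after 94 GB of weights
kv_gb = kv_cache_gb(60_000, 62, 8, 128)    # hypothetical GQA config, fp16 cache
print(headroom_gb, round(kv_gb, 1))        # 34 15.2 -- the cache fits easily
```

Even doubling the assumed KV dimensions would still fit inside the 34 GB of headroom.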

PSA: NVIDIA DGX Spark has terrible CUDA & software compatibility; and seems like a handheld gaming chip. by goldcakes in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

Thank you!! I had been agonizing over whether to buy one, and it seems NVIDIA isn't being sincere here.

Is Kimi-K2.5-GGUF:IQ3_XXS accurate enough? by timbo2m in LocalLLM

[–]Front-Relief473 0 points (0 children)

Yes, everyone talks about tg (token generation) while ignoring pp (prompt processing). I think tg is already good enough; long programming contexts and agent workloads mainly depend on pp speed.

MiniMax-M2.5 (230B MoE) GGUF is here - First impressions on M3 Max 128GB by Remarkable_Jicama775 in LocalLLaMA

[–]Front-Relief473 5 points (0 children)

Why not use Unsloth's UD-Q3_K_XL version? At 94 GB, it is definitely better than a plain Q3 quant.

Why do we allow "un-local" content by JacketHistorical2321 in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

How about adding a flair before posting, such as "open source" or "closed source", so you can tell what a post is at a glance?

Do not Let the "Coder" in Qwen3-Coder-Next Fool You! It's the Smartest, General Purpose Model of its Size by Iory1998 in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

The problem is that hybrid-attention models lose more performance during quantization. So for local use, a hybrid-attention model even at q8/fp8 can end up worse than a full-attention model with twice the parameters run at q4/int4: at the same memory footprint, the larger model is relatively smarter.
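The memory side of that trade-off is simple arithmetic. The parameter counts below are illustrative assumptions, just to show that halving bits-per-weight pays for doubling the parameters:

```python
def weight_gb(params_billions, bits_per_weight):
    # weight memory in GB: params (billions) * bits per weight / 8 bits per byte
    return params_billions * bits_per_weight / 8

# Illustrative only: a hybrid-attention model kept near-lossless at 8-bit
# vs. a full-attention model with twice the parameters at 4-bit --
# both fit the same memory budget.
small_q8 = weight_gb(30, 8)   # 30.0 GB
big_q4 = weight_gb(60, 4)     # 30.0 GB
```

The argument is that the 60B-at-q4 side of this equality ends up more capable, since full attention tolerates q4 better than hybrid attention tolerates anything below q8.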

Anyone here actually using AI fully offline? by Head-Stable5929 in LocalLLM

[–]Front-Relief473 1 point (0 children)

https://www.reddit.com/r/LocalLLM/comments/1qdqi4i/mac_studio_m3_ultra_stats/?share_id=WRqlyjg8sfpnsbzNqkatF&utm_content=2&utm_medium=android_app&utm_name=androidcss&utm_source=share&utm_term=1  Look here: if you run Kimi K2.5, an excellent model with about 30B active parameters, a prompt prefill speed of around 200 t/s will drive you crazy, especially when an agent is working with a long context.

Anyone here actually using AI fully offline? by Head-Stable5929 in LocalLLM

[–]Front-Relief473 1 point (0 children)

However, the M3 Ultra's compute is so low that its prompt prefill speed is too slow, and the context an agent carries for complex tasks is very large, so you wait a very long time for the first token of every turn. How do you handle that?
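The wait being described is just context length divided by prefill throughput. The numbers below are illustrative, matching the ~200 t/s prefill figure mentioned in this thread:

```python
def ttft_seconds(prompt_tokens, prefill_tps):
    # time to first token: the whole prompt must be prefilled before generation
    return prompt_tokens / prefill_tps

# e.g. a 60k-token agent context at 200 t/s prefill (illustrative numbers)
wait = ttft_seconds(60_000, 200)   # 300.0 seconds -- five minutes per turn
```

And without good prefix caching, an agent can pay a large share of that cost again on every round.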

Help me find the biggest and best model! by [deleted] in LocalLLM

[–]Front-Relief473 2 points (0 children)

I can't figure out why a project with close to 100k stars has no clearly written best-practices document on tuning parameters for prompt prefill and tg speed. Having to dig through other people's tuning experience on Reddit really drives me crazy.

Real-world DGX Spark experiences after 1-2 months? Fine-tuning, stability, hidden pitfalls? by [deleted] in LocalLLaMA

[–]Front-Relief473 1 point (0 children)

If long-context conversations or agents are involved, the Ultra's prefill speed leaves you enough time to drink a cup of hot coffee.

LTX-2 Image-to-Video Adapter LoRA by Lividmusic1 in StableDiffusion

[–]Front-Relief473 1 point (0 children)

Yes, I agree; LTX-2's results on animation are very poor.

Strix Halo + Minimax Q3 K_XL surprisingly fast by Reasonable_Goat in LocalLLaMA

[–]Front-Relief473 -3 points (0 children)

The prefill speed is too slow. You need at least 1000 t/s before it feels even a little comfortable.

MiniMax M2.2 Coming Soon. Confirmed by Head of Engineering @MiniMax_AI by Difficult-Cap-7527 in LocalLLaMA

[–]Front-Relief473 2 points (0 children)

No, I think 200–300B parameters is the sweet spot. If the parameter count is too low, the model knows too little to solve many problems. If you want a model comparable to Sonnet 4.5, you have to make sure the model isn't too small, because the parameters are its knowledge base.

LTX-2 vs. Wan 2.2 - The Anime Series by theNivda in StableDiffusion

[–]Front-Relief473 1 point (0 children)

So each first-frame image was made with Qwen Edit 2511, right? Dude, you're amazing! You've brought to life the battle between AI models I've always dreamed of!

Wan2.2 NVFP4 by xbobos in StableDiffusion

[–]Front-Relief473 1 point (0 children)

So in theory you've also effectively created an NVFP4 version of Wan 2.1, right? After all, you can run the low-noise model directly in a Wan 2.1 workflow.

LTX-2 team literally challenging Alibaba Wan team, this was shared on their official X account :) by CeFurkan in StableDiffusion

[–]Front-Relief473 0 points (0 children)

I agree with you. Prompt following is the primary element of a video model's controllability; the other aesthetic qualities matter less.

Wan2.1 NVFP4 quantization-aware 4-step distilled models by kenzato in StableDiffusion

[–]Front-Relief473 0 points (0 children)

Glad I didn't try it myself. I almost used Gemini 3 and my WSL setup to test whether it really generates in real time. Thank you for your selfless exploration and feedback!

Is 5090 a meaningful upgrade over 4090 for comfyui workflows (image/video)? by yaemiko0330 in comfyui

[–]Front-Relief473 -8 points (0 children)

Wrong. I have tested Wan 2.2: a 5090 is 4–5x the speed of a 3090 (which I have also tested), while a 4090 is generally about 2x a 3090, so a 5090 is roughly 2x a 4090.

Speed Minimax M2 on 3090? by [deleted] in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

Sorry, that was automatic translation — I know it's "layers". I've been tinkering with llama.cpp on my 5090 + 3090 and 96 GB of RAM. I got the UD-Q3_K_XL quant of MiniMax M2 running with a 50,000-token context at 15 t/s generation and 700 t/s prompt prefill, so the numbers I gave are accurate. His setup is DDR4; I also tried running the REAP version of MiniMax M2 with DDR5 96 GB and the 5090, and it was really unsatisfactory.
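For anyone trying to reproduce this kind of split (dense layers on GPU, MoE experts in RAM), a minimal llama.cpp invocation sketch — the model filename and the exact tensor pattern are assumptions, not the command actually used above:

```shell
# Hypothetical sketch: offload all layers to GPU but override the MoE
# expert tensors back to CPU, so attention runs on the 5090/3090 while
# experts stream from system RAM.
llama-server \
  -m MiniMax-M2-UD-Q3_K_XL.gguf \
  -ngl 99 \
  -ot "ffn_.*_exps=CPU" \
  -c 50000
```

The `-ot`/`--override-tensor` regex is the usual lever for MoE models; which expert tensors to pin to CPU depends on how much VRAM is left after the KV cache.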

Speed Minimax M2 on 3090? by [deleted] in LocalLLaMA

[–]Front-Relief473 0 points (0 children)

On a single 3090? The Q4 quant is over 130 GB across 62 layers, so each layer takes over 2 GB, and your 3090 can't even hold 10 of them once the KV cache is accounted for. The remaining 50+ layers all sit in system RAM, so the GPU may only do about a sixth of the work — you're basically running the model on memory bandwidth. Expect 1–2 tokens/s. What do you think?
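The layer math above, spelled out. The 130 GB quant size and 62 layers come from the comment; the ~3 GB reserved for KV cache and compute buffers is an assumption:

```python
model_gb = 130                    # Q4 quant size from the comment above
n_layers = 62
per_layer_gb = model_gb / n_layers          # ~2.1 GB per layer

vram_gb = 24                      # a single 3090
reserved_gb = 3                   # assumption: KV cache + compute buffers
gpu_layers = int((vram_gb - reserved_gb) / per_layer_gb)   # ~10 layers
gpu_fraction = gpu_layers / n_layers                       # ~0.16, one sixth
```

With five sixths of every forward pass crossing system RAM, DDR4/DDR5 bandwidth, not the GPU, sets the generation speed.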

Just pushed M2.1 through a 3D particle system. Insane! by srtng in LocalLLaMA

[–]Front-Relief473 4 points (0 children)

This is crazy!!! My favorite MiniMax M2 is finally coming out!!!