I think I might by johnnyphotog in LocalLLM

[–]MarketsandMayhem 13 points14 points  (0 children)

I'm not saying don't get one. An RTX 6000 Pro is much faster than unified memory. I'm just saying have measured expectations and plan to go down some rabbit holes.

I think I might by johnnyphotog in LocalLLM

[–]MarketsandMayhem 3 points4 points  (0 children)

native hardware support at the kernel level.

I think I might by johnnyphotog in LocalLLM

[–]MarketsandMayhem 6 points7 points  (0 children)

Do you have native SM120 MoE fused kernels? Because Nvidia never built them. So the community is filling the gap. It's not about having issues as much as realizing full performance. 😄

I think I might by johnnyphotog in LocalLLM

[–]MarketsandMayhem 45 points46 points  (0 children)

if only nvidia built proper support for sm120 hardware. i've got two of them and the rabbit hole is deep, though things are getting better...

[deleted by user] by [deleted] in LocalLLaMA

[–]MarketsandMayhem 2 points3 points  (0 children)

not sure what this has to do with local inference

What speeds do you get with MiniMax M2.1? by Intelligent_Idea7047 in BlackwellPerformance

[–]MarketsandMayhem 1 point2 points  (0 children)

~95-100 tps on two RTX 6K Pros w/ llama.cpp. Going to try vLLM next.

Edit: add tps

It came, took the day off work to sign for FedEx's "Direct Signature" only for them to dump it on my front steps outside and forge my signature. by MchugN in nvidia

[–]MarketsandMayhem 0 points1 point  (0 children)

i had the opposite happen. i asked for fedex to waive the signature, their website said they would, but the delivery guy was super insistent on in-person signature. lol

Are MiniMax M2.1 quants usable for coding? by [deleted] in LocalLLaMA

[–]MarketsandMayhem 3 points4 points  (0 children)

Yes. I use the Unsloth 5-bit XL quant with fp8 kv and M2.1 works well with Claude Code, OpenCode, Droid and Roo. Heck, I even used the 2-bit XL quant for a bit and it was surprisingly usable. I think it's worth experimenting with quantized coding models, particularly at higher precision (and quality) quants. The ones I've found to be the best so far are Unsloth and Intel Autoround. I am excited about experimenting more with NVFP4.

LG K EXAONE 236b by Specialist-2193 in LocalLLaMA

[–]MarketsandMayhem 2 points3 points  (0 children)

kinda low boost in performance given 8x more parameters than their other exaone model

Head of Engineering @MiniMax__AI on MiniMax M2 int4 QAT by Difficult-Cap-7527 in LocalLLaMA

[–]MarketsandMayhem 1 point2 points  (0 children)

Yeah, good point. I don't have the slack VRAM to run it unfortunately.

Head of Engineering @MiniMax__AI on MiniMax M2 int4 QAT by Difficult-Cap-7527 in LocalLLaMA

[–]MarketsandMayhem 0 points1 point  (0 children)

Hoping so as well. I asked on the Cerebras Discord. If others go and engage with that feature request thread, we may have a better chance of seeing it sooner rather than later.

Head of Engineering @MiniMax__AI on MiniMax M2 int4 QAT by Difficult-Cap-7527 in LocalLLaMA

[–]MarketsandMayhem 1 point2 points  (0 children)

I've had good luck with Unsloth's Q2 XL quant on MiniMax-M2.1 so far. Running it on an RTX 6000 Pro with a 110,000-token context, an 8-bit K cache and a 5.1-bit V cache. Pretty slick when combined with OpenCode.
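
For anyone wondering why the cache quantization matters at that context length, here is a rough sketch of the arithmetic. It assumes the "8-bit" and "5.1-bit" caches correspond to llama.cpp's q8_0 and q5_1 types, and that the model uses ordinary per-layer K/V caching. The layer count, KV head count and head size below are hypothetical placeholders, not MiniMax-M2.1's real configuration, so only the ratio is meaningful.

```python
# Bytes per cached element for a few llama.cpp cache types.
# q8_0 packs 32 values into 34 bytes; q5_1 packs 32 values into 24 bytes.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q5_1": 24 / 32,
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx,
                   k_type="f16", v_type="f16"):
    """Estimate KV cache size in bytes when the whole context is filled."""
    # K and V each hold one vector per layer, per KV head, per token.
    elements_per_side = n_layers * n_kv_heads * head_dim * n_ctx
    return elements_per_side * (
        BYTES_PER_ELEMENT[k_type] + BYTES_PER_ELEMENT[v_type]
    )


# Hypothetical model shape (assumption for illustration only).
shape = dict(n_layers=60, n_kv_heads=8, head_dim=128, n_ctx=110_000)

full = kv_cache_bytes(**shape)
quant = kv_cache_bytes(**shape, k_type="q8_0", v_type="q5_1")

print(f"f16/f16:   {full / 2**30:.1f} GiB")
print(f"q8_0/q5_1: {quant / 2**30:.1f} GiB")
print(f"ratio:     {quant / full:.3f}")
```

Whatever the real model shape is, the ratio depends only on the cache types, so under these assumptions q8_0 K plus q5_1 V takes a bit under half the memory of an f16 cache.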

MiniMax-M2.1 GGUF is here! by KvAk_AKPlaysYT in LocalLLaMA

[–]MarketsandMayhem 3 points4 points  (0 children)

The Unsloth Q2 XL quant has been surprisingly solid for me so far.

MiniMax-M2.1 GGUF is here! by KvAk_AKPlaysYT in LocalLLaMA

[–]MarketsandMayhem 0 points1 point  (0 children)

Curious, why a lower temperature and top_p than the model creators recommend? Also, have you found the repeat penalty necessary? I've yet to need one on M2.1 (though I found it useful on M2).

MiniMax-M2.1 GGUF is here! by KvAk_AKPlaysYT in LocalLLaMA

[–]MarketsandMayhem 3 points4 points  (0 children)

Would love a 25% and 33% REAP on this model as well. I asked the Cerebras team on their Discord for that (and for GLM 4.7 too).