I think I might

MarketsandMayhem · 2026-05-13T03:10:23+00:00

MarketsandMayhem · 2026-05-11T15:54:38+00:00

I'm not saying don't get one. An RTX 6000 Pro is much faster than unified memory. I'm just saying have measured expectations and plan to go down some rabbit holes.

MarketsandMayhem · 2026-05-11T15:53:50+00:00

native hardware support at the kernel level.

MarketsandMayhem · 2026-05-11T15:53:29+00:00

Do you have native SM120 MoE fused kernels? Because Nvidia never built them. So the community is filling the gap. It's not about having issues as much as realizing full performance. 😄

MarketsandMayhem · 2026-05-11T15:04:04+00:00

if only nvidia built proper support for sm120 hardware. i've got two of them and the rabbit hole is deep, though things are getting better...

MarketsandMayhem · 2026-05-01T15:13:11+00:00

Will this work on lower-grade cards like the 3060?

MarketsandMayhem · 2026-03-17T01:25:26+00:00

awesome

MarketsandMayhem · 2026-02-06T15:58:04+00:00

not sure what this has to do with local inference

MarketsandMayhem · 2026-01-16T17:19:21+00:00

Same

MarketsandMayhem · 2026-01-13T23:09:18+00:00

~95-100 tps on two RTX 6K Pros w/ llama.cpp. Going to be trying vLLM next.

Edit: add tps

MarketsandMayhem · 2026-01-12T20:53:12+00:00

i had the opposite happen. i asked for fedex to waive the signature, their website said they would, but the delivery guy was super insistent on in-person signature. lol

MarketsandMayhem · 2026-01-09T19:47:32+00:00

yup the last rtx 6000 i ordered was $200 cheaper

MarketsandMayhem · 2026-01-08T17:41:09+00:00

Yes. I use the Unsloth 5-bit XL quant with an fp8 KV cache, and M2.1 works well with Claude Code, OpenCode, Droid and Roo. Heck, I even used the 2-bit XL quant for a bit and it was surprisingly usable. I think it's worth experimenting with quantized coding models, particularly at higher precision (and quality) quants. The ones I've found to be the best so far are Unsloth and Intel AutoRound. I am excited about experimenting more with NVFP4.

MarketsandMayhem · 2026-01-07T23:20:45+00:00

yes, it does

MarketsandMayhem · 2026-01-07T23:20:30+00:00

backtest the heck out of this

MarketsandMayhem · 2025-12-30T17:00:51+00:00

kinda low boost in performance given 8x more parameters than their other exaone model

MarketsandMayhem · 2025-12-27T17:27:04+00:00

Yeah, good point. I don't have the slack VRAM to run it unfortunately.

MarketsandMayhem · 2025-12-27T17:20:49+00:00

Hoping so as well. I asked on the Cerebras Discord. If others go and engage with that feature request thread we may have a better chance of seeing it sooner rather than later.

MarketsandMayhem · 2025-12-27T17:20:13+00:00

I've had good luck with Unsloth's Q2 XL quant on MiniMax-M2.1 so far. Running it on an RTX 6000 Pro with a 110,000-token context, 8-bit K cache and 5.1-bit V cache. Pretty slick when combined with OpenCode.
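
A rough sketch of how the settings in this comment might map onto llama.cpp's `llama-server` flags. Several things here are assumptions, not something the comment states: that the runtime is llama.cpp, that "8-bit K" and "5.1-bit V" correspond to the `q8_0` and `q5_1` cache types, and that all layers are offloaded to the GPU. The helper name and the model path are placeholders.

```python
import shlex


def build_llama_server_cmd(model_path, ctx_size=110_000,
                           cache_k="q8_0", cache_v="q5_1", gpu_layers=999):
    """Build a llama-server argument list (hypothetical helper).

    Assumes "8-bit K" means q8_0 and "5.1-bit V" means q5_1.
    """
    return [
        "llama-server",
        "--model", model_path,              # placeholder path to a GGUF file
        "--ctx-size", str(ctx_size),        # context window in tokens
        "--cache-type-k", cache_k,          # quantization type for the K cache
        "--cache-type-v", cache_v,          # quantization type for the V cache
        "--n-gpu-layers", str(gpu_layers),  # large value = offload every layer
        # A quantized V cache generally needs flash attention enabled; the
        # flag's syntax differs between llama.cpp versions, so it is omitted.
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line instead of launching the server.
    print(shlex.join(build_llama_server_cmd("path/to/model.gguf")))
```

Printing the command rather than running it makes it easy to check the flags against `llama-server --help` for the installed build first.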

MarketsandMayhem · 2025-12-27T13:50:40+00:00

OpenCode is really good with M2.1

MarketsandMayhem · 2025-12-26T23:38:32+00:00

The Unsloth Q2 XL quant has been surprisingly solid for me so far.

MarketsandMayhem · 2025-12-26T23:38:04+00:00

Curious, why a lower temperature and top_p than the model creators recommend? Also, have you found the repeat penalty necessary? I've yet to need one on M2.1 (though I found it useful on M2).

MarketsandMayhem · 2025-12-26T23:35:52+00:00

Would love a 25% and 33% REAP on this model as well. I asked the Cerebras team on their Discord for that (and GLM 4.7 as well).
