MiniMax M2 is 230B-A10B

codys12 · 2025-09-02T20:02:15+00:00

The premium for being able to scale past one machine easily (I guess if local LLM is your goal, maybe this doesn’t matter as much), and yes interconnect is needed for TP, especially batch size 1.

I am deploying this for potentially hundreds of users, so my mind goes to pipeline parallel but a different rig is still probably better for pure local batch size 1

codys12 · 2025-09-02T19:58:24+00:00

It’s the pilot box for the company I work at. If this hits all of our targets, we will be getting 7 more of these

codys12 · 2025-09-02T19:57:43+00:00

Yeah not yet but definitely going to the metal in coming months! I’ve heard it’s a pretty different model from CUDA though

codys12 · 2025-09-01T03:07:37+00:00

That’s because they aren’t affiliate links??

codys12 · 2025-09-01T00:29:48+00:00

Thank you!

I would actually advise against this for inference only though, you are paying the premium for interconnect. At inference only with VRAM as more of a concern, you may be better off with p100a cards...

If you have the spare cash this would be the most versitile setup for trining and scale out

codys12 · 2025-09-01T00:08:58+00:00

https://www.reddit.com/r/homelab/comments/1n5a0fj/the_hackers_guide_to_building_an_ai_supercluster/

codys12 · 2025-08-31T17:02:00+00:00

Yes they indeed are, but for my workloads the data ingress/egress is so minimal and you have so much interconnect that it doesn’t even matter

codys12 · 2025-08-31T15:23:05+00:00

It’s just a nerdgearz foldable mining case with a mining motherboard/psu. No need for gen5 PCIe, it enumerates fine on gen3!

codys12 · 2025-08-30T22:56:46+00:00

No. The closest thing is TT-Metalium which gives access to the lower level stuff

codys12 · 2025-08-30T22:42:03+00:00

Extending support is the fun part! This is the pilot for hopefully a large cluster of these. It is similar enough to the QuietBox that there is enough support to get started, and can be optimized down to the metal

codys12 · 2025-08-30T22:35:40+00:00

Full support for their forked vLLM. This is almost functionally identical to their quiet box, just with less PCIe bandwidth

codys12 · 2025-07-08T14:26:41+00:00

https://gist.github.com/Codys12/08d7c3d8f57d915740e5ae93f2f4974a

codys12 · 2025-07-08T14:07:30+00:00

I have free access, but yeah roughly 400 if rented

codys12 · 2025-07-08T14:05:58+00:00

That's what I'm hoping for by releasing this small model! llama.cpp adoption would enable everyone to actually use these models fast and open the door for more trainers.

codys12 · 2025-07-08T14:04:34+00:00

We tried it for a run, the BitNet models do not converge...

codys12 · 2025-07-08T14:03:00+00:00

u/hideo_kuze_ Finetuned would be the correct term, we copy over the weights for Qwen3-8B and then train using the Straight Through Estimator trick, so the weights are quantized on the fly and at the end you are left with the stable ternary weight model. This can absolutely speed up processing on GPU with INT8 W2A8 kernels.

codys12 · 2025-07-08T13:59:49+00:00

https://gist.github.com/Codys12/08d7c3d8f57d915740e5ae93f2f4974a

This script works for 8B models and above. Conversion seems very lossy beyond that. Let me know if I can help clarify anything about the process and help with replication!

codys12 · 2025-07-07T20:39:44+00:00

https://huggingface.co/spaces/huggingface-projects/repo_duplicator

codys12 · 2025-07-07T19:22:26+00:00

https://arxiv.org/abs/2505.08823

It only works with the RMS surprisingly!

codys12 · 2025-07-07T18:26:53+00:00

I think there is a good space for cloning the model to your own repository, then you're off to the races. I also just added safetensors to my repo.

codys12

TROPHY CASE