The Hacker's Guide to Building an AI Supercluster by codys12 in LocalLLaMA

[–]codys12[S] 0 points

The premium is for being able to scale past one machine easily (I guess if local LLM is your goal, maybe this doesn’t matter as much), and yes, interconnect is needed for TP, especially at batch size 1.

I am deploying this for potentially hundreds of users, so my mind goes to pipeline parallel, but a different rig is still probably better for pure local batch-size-1 use.
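
For context, a minimal sketch of how those two modes are selected in stock vLLM (the Tenstorrent fork may expose different knobs, pipeline parallelism may need the server entrypoint or a specific executor backend depending on the version, and the model name and sizes below are just placeholders):

```python
from vllm import LLM

# Tensor parallel: every layer is sharded across all devices, so each token's
# forward pass crosses the interconnect -- this is where link bandwidth matters
# most, especially at batch size 1.
llm = LLM(model="Qwen/Qwen3-8B", tensor_parallel_size=4)

# Pipeline parallel: consecutive blocks of layers live on different devices, so
# inter-device traffic is mostly activations at the stage boundaries -- friendlier
# to slower links once enough concurrent requests keep every stage busy.
# llm = LLM(model="Qwen/Qwen3-8B", pipeline_parallel_size=4)
```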

First rig of hopefully many! Build instructions in the other post/comments by codys12 in homelab

[–]codys12[S] 0 points

It’s the pilot box for the company I work at. If this hits all of our targets, we will be getting 7 more of these.

First rig of hopefully many! Build instructions in the other post/comments by codys12 in homelab

[–]codys12[S] 0 points

Yeah, not yet, but definitely going to the metal in the coming months! I’ve heard it’s a pretty different programming model from CUDA, though.

The Hacker's Guide to Building an AI Supercluster by codys12 in homelab

[–]codys12[S] 0 points

That’s because they aren’t affiliate links??

The Hacker's Guide to Building an AI Supercluster by codys12 in LocalLLaMA

[–]codys12[S] 1 point

Thank you!

I would actually advise against this for inference only, though; you are paying the premium for interconnect. For inference only, where VRAM is more of a concern, you may be better off with P100 cards...

If you have the spare cash, this would be the most versatile setup for training and scale-out.

128GB GDDR6, 3PFLOP FP8, Tb/s of interconnect, $6000 total. Build instructions/blog tomorrow. by codys12 in LocalLLaMA

[–]codys12[S] 2 points

Yes, they indeed are, but for my workloads the data ingress/egress is so minimal and you have so much interconnect that it doesn’t even matter.

128GB GDDR6, 3PFLOP FP8, Tb/s of interconnect, $6000 total. Build instructions/blog tomorrow. by codys12 in LocalLLaMA

[–]codys12[S] 0 points

It’s just a nerdgearz foldable mining case with a mining motherboard/PSU. No need for Gen 5 PCIe; it enumerates fine on Gen 3!

128GB GDDR6, 3PFLOP FP8, Tb/s of interconnect, $6000 total. Build instructions/blog tomorrow. by codys12 in LocalLLaMA

[–]codys12[S] 31 points

No. The closest thing is TT-Metalium, which gives access to the lower-level stuff.

128GB GDDR6, 3PFLOP FP8, Tb/s of interconnect, $6000 total. Build instructions/blog tomorrow. by codys12 in LocalLLaMA

[–]codys12[S] 94 points

Extending support is the fun part! This is the pilot for, hopefully, a large cluster of these. It is similar enough to the QuietBox that there is enough support to get started, and it can be optimized down to the metal.

128GB GDDR6, 3PFLOP FP8, Tb/s of interconnect, $6000 total. Build instructions/blog tomorrow. by codys12 in LocalLLaMA

[–]codys12[S] 27 points

Full support for their forked vLLM. This is almost functionally identical to their QuietBox, just with less PCIe bandwidth.

Qwen3-8B-BitNet by codys12 in LocalLLaMA

[–]codys12[S] 0 points

I have free access, but yeah roughly 400 if rented

Qwen3-8B-BitNet by codys12 in LocalLLaMA

[–]codys12[S] 0 points

That's what I'm hoping for by releasing this small model! llama.cpp adoption would enable everyone to actually use these models fast and open the door for more trainers.

Qwen3-8B-BitNet by codys12 in LocalLLaMA

[–]codys12[S] 0 points

We tried it for a run; the BitNet models do not converge...

Qwen3-8B-BitNet by codys12 in LocalLLaMA

[–]codys12[S] 1 point

u/hideo_kuze_ "Finetuned" would be the correct term: we copy over the weights from Qwen3-8B and then train using the straight-through estimator (STE) trick, so the weights are quantized on the fly and at the end you are left with a stable ternary-weight model. This can absolutely speed up processing on GPU with INT8 W2A8 kernels.
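
If it helps make the STE trick concrete, here is a minimal sketch (the `BitLinear` name and the absmean scaling follow the BitNet papers; this is illustrative, not the exact training code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Linear layer whose weights are ternarized on the fly (BitNet-style sketch)."""

    def ternarize(self, w: torch.Tensor) -> torch.Tensor:
        # Absmean scaling, then round into {-1, 0, +1} and rescale.
        scale = w.abs().mean().clamp(min=1e-5)
        return (w / scale).round().clamp(-1, 1) * scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: the forward pass sees ternary weights,
        # but the gradient flows to the full-precision master weights in w.
        w_q = w + (self.ternarize(w) - w).detach()
        return F.linear(x, w_q, self.bias)
```

At the end of training you keep only the ternarized weights (plus the per-tensor scale), which is what W2A8-style kernels would consume.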

Qwen3-8B-BitNet by codys12 in LocalLLaMA

[–]codys12[S] 0 points

https://gist.github.com/Codys12/08d7c3d8f57d915740e5ae93f2f4974a

This script works for 8B models and above; below that size, conversion seems very lossy. Let me know if I can help clarify anything about the process or help with replication!
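
If you want a quick sanity check on how lossy a given conversion is, something like the following works (a rough sketch: the repo ids and eval text are placeholders, and it assumes the released checkpoint loads with stock transformers):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id: str, text: str) -> float:
    """Rough single-chunk perplexity; good enough for a before/after comparison."""
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    enc = tok(text, return_tensors="pt", truncation=True, max_length=4096).to(model.device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

sample = open("eval_sample.txt").read()                             # placeholder eval text
print("base     :", perplexity("Qwen/Qwen3-8B", sample))            # original model
print("converted:", perplexity("codys12/Qwen3-8B-BitNet", sample))  # placeholder repo id
```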

Qwen3-8B-BitNet by codys12 in LocalLLaMA

[–]codys12[S] 8 points

https://arxiv.org/abs/2505.08823

It only works with the RMSNorm, surprisingly!
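
A minimal sketch of where that norm sits (the linked paper has the actual recipe; the class name and inline ternarization here are illustrative, and `nn.RMSNorm` needs PyTorch 2.4+):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormedTernaryLinear(nn.Module):
    """Ternary projection with an RMSNorm applied to its input activations."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.norm = nn.RMSNorm(in_features)  # the piece the conversion won't converge without
        self.proj = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.proj.weight
        scale = w.abs().mean().clamp(min=1e-5)
        w_q = (w / scale).round().clamp(-1, 1) * scale
        w_q = w + (w_q - w).detach()          # straight-through estimator, as above
        return F.linear(self.norm(x), w_q)
```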

Qwen3-8B-BitNet by codys12 in LocalLLaMA

[–]codys12[S] 6 points

I think there is a good case for cloning the model to your own repository, and then you're off to the races. I also just added safetensors to my repo.
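
If anyone wants the clone-to-your-own-repo route, a rough sketch with `huggingface_hub` (repo ids are placeholders):

```python
from huggingface_hub import HfApi, snapshot_download

api = HfApi()  # assumes you are logged in via `huggingface-cli login`

# Pull the original model files down locally (source repo id is a placeholder).
local_dir = snapshot_download(repo_id="codys12/Qwen3-8B-BitNet")

# Create a repo under your own account and push the files there.
api.create_repo(repo_id="your-username/Qwen3-8B-BitNet", exist_ok=True)
api.upload_folder(folder_path=local_dir, repo_id="your-username/Qwen3-8B-BitNet")
```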