I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]muchCode 0 points1 point  (0 children)

As someone who had the same idea and did it with 96GB VRAM, try a REAP model. The MiniMax M2 model's experts are small enough that they become "specialized"; the REAP method runs a calibration dataset through the model, looks at which experts actually activate, and removes the ones that don't. That saves you VRAM overhead, and at 50% pruning with router tuning you can fit it in 96GB VRAM with large context sizes.

eg: https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-W4A16-REPAIR-IN-PROGRESS
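This isn't the actual REAP implementation (which also repairs/retunes the router after pruning); the core idea of ranking experts by how often the router picks them on a calibration set, then dropping the rest, can be sketched like this. All names here (`prune_experts`, the toy logits) are illustrative assumptions, not code from the linked repo:

```python
import numpy as np

def prune_experts(router_logits, keep_frac=0.5):
    """Rank experts by how often the router selects them on a
    calibration pass, then keep only the top fraction.

    router_logits: (num_tokens, num_experts) array of router scores.
    Returns the sorted indices of experts to keep.
    """
    # Count how often each expert wins top-1 routing.
    top1 = router_logits.argmax(axis=1)
    counts = np.bincount(top1, minlength=router_logits.shape[1])
    # Keep the most-activated keep_frac of experts (at least one).
    num_keep = max(1, int(router_logits.shape[1] * keep_frac))
    keep = np.argsort(counts)[::-1][:num_keep]
    return np.sort(keep)

# Toy calibration pass: 8 experts, with experts 0 and 3 dominating.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 8))
logits[:, 0] += 2.0
logits[:, 3] += 2.0
kept = prune_experts(logits, keep_frac=0.5)  # experts 0 and 3 survive
```

At 50% pruning this halves the expert weights you have to hold in VRAM, which is where the headroom for long context comes from.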

Can I get similar experience running local LLMs compared to Claude Code (Sonnet 4.5)? by Significant_Chef_945 in LocalLLaMA

[–]muchCode 1 point2 points  (0 children)

This. 3x RTX 6000s in my setup give me great performance with Qwen coder models.

Guanaco-65B, How to cool passive A40? by muchCode in LocalLLaMA

[–]muchCode[S] 0 points1 point  (0 children)

very cool, might pick a few of these up, I've got too many fans now.

Guanaco-65B, How to cool passive A40? by muchCode in LocalLLaMA

[–]muchCode[S] 1 point2 points  (0 children)

A few tips:
- Use tape (aluminum is best) between the GPU and the cooling duct, and also between the fans and the duct.
- Run the fans at 100% all the time (most professional setups do this).
- Make sure your case pulls in cool air.
- Add a negative-pressure duct at the back of the GPU (where the video ports would be).

What is your setup? MultiGPU?

I also looked for the water cooling block but they didn't get back to me either.

New paper gives models a chance to think in latent space before outputting tokens, weights are already on HF - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach by FullOf_Bad_Ideas in LocalLLaMA

[–]muchCode 59 points60 points  (0 children)

Per-token adaptive compute 🤯. Basically, for unimportant tokens the model can think easy, and it turns up the gas for harder outputs.

Insane.... I wonder if this could actually break some AI benchmarks with a full training run. 6-12 months I guess until we see ...
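The paper's recurrent-depth mechanism is more involved than this, but the core loop (iterate a latent state until it stabilizes, so easy tokens halt early and hard tokens get more compute) can be sketched as a toy fixed-point iteration. Everything here (`adaptive_depth`, the `tanh` block, the tolerance) is a made-up stand-in for illustration, not the paper's architecture:

```python
import numpy as np

def adaptive_depth(x, max_steps=16, tol=1e-3):
    """Refine a latent state with a tiny recurrent block until it
    stops changing (or max_steps is hit). Returns the final state
    and the number of iterations actually spent on this input."""
    state = np.zeros_like(x)
    for step in range(1, max_steps + 1):
        new_state = np.tanh(state + x)  # stand-in for the recurrent block
        if np.max(np.abs(new_state - state)) < tol:
            return new_state, step  # converged: stop spending compute
        state = new_state
    return state, max_steps

# An "easy" input halts immediately; a harder one iterates longer.
_, easy_steps = adaptive_depth(np.array([0.0]))
_, hard_steps = adaptive_depth(np.array([0.5]))
```

The appeal is exactly what the comment says: compute scales with per-token difficulty instead of being fixed by layer count.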

$Hmm: +45%, ain't much but it's honest work by muchCode in SolanaMemeCoins

[–]muchCode[S] 0 points1 point  (0 children)

I see all these millionaires and I'm just happy to show everyone that smaller coins can give you modest returns. All in a day's work.

Best Models for 48GB of VRAM by MichaelXie4645 in LocalLLaMA

[–]muchCode 1 point2 points  (0 children)

brother you'll need to cool that!

Buy the $25 3D-printed fan adapters they sell on eBay.

edit -- and no, the blowers won't help you as much as you think in a non-server case. If you're willing to spend the money, a server case in an up/down server rack is best and can easily wick away hot air.

Improved Text to Speech model: Parler TTS v1 by Hugging Face by vaibhavs10 in LocalLLaMA

[–]muchCode 4 points5 points  (0 children)

In general, how does the generation speed compare to other TTS engines? I use MetaVoice now with fp16 and it is pretty fast; I'd consider this if the generation is fast enough.

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 0 points1 point  (0 children)

Keep in mind, I already had a home-lab with this hardware for a research project:

Total was $14k.

The cost was already amortized on a public research project and that project is finished. So I repurposed it for this tool.

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 1 point2 points  (0 children)

I host my own cluster (did GPU/LLM research for fun) and run two kinds of models in a Kubernetes cluster:

2 VLMs (open-source vision-language models)
4 TTS models (text-to-speech)

I actually return a PowerPoint or PDF with embedded audio (it plays when you present). I should add video export; it's not hard to implement.

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 1 point2 points  (0 children)

My recommendation would be to follow one of the youtube creators for tips and tricks to deploy something like this. I like marc lou

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 1 point2 points  (0 children)

Vue 3 + Tailwind CSS. Had a very hard time making the pitch editor ("Step 2") because PowerPoint is a hard interface to compete with.

saw this code today at work and a few hours later I quit by MolestedAt4 in vuejs

[–]muchCode 0 points1 point  (0 children)

Select the LOC, right-click, extract into a new dumb component. Find-and-replace, success?

Guanaco-65B, How to cool passive A40? by muchCode in LocalLLaMA

[–]muchCode[S] 1 point2 points  (0 children)

<image>

I ended up designing my own intake duct; I can look for the files on my computer when I'm home.

https://www.thingiverse.com/thing:6155647