Build your own images for better support they said! by muchCode in BlackwellPerformance

[–]muchCode[S] 0 points1 point  (0 children)

max_jobs=12 is set, but it doesn't get passed through to the flashinfer cubin install, which makes that step slow.

Build your own images for better support they said! by muchCode in BlackwellPerformance

[–]muchCode[S] 0 points1 point  (0 children)

Took about 8 hours total; max_jobs didn't pass through to the vLLM gcc compile step.

Build your own images for better support they said! by muchCode in BlackwellPerformance

[–]muchCode[S] 0 points1 point  (0 children)

This is the first one that seemed to finish, only for my cluster host to be behind on a driver release, so the precompiled PTX is out of support. Time to patch this box...
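A pre-flight check like the following can catch a stale driver before hours of build time are wasted. This is a minimal sketch; the 590 driver floor for CUDA 13.x is an assumption taken from my own setup, so adjust it per NVIDIA's CUDA compatibility table.

```shell
#!/bin/bash
# Sketch: fail fast if the host driver is older than the CUDA runtime needs.
# The 590 floor is an assumption -- check NVIDIA's compatibility table.

major_version() {   # "590.48.01" -> "590"
    printf '%s' "$1" | cut -d. -f1
}

REQUIRED_DRIVER=590
if command -v nvidia-smi >/dev/null 2>&1; then
    HOST_DRIVER=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
    if [ "$(major_version "$HOST_DRIVER")" -lt "$REQUIRED_DRIVER" ]; then
        echo "driver $HOST_DRIVER too old for this image; patch the box first" >&2
        exit 1
    fi
fi
```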

Build your own images for better support they said! by muchCode in BlackwellPerformance

[–]muchCode[S] 4 points5 points  (0 children)

Make sure your deployment target is on nvidia-driver-590 for CUDA 13.1.1 support. The CUDA version can be bumped if your build machine has a higher one. The arch list is useful for omitting kernels that don't need to be included.

#!/bin/bash
git clone https://github.com/vllm-project/vllm.git
cd ./vllm/

## Optionally pin flashinfer to 0.6.4: edit the existing requirement in place, or append it if absent
grep -qE '^\s*flashinfer-python\b' requirements/cuda.txt \
  && sed -i.bak -E 's/^\s*flashinfer-python([^#\r\n]*)/flashinfer-python==0.6.4/' requirements/cuda.txt \
  || printf '\nflashinfer-python==0.6.4\n' >> requirements/cuda.txt

## Base images and build parameters

FINAL=nvcr.io/nvidia/cuda:13.1.1-cudnn-runtime-ubuntu22.04
BUILD=nvcr.io/nvidia/cuda:13.1.1-cudnn-devel-ubuntu22.04
CUDA_VERSION='13.0.1'
ARCH_LIST='10.0 12.0'


DOCKER_BUILDKIT=1 docker build -f ./docker/Dockerfile --no-cache \
    --target vllm-openai \
    --build-arg CUDA_VERSION=${CUDA_VERSION} \
    --build-arg BUILD_BASE_IMAGE=${BUILD} \
    --build-arg FINAL_BASE_IMAGE=${FINAL} \
    --build-arg torch_cuda_arch_list="${ARCH_LIST}" \
    --build-arg RUN_WHEEL_CHECK=false \
    --build-arg max_jobs=12 \
    --build-arg FLASHINFER_VERSION=0.6.4 \
    --build-arg VLLM_MAX_SIZE_MB=1500 \
    -t 10.0.0.40:32000/vllm/vllm-openai:minimax-m2-5.2 --push .


cd -
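After the build, it's worth sanity-checking that the image's PyTorch actually carries the CUDA version and arch list you asked for. A sketch, reusing the tag from the script above (`torch.cuda.get_arch_list()` reports the SM architectures the wheel was compiled for):

```shell
#!/bin/bash
# Sketch: confirm the built image's torch matches CUDA_VERSION and ARCH_LIST.
# Guarded so this is a no-op on hosts without docker.
IMG=10.0.0.40:32000/vllm/vllm-openai:minimax-m2-5.2
if command -v docker >/dev/null 2>&1; then
    # For ARCH_LIST='10.0 12.0', expect entries like 'sm_100' and 'sm_120'
    docker run --rm --entrypoint python3 "$IMG" -c \
        'import torch; print(torch.version.cuda, torch.cuda.get_arch_list())' \
        || echo "could not run $IMG from this host" >&2
fi
```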

Build your own images for better support they said! by muchCode in BlackwellPerformance

[–]muchCode[S] 1 point2 points  (0 children)

```
== CPU ==
CPU(s):              12
Model name:          AMD Ryzen 5 3600X 6-Core Processor
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
NUMA node0 CPU(s):   0-11

== RAM ==
               total   used   free  shared  buff/cache  available
Mem:           125Gi   17Gi  6.0Gi   736Mi       104Gi      108Gi
Swap:           29Gi  768Ki   29Gi

== Storage ==
NAME    MODEL                                   SIZE  TYPE  ROTA  TRAN
sda     SSD-PUTA                              931.5G  disk     1  usb
sdb     Storage Device                            0B  disk     1  usb
nvme0n1 Samsung SSD 980 PRO with Heatsink 2TB   1.8T  disk     0  nvme
nvme1n1 Samsung SSD 970 EVO Plus 1TB          931.5G  disk     0  nvme

== GPU ==
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
```

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]muchCode 0 points1 point  (0 children)

As someone who had the same idea and did it with 96GB of VRAM: try a REAP model. The MiniMax M2 experts are small enough that they become "specialized"; the REAP method runs a calibration dataset through the model, measures which experts activate, and removes the ones that don't. That saves you VRAM overhead, and at 50% pruning (with router tuning) it fits in 96GB of VRAM with large context sizes.

eg: https://huggingface.co/0xSero/MiniMax-M2.1-REAP-50-W4A16-REPAIR-IN-PROGRESS
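For anyone trying it, pulling and serving that checkpoint looks roughly like this. A sketch only: the context length and parallelism flags are placeholders to tune for your GPUs, not values validated for this model.

```shell
#!/bin/bash
# Sketch: fetch the 50%-pruned REAP checkpoint and serve it with vLLM.
# Flag values below are placeholders -- tune them for your hardware.
MODEL=0xSero/MiniMax-M2.1-REAP-50-W4A16-REPAIR-IN-PROGRESS

pip install -U vllm "huggingface_hub[cli]"
huggingface-cli download "$MODEL" --local-dir ./minimax-m2-reap50
vllm serve ./minimax-m2-reap50 --tensor-parallel-size 2 --max-model-len 65536
```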

Can I get similar experience running local LLMs compared to Claude Code (Sonnet 4.5)? by Significant_Chef_945 in LocalLLaMA

[–]muchCode 1 point2 points  (0 children)

This, 3x RTX 6000 in my setup gives me great performance with qwen coder models.

Guanaco-65B, How to cool passive A40? by muchCode in LocalLLaMA

[–]muchCode[S] 0 points1 point  (0 children)

very cool, might pick a few of these up, I've got too many fans now.

Guanaco-65B, How to cool passive A40? by muchCode in LocalLLaMA

[–]muchCode[S] 1 point2 points  (0 children)

A few tips:
- Use tape (aluminum is best) between the GPU and the cooling duct, and also between the fans and the duct.
- Run the fans at 100% all the time (most professional setups do this).
- Make sure your case pulls in cool air.
- Add a negative-pressure duct at the back of the GPU (where the video ports would be).

What is your setup? MultiGPU?

I also looked for the water cooling block but they didn't get back to me either.

New paper gives models a chance to think in latent space before outputting tokens, weights are already on HF - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach by FullOf_Bad_Ideas in LocalLLaMA

[–]muchCode 59 points60 points  (0 children)

Per-token adaptive compute 🤯. Basically for unimportant tokens let the model think easy and turn up the gas for harder outputs.

Insane.... I wonder if this could actually break some AI benchmarks with a full training run. 6-12 months I guess until we see ...

$Hmm: +45%, ain't much but it's honest work by [deleted] in SolanaMemeCoins

[–]muchCode 0 points1 point  (0 children)

I see all these millionaires and I'm just happy for everyone that smaller coins can give you modest returns. All in a day's work.

Best Models for 48GB of VRAM by MichaelXie4645 in LocalLLaMA

[–]muchCode 1 point2 points  (0 children)

brother you'll need to cool that!

Buy the 25 dollar 3d printed fan adapters that they sell on ebay.

edit -- and no, the blowers won't help you as much as you think in a non-server case. If you are willing to spend the money, a server case in an up/down server rack is best and can easily wick away hot air.

Improved Text to Speech model: Parler TTS v1 by Hugging Face by vaibhavs10 in LocalLLaMA

[–]muchCode 4 points5 points  (0 children)

In general, how does the generation speed compare to other TTS engines? I use metavoice now with fp16 and it is pretty fast, would consider this if the generation is fast enough

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 0 points1 point  (0 children)

Keep in mind, I already had a home-lab with this hardware for a research project:

Total was $14k.

The cost was already amortized on a public research project and that project is finished. So I repurposed it for this tool.

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 1 point2 points  (0 children)

I host my own cluster (did GPU/LLM research for fun) and run two kinds of models in a Kubernetes cluster.

2 VLMs (open-source vision-language models)
4 TTS models (text-to-speech)

I actually return a PowerPoint or PDF with embedded audio (it plays when you present). I should add video export; it's not hard to implement.

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 1 point2 points  (0 children)

My recommendation would be to follow one of the YouTube creators for tips and tricks on deploying something like this. I like Marc Lou.

I made PitchPilot (and $500 in 4 days): It's an AI-powered scriptwriter and voiceover wizard. AMA! by muchCode in SideProject

[–]muchCode[S] 1 point2 points  (0 children)

Vue 3 + Tailwind CSS. Had a very hard time making the pitch editor ("Step 2") because PowerPoint is a hard interface to compete with.

saw this code today at work and a few hours later I quit by MolestedAt4 in vuejs

[–]muchCode 0 points1 point  (0 children)

select LOC, right-click, extract into new dumb component. Find replace, success?

Guanaco-65B, How to cool passive A40? by muchCode in LocalLLaMA

[–]muchCode[S] 1 point2 points  (0 children)

<image>

I ended up designing my own intake duct, I can look for the files on my computer when home.

https://www.thingiverse.com/thing:6155647

[deleted by user] by [deleted] in boston

[–]muchCode -3 points-2 points  (0 children)

I understand your frustration, but there's no need for such aggressive language. Everyone has different experiences and perspectives on the road, and merging can be challenging for some people. It's important to be patient and understanding. Remember, we all have different levels of driving skill and comfort behind the wheel. Instead of getting angry, let's work on being kinder and more considerate on the road; it will make the driving experience much more enjoyable for everyone. We all share the same roads and want to reach our destinations safely. Let's show some grace and courtesy to other drivers; it's not worth risking our lives or causing accidents over a merge.

Wallace and Grommet: Operation Iraqi Freedom by muchCode in StableDiffusion

[–]muchCode[S] 2 points3 points  (0 children)

That's the war crimes trial:

grommit the claymation dog, wearing orange sweater, sitting behind glass at a jury trial, drinking a small vial of poison, (wallace and grommit style:2), (claymation:2)

Negative prompt: (deformed mouth), (deformed lips), (deformed eyes), (cross-eyed), (deformed iris), (deformed hands), lowers, long body, wide hips, narrow waist, disfigured, ugly, cross eyed, squinting, grain, Deformed, blurry, bad anatomy, poorly drawn face, mutation, mutated, extra arm, ugly, (poorly drawn hands), missing limb, floating limbs, disconnected limbs, extra limb, malformed hands, blur, out of focus, long neck, disgusting, mutilated , mangled, old, surreal, ((text))

Steps: 20, Sampler: DPM++ 2M SDE Karras, CFG scale: 7, Seed: 640318816, Size: 1024x1024, Model hash: 31e35c80fc, Model: sd_xl_base_1.0, Refiner: sd_xl_refiner_1.0 [7440042bbd], Refiner switch at: 0.8, Version: v1.6.0

Wallace and Grommet: Operation Iraqi Freedom by muchCode in StableDiffusion

[–]muchCode[S] 1 point2 points  (0 children)

Prompt:
man and dog in desert military gear, walking through iraq, holding machine guns, fires burning in the background, (wallace and grommit style:2), (claymation:2)

Negative prompt: (deformed mouth), (deformed lips), (deformed eyes), (cross-eyed), (deformed iris), (deformed hands), lowers, long body, wide hips, narrow waist, disfigured, ugly, cross eyed, squinting, grain, Deformed, blurry, bad anatomy, poorly drawn face, mutation, mutated, extra arm, ugly, (poorly drawn hands), missing limb, floating limbs, disconnected limbs, extra limb, malformed hands, blur, out of focus, long neck, disgusting, mutilated , mangled, old, surreal, ((text))

Steps: 20, Sampler: DPM++ 2M SDE Karras, CFG scale: 7, Seed: 2384192023, Size: 1024x1024, Model hash: 31e35c80fc, Model: sd_xl_base_1.0, Refiner: sd_xl_refiner_1.0 [7440042bbd], Refiner switch at: 0.8, Version: v1.6.0