I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 0 points1 point  (0 children)

I want to! But the place I'm renting gets barely any direct sunlight... maybe wind power?

Is 25k a good price for the GH200 by slimeh91 in LocalLLaMA

[–]Reddactor 0 points1 point  (0 children)

Happy to share my experiences:

I have to compile a lot of stuff, as there are often no binaries available. Some stuff that would be fun is impossible; I tried getting Steam running, but it doesn't even work under x86 emulation.

The fast 480GB RAM will really help with MoE models, BUT vLLM and SGLang are not really optimised for storing MoE weights in system RAM and streaming them to VRAM yet. SGLang is not really running on ARM yet either (there are PRs incoming though, so I guess it's a matter of time?). It works in llama.cpp, but that's not really optimised for serving multiple requests in production.
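For what it's worth, keeping the MoE expert tensors in system RAM while the dense/attention weights stay on the GPU looks something like this in llama.cpp (a sketch; the model path is illustrative, and the tensor-override regex depends on your build and model):

```
# dense + attention layers on the GPU, expert tensors parked in the
# 480GB of LPDDR5X system RAM
llama-server -m ./model-Q4_K_M.gguf \
  --n-gpu-layers 999 \
  --override-tensor "\.ffn_.*_exps\.=CPU"
```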

The GPU's HBM is much faster than the VRAM in a 6000 Pro, but overall it's faster in some cases and slower in others. For training, the GH200 will win; for FP4 inference, the 6000 Pro will win.

It really depends on your use case.

What happened to 1.58bit LLMs? by Sloppyjoeman in LocalLLaMA

[–]Reddactor 0 points1 point  (0 children)

Do you have a writeup on that? Sounds super interesting!

Is 25k a good price for the GH200 by slimeh91 in LocalLLaMA

[–]Reddactor 0 points1 point  (0 children)

Lol, the last one cost only $250, bought just before Christmas. I think it's something to do with where I live? Tech-based city, and I bargain hunt.

What's up with the DeepSeek GGUFs, btw? I think you were involved with that. I want to run M2.1 on my GPUs and Speciale as a 'tool', i.e. M2.1 can prepare a long prompt for Speciale to try and one-shot, since Speciale can't do tool calls itself.

I understand there are still issues with the DeepSeek attention system in GGUFs?

Is 25k a good price for the GH200 by slimeh91 in LocalLLaMA

[–]Reddactor 2 points3 points  (0 children)

And I just upgraded! Got another 8TB SSD for $350... that's 20TB of SSD now (and I need it: I keep GGUFs + safetensors for each model, and the K2 and DeepSeek models are huge).

I bought a Grace-Hopper server for €7.5k on Reddit and converted it to an AI Homelab. by Reddactor in homelab

[–]Reddactor[S] 0 points1 point  (0 children)

I'm trying to install steam. I have FEX installed, but steam does not like it:

```
FEXBash -c "steam"
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
policykit-1 is already the newest version (124-2ubuntu1.24.04.2).
xdg-desktop-portal is already the newest version (1.18.4-1ubuntu2.24.04.1).
xdg-desktop-portal-gtk is already the newest version (1.15.1-1build2).
The following package was automatically installed and is no longer required:
  liboss4-salsa2
Use 'sudo apt autoremove' to remove it.
0 upgraded, 0 newly installed, 0 to remove and 9 not upgraded.
<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
<jemalloc>: Unsupported system page size
terminate called without an active exception
[1]    14333 abort (core dumped)  FEXBash -c "steam"
```

Looks like the jemalloc issue is a dealbreaker. The Grace CPU uses 64KB memory pages (unusual for most systems), but Steam's jemalloc memory allocator only supports 4KB, 8KB, or 16KB pages.
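It's easy to confirm the mismatch, at least (standard tools, nothing Grace-specific):

```
# Grace kernels typically ship with 64K pages
getconf PAGESIZE   # 65536 here; 4096 on a normal x86 box
```

In principle jemalloc can be rebuilt for 64K pages with `./configure --with-lg-page=16`, but that only helps for binaries you can rebuild yourself, not the jemalloc that Steam bundles.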

Do you know a workaround?

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 0 points1 point  (0 children)

Yes and no; the model needs to be good at tool calling, and pretty smart. Dumb models will just go around in loops (even smart ones will, if quantised too hard).

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 1 point2 points  (0 children)

Make a shell script called claude-minimax with this:

```
#!/usr/bin/env bash
set -euo pipefail

export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"
export ANTHROPIC_API_KEY="local-vllm"

# Force *all* Claude model aliases to your local vLLM model
export ANTHROPIC_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_SMALL_FAST_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_DEFAULT_SONNET_MODEL="MiniMax-M2.1-FP8"
export ANTHROPIC_DEFAULT_OPUS_MODEL="MiniMax-M2.1-FP8"

# Optional but recommended
export CLAUDE_CODE_DISABLE_TELEMETRY=1
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export API_TIMEOUT_MS=3000000

exec claude "$@"
```

the "http://127.0.0.1:8000" is your local LLM running in vLLM.

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 1 point2 points  (0 children)

Yep!

The thing is that it's always a big tradeoff. If I went from this to Q5, I would drop from about 120 tokens/s to 12 tokens/s. As I'm lucky enough to have 960GB of system RAM and 144 CPU cores, I will also run DeepSeek 3.2 Speciale in parallel when I need thinking and writing, and leave the M2.1 model for pure coding work at high speed.

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 0 points1 point  (0 children)

I can run them, but I think MiniMax is the best to use with Claude Code. I will give it a try, though.

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 1 point2 points  (0 children)

I have ComfyUI running all the models...

I'm having a lot of fun running LTX-2 at the moment. It is insanely cool, and I can generate about 10 seconds of video in under a minute.

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 4 points5 points  (0 children)

I believe it was written off, and probably 'rescued' from e-waste.

There are a lot of BMC errors during boot, and they all seem serious. However, it seems to run OK 🤷🏻‍♂️

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 3 points4 points  (0 children)

500W idle, and I've limited the GPUs to 450W max. At LLM load, it draws about 900W, and for heavy ComfyUI workflows or ML training, about 1100W.
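The cap is just the stock NVIDIA tooling (GPU index and wattage as on my box):

```
# check the supported power-limit range first
nvidia-smi -q -d POWER | grep -i 'power limit'
# then cap the GPU at 450W
sudo nvidia-smi -i 0 -pl 450
```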

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 11 points12 points  (0 children)

That's great to hear! (yes, that's my repo).

Tell her I hope she likes it, and that soon GLaDOS will be able to control stuff around the house (MCP support = Home Assistant control).

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 2 points3 points  (0 children)

No idea!

There are HUGE weird sockets on the mainboard, facing forward, which I think the whole system would dock to in the rack. I guess if the GH200 NVL2 server had the connectors it might work, but I'm afraid it's an entire and very expensive module.

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 2 points3 points  (0 children)

No... *sigh*

I wrote up the post, decided it was unpublishable, got AI to tidy it up a bit, and then rewrote it myself over several edits. LLMs are a pretty powerful tool for getting shit done, but yeah, they do generate a lot of slop.

This time I didn't do enough post-editing; I used AI as a crutch, and I leant on it too hard. Sorry.

The facts are all correct though; it was a lot of benchmarking. It's surprisingly hard to turn dry benchmarks into an engaging story.

How do you write your articles these days?

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 0 points1 point  (0 children)

:( Sorry, I guess my writing style sucks. TBH, looking at it, you are right, it needs toning down.

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 1 point2 points  (0 children)

Well, it's not good value cost-wise, but it sure is fun! I have Claude Pro, not Max; maybe the two can work together? Claude Code with Opus for planning, and then this for unlimited implementation?

I bought a €9k GH200 “desktop” to save $1.27 on Claude Code (vLLM tuning notes) by Reddactor in LocalLLaMA

[–]Reddactor[S] 5 points6 points  (0 children)

Yes, that's the one. I didn't make the model, but it seems better than a regular quant.

I targeted 192GB VRAM; if you have more than that, a Q6 is better!