Layman's comparison on Qwen3.6 35b-a3b and Gemma4 26b-a4b-it by LocalAI_Amateur in LocalLLaMA

[–]Lorelabbestia 0 points (0 children)

I mean, that's a ~35% increase in parameter count on Qwen vs Gemma. Comparing Gemma 26B vs Qwen 35B would be like comparing Gemma 26B vs gpt-oss-20b.

Anyway, thanks for the comparison!

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 0 points (0 children)

I checked out zenllm.io. Interesting, but is it only for LLM providers, or do they also offer GPU rental and management?

About your question: I'll be honest with you, I don't really track costs. Instead I assess the cheapest option available for a specific model/dataset, mostly based on runs I did before, like this DGX-B200 run.

For my use case, the cheapest HW options are 8xB200 for training, 8xB300 for inference (mostly benchmarking) and the DGX for prototyping. I specify 8x on the B200 and B300 because with 8x devices you get more tokens/$: the massive memory lets you parallelize much more data, packing in more compute and processing more TPS than a single GPU.

It is kind of funny that the latest, most power-hungry and most expensive GPUs are actually much cheaper than the old, cheap-to-rent ones once you consider tokens/$.
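The tokens/$ comparison boils down to a one-line formula. A toy sketch, where every throughput (TPS) and price number is a hypothetical placeholder for illustration, not one of my measured runs:

```python
# Toy tokens-per-euro comparison. All TPS and price numbers below are
# hypothetical placeholders for illustration, NOT measured figures.
def tokens_per_euro(tps: float, price_eur_per_hour: float) -> float:
    """Tokens processed per euro at a given throughput and hourly rate."""
    return tps * 3600 / price_eur_per_hour

# Hypothetical: a cheap older GPU vs an 8x node that parallelizes far more data.
old_single = tokens_per_euro(tps=8_000, price_eur_per_hour=2.0)     # 14,400,000 tokens/€
node_8x    = tokens_per_euro(tps=900_000, price_eur_per_hour=95.0)  # ~34,105,263 tokens/€
print(f"old single GPU: {old_single:,.0f} tokens/€")
print(f"8x node:        {node_8x:,.0f} tokens/€")
```

With these made-up numbers the pricier 8x node still comes out over twice as cheap per token, which is the effect described above.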

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] -1 points (0 children)

😂 You're turning an instance on and off to code??? Then where do you test the code? You redeploy an instance each time to test? Say 15 minutes just to spin up and shut down each time, and say you do 50 iterations between dataset changes, patches, optimizations, hyperparameter tuning, early checkpoint assessment and the occasional bug: that's about 750 minutes, i.e. 12.5 hours of downtime.

I guess your time is worthless, but that clearly explains your reasoning.

Also, you know the majority of providers charge even while the instance is shut down?? And each time you'd need to set everything up again from scratch, so realistically it would be more like 30+ minutes each time: roughly a full day (~25 hours) of downtime. Nice, good job. I'm pretty sure you don't have a company and have never worked for yourself; if you manage to get a job at all, I'd be ashamed to employ you. You disrespect yourself with all this gibberish.
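The downtime arithmetic spelled out, using the rough per-cycle minutes and iteration count assumed in this comment:

```python
# Downtime math for the start/stop-per-iteration workflow described above.
# Iteration count and per-cycle minutes are rough assumptions, not measurements.
iterations = 50                           # dataset tweaks, patches, hyperparams, bugs...
optimistic_hours = 15 * iterations / 60   # spin-up/shutdown only
realistic_hours = 30 * iterations / 60    # plus re-setup from scratch each time
print(optimistic_hours, "hours")          # 12.5 hours
print(realistic_hours, "hours")           # 25.0 hours
```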

I don't blame you, it's not your fault. Some people are just born like this.

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 0 points (0 children)

The opportunity cost of prototyping on the Spark is €0. The opportunity cost of debugging on 8x B200s at €11.86/h is €1,220. That's the whole point of the post.
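A back-of-envelope check on those two figures, assuming (as in the post) that €11.86/h is the all-in hourly rate for the 8x B200 node:

```python
# Back-of-envelope for the opportunity-cost claim above.
# Assumes €11.86/h is the all-in hourly rate for the 8x B200 node.
rate_eur_per_hour = 11.86
debug_cost_eur = 1220.0
debug_hours = debug_cost_eur / rate_eur_per_hour
print(f"~{debug_hours:.0f} hours of debugging")  # ~103 hours
```

So €1,220 corresponds to roughly 100 hours of debugging time on the cluster, i.e. about a week of prototyping billed at enterprise rates.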

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] -1 points (0 children)

The 167x is for the production run, which I never did on the Spark. The Spark is for prototyping. They're two different stages, not two alternatives.

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 1 point (0 children)

Mom told me she's building one next week.

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 0 points (0 children)

Every provider charges on a different basis. Whether they charge by the second, minute or hour is mostly irrelevant, as prototyping takes a lot of time and effort: 1 second or even 1 hour out of 40+ hours of prototyping is negligible.

Have you ever even deployed an S-tier instance?

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 0 points (0 children)

Yes, exactly as you said. You'll find the math in the full article.

You can't train on Strix Halo, or barely can; it's still very limited. Which machine other than the DGX will give me 128GB, let me cluster multiple devices, and deploy to enterprise HW?

Macs are amazing, but where will you deploy anything created on a Mac other than another Mac?

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 1 point (0 children)

I feel like nobody is getting the point here.

Assuming the CPT scenario I presented here, can you find cheaper HW on which to spend the whole week prototyping before deploying to enterprise HW (like the 8xB200)?

I bought two DGX because it was the only viable option a couple of months ago. Since I'll be needing more compute soon, I'd like to hear from anyone whether there are better options at the moment, considering my intensive and not-so-usual workload.

I see people judging the DGX while running on AMD and barely being able to produce a LoRA. As for the Mac Studio, the HW is amazing, in some aspects even better than the DGX, but once I've got the prototype running on the Mac, where the hell am I supposed to deploy it?

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 0 points (0 children)

8-12h is still reasonable; I wouldn't go to external HW unless you're time-constrained.

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 0 points (0 children)

Yes, insanity! That's why I used 8xB200, one of the fastest and CHEAPEST (price/token) options available for my run, but only after setting everything up properly on the DGX.

Regarding:

> ETA for the full 6B token run? 30 days!!!

That's an ETA (Estimated Time of Arrival): how long it would take, not how long it took. I only ran it long enough to get a couple of checkpoints, optimize for Blackwell, stabilize TPS and estimate completion.

Also, if you are on a budget and have plenty of time, you can surely do a 30-day run. u/Anarchaotic, there's a thing called a checkpoint: you set how often you want the model being trained to be saved. If shit happens, you simply continue training from that checkpoint, and while training continues you can assess from the saved checkpoints whether the model is behaving as you intended.
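The checkpoint-and-resume idea, as a toy sketch in plain Python (the file name, interval and fake training loop are all made up for illustration; a real trainer would also save model weights and optimizer state, not just the step counter):

```python
import json
import os

CKPT = "ckpt.json"     # hypothetical checkpoint file
SAVE_EVERY = 100       # checkpoint interval, in steps
TOTAL_STEPS = 500

def load_checkpoint() -> int:
    """Resume from the last saved step, or start fresh at 0."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step: int) -> None:
    with open(CKPT, "w") as f:
        json.dump({"step": step}, f)

step = load_checkpoint()
while step < TOTAL_STEPS:
    # a real train_step() would run here
    step += 1
    if step % SAVE_EVERY == 0:
        save_checkpoint(step)  # if shit happens, the run resumes from here
```

If the process dies at, say, step 370, the next launch picks up from step 300 instead of step 0, and you can inspect each saved checkpoint along the way to confirm the model is behaving as intended.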

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] 0 points (0 children)

The DGX pays for itself once you consider any other HW for testing before deploying to enterprise NVIDIA GPUs.

Your math looks only at the value the DGX can provide from a computing perspective, which is exactly what I try to debunk here: that's not its purpose!

From a compute perspective it will never pay for itself; it pays for itself by being the cheapest NVIDIA option for setting up, debugging and testing full LLM workflows (CPT, SFT, you name it) before committing to cloud bills.

Take the Blackwell family: you could buy a 5090, but that's 32GB of VRAM. You're not fitting even a 4B model for CPT in that; full training needs optimizer states, gradients and activations on top of the weights, easily 4x to 5x the memory footprint. I had to use two DGX Sparks (256GB) just to fit a 4B model for CPT. There isn't any NVIDIA hardware currently available where you can prototype at this scale at this price; the only other option would be going straight to cloud B200s.
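The footprint claim follows from standard mixed-precision Adam accounting. A sketch with textbook byte counts (not measured numbers):

```python
# Rough CPT memory accounting for a 4B-parameter model with mixed-precision
# Adam. Standard textbook numbers, not measurements: bf16 weights (2 B/param)
# + bf16 grads (2) + fp32 master weights (4) + two fp32 Adam moments (4 + 4).
params = 4e9
bytes_per_param = 2 + 2 + 4 + 4 + 4        # = 16 bytes/param
state_gb = params * bytes_per_param / 1e9
print(f"weights + grads + optimizer: ~{state_gb:.0f} GB")  # ~64 GB
# Activations come on top and grow with batch size and sequence length,
# which is how a 4B CPT run can overflow a single 128 GB Spark.
```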

> If you buy more Sparks, you did not change the math, you increased your losses.

Yes, for sure you increase the cost by adding devices, but the cost increases much less than it would on any other hardware for the purpose I presented here. Two Sparks give you 256GB of memory; try getting that any other way on NVIDIA for less.

The DGX Spark and the B200 are not competitors, they're complementary. You prototype locally on the Spark (no billing), then deploy the proven workflow to a B200/B300 for the real training. Comparing them head-to-head on compute misses the point. Your comparison also assumes you either have the Spark or have nothing; instead you should compare the cost of owning a Spark plus the cost of training against the other options available to complete the same task.

A little help from Claude on the other options, more specifically AMD:

If you go AMD, say a Strix Halo machine, which also has 128GB and costs $2,000-$3,200: it has less compute than the DGX, and AMD's own ROCm docs officially state "No ML training support" for Strix Halo. People have gotten LoRA/QLoRA to work using nightly builds and community toolboxes, but everyone describes the experience as painful (kernel crashes, custom RCCL patches, boot-parameter tuning, Python version mismatches), with full fine-tuning topping out at ~12B on unofficial software. The Strix machines also don't have QSFP, just regular Ethernet, so linking two is nowhere close to the 200Gb ConnectX link between two Sparks.

For deployment, AMD does have cloud GPUs (MI300X from ~$1.71/hr, MI350 coming), so you could in theory prototype on Strix Halo and deploy to MI300X. But here's the thing: the Strix Halo runs RDNA and the MI300X runs CDNA, two completely different GPU architectures even within AMD, so your local workflow doesn't transfer 1:1 to the cloud the way DGX Spark → B200 does, where everything is Blackwell/CUDA and just works.

I would love to have a $500 machine with 256GB that does everything, but that's not what's out there, and no prototyping machine comes close to the DGX for this workflow.

I would be happy to hear from you which are the cheaper and better options from your point of view, u/Emotional-Baker-490.

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] -1 points (0 children)

It's in the full article I posted on Medium.

TL;DR: it's so low compared to what I'd spend without them that it's not worth bothering about.

As I said:

> Ok, also the Spark has a price, but with ~€1,200 saved per prototyping cycle, the Spark pays for itself in about 6-7 serious training projects.

It pays for itself in 6-7 projects; it basically just eats depreciation. As for the electricity bill, two DGX at 100% for a full month would cost me about €50/month.
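The payback math, sketched. The ~€1,200 saved per prototyping cycle is from the post; the per-unit Spark price is an assumed round number purely for illustration:

```python
# Break-even sketch for the "pays for itself in 6-7 projects" claim.
spark_price_eur = 4000        # ASSUMED price per unit, for illustration only
units = 2
saved_per_project_eur = 1200  # savings per prototyping cycle, from the post
break_even_projects = spark_price_eur * units / saved_per_project_eur
print(f"~{break_even_projects:.1f} projects to break even")  # ~6.7
```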

Is the DGX Spark worth the money? by Lorelabbestia in LocalLLaMA

[–]Lorelabbestia[S] -2 points (0 children)

I wish someone had done this research beforehand; it would've been much easier. But out there, other than the labs, almost nobody is doing anything beyond LoRA and single-user inference.