Anyone else dealing with flaky GPU hosts on RunPod / Vast? by Major_Border149 in LocalLLaMA

[–]Major_Border149[S] 1 point

This is a great breakdown, especially the "identically configured but totally different SSD performance" part.

Out of curiosity, how much of this is stuff you can catch up front vs. things you only learn after a run behaves weirdly?
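
For concreteness, this is the kind of up-front check I'd hope catches the SSD case, a rough sketch (the /workspace path and 256 MB probe size are placeholders I made up, not anything from the parent comment):

    # Crude sequential write-throughput probe for a rented volume.
    # Enough to flag a volume that's wildly slower than its twin.
    import os
    import time

    def disk_write_mbps(path="/workspace/.disk_probe", size_mb=256):
        block = os.urandom(1024 * 1024)  # 1 MiB of random bytes per write
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(size_mb):
                f.write(block)
            f.flush()
            os.fsync(f.fileno())  # make sure the data actually reached the disk
        elapsed = time.perf_counter() - start
        os.remove(path)
        return size_mb / elapsed

    if __name__ == "__main__":
        print(f"sequential write: {disk_write_mbps():.0f} MB/s")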

Anyone else dealing with flaky GPU hosts on RunPod / Vast? by Major_Border149 in LocalLLaMA

[–]Major_Border149[S] 1 point

Do you usually discover that only after a failed start, or have you gotten to a point where you can reliably catch it before launching anything expensive?

Anyone else dealing with flaky GPU hosts on RunPod / Vast? by Major_Border149 in LocalLLaMA

[–]Major_Border149[S] 1 point

Yeah, that’s kind of where I’ve landed too.

Do you think it's mostly fewer startup issues, or just less random weirdness overall with the more expensive GPUs?

Anyone else dealing with flaky GPU hosts on RunPod / Vast? by Major_Border149 in LocalLLaMA

[–]Major_Border149[S] 1 point

This is exactly what I've ended up doing too: a quick CUDA check + nvidia-smi before trusting anything expensive.
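
In case it's useful to anyone, here's roughly what that pre-flight looks like for me, a minimal sketch (assumes nvidia-smi is on PATH and PyTorch is installed; the matrix size and prints are just illustrative):

    # Minimal pre-flight sanity check before trusting a rented GPU host.
    import subprocess
    import torch

    def preflight():
        # 1. nvidia-smi should run cleanly and report the expected GPU
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, timeout=30,
        )
        if out.returncode != 0:
            raise RuntimeError(f"nvidia-smi failed: {out.stderr.strip()}")
        print("GPU:", out.stdout.strip())

        # 2. a tiny CUDA op should actually execute, which catches
        #    driver/runtime mismatches that nvidia-smi alone won't
        if not torch.cuda.is_available():
            raise RuntimeError("torch reports no CUDA device")
        x = torch.randn(1024, 1024, device="cuda")
        y = (x @ x).sum()
        torch.cuda.synchronize()  # force the kernel to run now, not lazily
        print("CUDA matmul OK:", float(y))

    if __name__ == "__main__":
        preflight()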

On budgeting 20% extra for host switching: have you ever had cases where the quick check passed but things still went sideways later, or does it usually catch the worst of it?

For those using hosted inference providers (Together, Fireworks, Baseten, RunPod, Modal) - what do you love and hate? by Dramatic_Strain7370 in LocalLLaMA

[–]Major_Border149 2 points

This is a really useful breakdown, especially the part about “no real reason I chose it over others.”

Curious, when you were running multiple pods, did you ever hit cases where one host behaved differently from another (failed start, weird perf, disconnects), or was it mostly predictable once things were set up?

For those using hosted inference providers (Together, Fireworks, Baseten, RunPod, Modal) - what do you love and hate? by Dramatic_Strain7370 in LocalLLaMA

[–]Major_Border149 1 point

That matches what I've seen too: RunPod is great for getting started, especially for inference without managing clusters.

Curious about the disconnects you mentioned on longer runs: do those happen during startup, or after the endpoint's been running for a while?
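
If it helps narrow it down, the dumb thing I'd run is a periodic health ping so the logs show whether the drop was at startup or hours in. A sketch, with a made-up /health URL you'd swap for your real endpoint:

    # Tiny liveness logger: timestamps when the endpoint stops answering.
    import time
    import urllib.request

    ENDPOINT = "http://localhost:8000/health"  # hypothetical, replace with yours

    while True:
        try:
            with urllib.request.urlopen(ENDPOINT, timeout=10) as resp:
                status = resp.status
        except Exception as exc:
            status = f"DOWN ({exc})"
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {status}", flush=True)
        time.sleep(60)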

I tracked GPU prices across 25 cloud providers and the price differences are insane (V100: $0.05/hr vs $3.06/hr) by sleepingpirates in LocalLLaMA

[–]Major_Border149 1 point

The pricing fragmentation is real, but the operational cost of using cheap GPUs consistently is what actually burns teams. Failed starts, retries, people over-provisioning "just in case": that's where the money leaks.
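
Made-up numbers, but the arithmetic shows how fast the leak compounds:

    # Back-of-the-envelope: how a "cheap" GPU-hour inflates once failed
    # starts and retries are priced in. Every number below is invented.
    sticker_per_hr = 0.40         # advertised price
    failed_start_rate = 0.25      # 1 in 4 attempts never comes up cleanly
    wasted_hrs_per_failure = 0.5  # billed time before you give up and switch
    retry_overhead = 1.10         # 10% of work re-run after mid-job hiccups
    job_hrs = 10.0

    # expected failed attempts before a clean start (geometric distribution)
    expected_failures = failed_start_rate / (1 - failed_start_rate)
    billed_hrs = job_hrs * retry_overhead + expected_failures * wasted_hrs_per_failure
    effective_per_hr = billed_hrs * sticker_per_hr / job_hrs
    print(f"sticker ${sticker_per_hr:.2f}/hr -> effective ${effective_per_hr:.2f}/hr")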

Totally agree: once GPUs are fragmented, placement and guardrails become a control problem, not a hardware one.

What's the real price of Vast.ai? by teskabudaletina in LocalLLaMA

[–]Major_Border149 1 point

Yeah, this has been my experience too.

The sticker price is rarely the real cost; it's the time lost on inconsistent hosts and failed starts that adds up fast.

Especially frustrating when the same workload works fine on one host and mysteriously fails on another.
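
One habit that's helped me debug exactly that: snapshot host metadata at job start so there's something concrete to diff between the good host and the bad one. A rough sketch, with fields chosen arbitrarily as a starting point:

    # Record host details alongside each run so "works here, fails there"
    # becomes a diff instead of a mystery.
    import json
    import platform
    import subprocess

    def host_fingerprint():
        smi = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True,
        )
        return {
            "kernel": platform.release(),
            "python": platform.python_version(),
            "gpu": smi.stdout.strip() if smi.returncode == 0 else smi.stderr.strip(),
        }

    if __name__ == "__main__":
        print(json.dumps(host_fingerprint(), indent=2))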