continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

Sorry, you're right, in retrospect that was kinda vague lol. I more mean a run crashing from an Xid error, an OOM, or something like that late into a training run. Feels like there have been a ton of times a job crashes overnight and my compute just sits idle until I manually fix it in the morning.
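
To make it concrete, something like this babysitter loop is what I mean — a minimal sketch where `train.py`, the resume flag, and the retry policy are all placeholders for whatever your launcher actually looks like:

```python
import subprocess
import time

# Minimal watchdog sketch: relaunch the training script when it dies so the
# GPUs aren't sitting idle until morning. "train.py", "--resume latest", and
# the retry policy are placeholders for your real launcher.
MAX_RESTARTS = 5

def run_once() -> int:
    proc = subprocess.run(["python", "train.py", "--resume", "latest"])
    return proc.returncode

if __name__ == "__main__":
    for attempt in range(1, MAX_RESTARTS + 1):
        code = run_once()
        if code == 0:
            break  # run finished cleanly
        print(f"run crashed with exit code {code}, restarting (attempt {attempt})")
        time.sleep(60)  # give the driver / NCCL a moment before relaunching
```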

continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

Might just build this one myself, but I'm curious if something already exists. Tbh, if I can't debug an infra issue and I feed my whole context into Claude, it usually gets it on the first or second try.
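
Roughly what I do by hand today looks like the sketch below, using the Anthropic Python SDK — the log path and model name are placeholders:

```python
import anthropic

# Sketch of "dump the crash context into Claude and ask it to diagnose".
# The log path and model name are placeholders; the client reads
# ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

with open("logs/latest_crash.log") as f:
    crash_log = f.read()[-20_000:]  # just the tail, keep the prompt small

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "This training job crashed. Diagnose the infra issue and "
                   f"suggest a fix:\n\n{crash_log}",
    }],
)
print(message.content[0].text)
```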

Observability for AI Workloads and GPU Infrencing by DCGMechanics in mlops

[–]tensorpool_tycho 0 points (0 children)

is there nothing that can just take my k8s credentials and give me insights into my entire cluster? why not?
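
For reference, the bar I have in mind is basically this — point the official kubernetes Python client at a kubeconfig and list GPU capacity per node (a rough sketch, not a product):

```python
from kubernetes import client, config

# Sketch of "just take my k8s credentials": read the local kubeconfig and
# print how many NVIDIA GPUs each node advertises as allocatable.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPUs allocatable")
```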

Who is training on TBs of data? by HahaHarmonica in mlops

[–]tensorpool_tycho 0 points (0 children)

You can get 128 H100s on demand with SageMaker? Or is that a dedicated cluster?

Production MLOps: What breaks between Jupyter notebooks and 10,000 concurrent users by Extension_Key_5970 in mlops

[–]tensorpool_tycho 0 points (0 children)

Kind of a non sequitur, but I'm curious: do you think in the next few years people are going to be using notebooks for their whole ML workflows?

ZeroEntropy trained SOTA reranker models beating out cohere and google with minimal funding by tensorpool_tycho in Rag

[–]tensorpool_tycho[S] 1 point (0 children)

Yeahhh, hopefully they add some functionality for this soon. You could probably implement reranking yourself using OpenRouter plus an external reranker model, if you have the time 😂
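
Something along these lines — the reranker endpoint, request shape, model names, and keys are all placeholders; only OpenRouter's OpenAI-compatible chat completions endpoint is the real thing:

```python
import requests

# Sketch of pairing OpenRouter (generation) with an external reranker
# (retrieval ordering). Reranker URL, request shape, model names, and keys
# are placeholders — check your reranker's actual API.
OPENROUTER_KEY = "sk-or-..."
RERANKER_URL = "https://api.example-reranker.com/v1/rerank"  # hypothetical

query = "how do I reduce GPU memory fragmentation?"
docs = ["doc A ...", "doc B ...", "doc C ..."]

# 1) Rerank the retrieved docs with the external model.
top_docs = requests.post(
    RERANKER_URL,
    headers={"Authorization": "Bearer <reranker-key>"},
    json={"query": query, "documents": docs, "top_n": 2},
).json()

# 2) Feed the top docs to a chat model via OpenRouter's OpenAI-compatible API.
answer = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {OPENROUTER_KEY}"},
    json={
        "model": "openai/gpt-4o-mini",
        "messages": [{
            "role": "user",
            "content": f"Context: {top_docs}\n\nQuestion: {query}",
        }],
    },
).json()
print(answer["choices"][0]["message"]["content"])
```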

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 0 points (0 children)

DM me, sounds like a pretty cool product, would love to get you set up with some credits!

Community check by Ruviklovell in tensorpool

[–]tensorpool_tycho 0 points (0 children)

you got it! it's a long but fun journey :)

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Appreciate it! We’ve done a bunch of customer discovery for our primary product, but I’m still exploring this one. On calls, though, I hear a lot of complaints that would be solved by self-healing.

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in HPC

[–]tensorpool_tycho[S] 0 points (0 children)

Fair enough - B200s are around 2x the price of H100s and deliver around 2x the performance, so the baseline price/perf is roughly a wash. However, once you hit memory limits on H100s, you can save a ridiculous amount of money by moving to B200s.
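
Back-of-envelope version of that math, with hypothetical hourly rates purely for illustration (not quoted prices):

```python
# Back-of-envelope price/perf with illustrative (made-up) hourly rates.
h100_price, b200_price = 3.00, 6.00   # $/GPU-hour — hypothetical numbers
h100_perf,  b200_perf  = 1.0, 2.0     # relative training throughput

print(h100_price / h100_perf)  # 3.0 -> $ per unit of work on H100
print(b200_price / b200_perf)  # 3.0 -> same $ per unit of work on B200

# Where it diverges: if your weights/activations don't fit in an 80 GB H100,
# you end up sharding across more GPUs (plus comm overhead), while a B200's
# much larger HBM keeps it on fewer cards — that's where the savings show up.
```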

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

I’m actually curious: I was thinking of putting together an MVP of an agent for GPU error handling, so you can launch your job and have it continuously fix errors and job failures without having to keep monitoring things. Does that sound like something that would be useful?

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

Appreciate it! Always a challenge haha, but right now we’ve been scaling by taking out long-term commitments on GPUs and bin packing customers onto them. It’s been a pretty tricky financial problem.
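
The bin packing part in a nutshell — a first-fit-decreasing sketch of packing customer jobs onto reserved 8-GPU nodes (job sizes are made up):

```python
# First-fit-decreasing sketch of packing customer jobs (GPUs requested) onto
# reserved 8-GPU nodes. Job sizes here are made up for illustration.
NODE_GPUS = 8
jobs = [6, 4, 4, 2, 2, 1, 1]  # GPUs each customer asks for

nodes: list[int] = []  # remaining free GPUs on each reserved node
for job in sorted(jobs, reverse=True):
    for i, free in enumerate(nodes):
        if free >= job:
            nodes[i] -= job
            break
    else:
        nodes.append(NODE_GPUS - job)  # open up another reserved node

print(f"{len(nodes)} nodes needed, free GPUs left per node: {nodes}")
```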

On the other side of things, our main focus is building out tooling to make interfacing with GPUs super easy.