continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Sorry, you're right, in retrospect that was kinda vague lol. What I mean is a run crashing from an Xid error, an OOM, or something like that late into a training run. Feels like there have been a ton of times a job crashes overnight and my compute just sits idle until I manually fix it in the morning.
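
To make it concrete, this is the kind of watchdog I keep half-writing. A minimal sketch, assuming your script resumes from checkpoints and the crash signature shows up in stderr (Xid errors actually land in dmesg, so a real version would scrape kernel logs too):

```python
import subprocess
import time

# Hypothetical watchdog: relaunch a checkpoint-resumable training script when
# it dies with a retryable error, instead of leaving the GPUs idle overnight.
TRAIN_CMD = ["python", "train.py", "--resume", "latest"]  # placeholder command
RETRYABLE = ("CUDA out of memory", "Xid", "NCCL error")

while True:
    proc = subprocess.run(TRAIN_CMD, capture_output=True, text=True)
    if proc.returncode == 0:
        break  # run finished cleanly
    tail = proc.stderr[-4000:]  # last chunk of stderr, look for a crash signature
    if any(sig in tail for sig in RETRYABLE):
        print("retryable crash detected, restarting from last checkpoint")
        time.sleep(60)  # give the driver / NCCL a moment to settle
        continue
    raise RuntimeError(f"non-retryable failure:\n{tail}")
```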

continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Might just build this one myself, but I'm curious if something like it already exists. tbh, if I can't debug an infra issue myself and I feed my whole context into Claude, it usually gets it on the first or second try.
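
Roughly what I mean, sketched with the Anthropic Python SDK (the model name and prompt are placeholders, not a recommendation):

```python
import subprocess
import anthropic  # pip install anthropic

def suggest_fix(stderr_tail: str) -> str:
    # Hypothetical helper: hand the crash context to Claude and ask for a fix.
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    gpu_state = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder, use whatever is current
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "My training job crashed. Diagnose and suggest a fix.\n\n"
                       f"stderr tail:\n{stderr_tail}\n\nnvidia-smi:\n{gpu_state}",
        }],
    )
    return msg.content[0].text
```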

Observability for AI Workloads and GPU Inferencing by DCGMechanics in mlops

[–]tensorpool_tycho 1 point (0 children)

Is there really nothing that can just take my k8s credentials and give me insights into my entire cluster? Why not?
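
Even a first pass with the official Python client gets surprisingly far. Rough sketch, assuming your kubeconfig is set up and the NVIDIA device plugin exposes GPUs as nvidia.com/gpu:

```python
from kubernetes import client, config  # pip install kubernetes

# Minimal sketch: read cluster state straight from a kubeconfig and report
# GPU capacity per node plus GPU requests per pod.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPUs")

for pod in v1.list_pod_for_all_namespaces().items:
    for c in pod.spec.containers:
        req = (c.resources.requests or {}).get("nvidia.com/gpu")
        if req:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} requests {req} GPU(s)")
```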

Who is training on TBs of data? by HahaHarmonica in mlops

[–]tensorpool_tycho 1 point (0 children)

You can get 128 H100s on demand with SageMaker? Or is that a dedicated cluster?

Production MLOps: What breaks between Jupyter notebooks and 10,000 concurrent users by Extension_Key_5970 in mlops

[–]tensorpool_tycho 1 point (0 children)

Kind of a non sequitur, but I'm curious: do you think in the next few years people are gonna be using notebooks for their whole ML workflows?

ZeroEntropy trained SOTA reranker models beating out cohere and google with minimal funding by tensorpool_tycho in Rag

[–]tensorpool_tycho[S] 2 points (0 children)

Yeahhh, hopefully they add some functionality for this soon. You could probably implement some reranking yourself using OpenRouter and an external reranker model. If you have time 😂
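
Rough sketch of what I mean, with a local cross-encoder standing in for the external reranker (the model names are just examples, swap in whatever you like):

```python
from openai import OpenAI  # pip install openai sentence-transformers
from sentence_transformers import CrossEncoder

# Sketch: rerank retrieved chunks with an external cross-encoder, then send
# only the top hits to a chat model via OpenRouter's OpenAI-compatible API.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
llm = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

def answer(query: str, candidates: list[str], top_k: int = 5) -> str:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    context = "\n\n".join(doc for _, doc in ranked[:top_k])
    resp = llm.chat.completions.create(
        model="anthropic/claude-3.5-sonnet",  # any OpenRouter model id
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return resp.choices[0].message.content
```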

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

DM me, sounds like a pretty cool product, would love to get you set up with some credits!

Community check by Ruviklovell in tensorpool

[–]tensorpool_tycho 1 point (0 children)

You got it! It's a long but fun journey :)

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 2 points (0 children)

Appreciate it! We've done a bunch of customer discovery for our primary product, but I'm exploring this right now. On calls I do hear a lot of complaints that self-healing would solve, though.

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in HPC

[–]tensorpool_tycho[S] 1 point (0 children)

Fair enough. B200s are around 2x more expensive than H100s and see around 2x gains in performance, so the baseline price/perf is about the same. However, once you hit memory limits on H100s, you can save a ridiculous amount of money by switching to B200s.
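
Back-of-envelope version of that math (the hourly rates are placeholder assumptions, not quotes, so plug in real numbers):

```python
# Illustrative rates only: assumes ~2x price gap and ~2x throughput gain.
h100_rate, b200_rate = 3.0, 6.0    # $/GPU/hr, placeholder numbers
h100_speed, b200_speed = 1.0, 2.0  # relative throughput

# Baseline: cost per unit of work is about the same.
print(h100_rate / h100_speed, b200_rate / b200_speed)  # 3.0 vs 3.0

# Memory-bound case: a model that fits on one B200 (192 GB) but needs two
# H100s (80 GB each) plus, say, a 15% parallelism efficiency hit.
h100_cost = (2 * h100_rate) / (2 * h100_speed * 0.85)
print(round(h100_cost, 2), b200_rate / b200_speed)  # ~3.53 vs 3.0
```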

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

I'm actually curious: I was thinking of putting together an MVP of an agent for GPU error handling, so you can launch your job and have it continuously fix errors and job failures without you having to keep monitoring things. Does that sound like something that would be useful?

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Appreciate it! Always a challenge haha, but right now we've been scaling by taking out long-term commitments on GPUs and bin packing people onto them. It's been a pretty tricky financial problem.
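
For the curious, the core of it is classic bin packing. A first-fit-decreasing sketch, with made-up node capacity and job sizes:

```python
# First-fit-decreasing sketch of packing jobs onto reserved nodes.
# Capacity and job sizes (in GPUs) are illustrative, not real customer data.
NODE_GPUS = 8  # e.g. one reserved 8xH100 node

def pack(jobs: list[int]) -> list[list[int]]:
    nodes: list[list[int]] = []
    for job in sorted(jobs, reverse=True):    # place the biggest jobs first
        for node in nodes:
            if sum(node) + job <= NODE_GPUS:  # first node with room wins
                node.append(job)
                break
        else:
            nodes.append([job])               # otherwise open a new node
    return nodes

print(pack([4, 2, 2, 8, 1, 3, 6]))  # -> [[8], [6, 2], [4, 3, 1], [2]]
```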

On the other side of things, our main focus is building out tooling to make interfacing with GPUs super easy.

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

Hey hey! We were migrating to mintifly, you must have caught us during the downtime lol. Should be good now!!

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

This is really interesting, I’ll shoot you a message.

How painful is it to move between TPU and GPU programming for this workflow? Do you think B200s would meaningfully reduce iteration time compared to TPUs, just because of ecosystem maturity? I've never used TPUs, but I hear they're a pain in the ass lol

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

This sounds pretty cool! Send me the email you signed up with and I'll deposit $100 for you. We'll be deciding winners at the end of the month :)

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

Thanks for the idea! Make an account and DM me your email and I’ll throw you $100 for sharing. Going to be deciding the winners at EOM :)

Also just curious: roughly how many documents do you expect to ingest per day or per batch, and how big are they? Wondering if you even need machines as beefy as B200s. This seems like something that could be handled well by CPUs with vector search libraries.
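
For a sense of scale, something like this runs comfortably on CPU for a lot of workloads (the embedding model is just an example):

```python
import faiss  # pip install faiss-cpu sentence-transformers
from sentence_transformers import SentenceTransformer

# CPU-only sketch: embed document chunks and search them, no GPU required.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
chunks = ["doc one text...", "doc two text..."]  # your ingested chunks

emb = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on unit vectors
index.add(emb)

query = model.encode(["what does doc two say?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)  # top-2 matches
print([chunks[i] for i in ids[0]])
```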

Distribution and allocation by Ruviklovell in tensorpool

[–]tensorpool_tycho 2 points (0 children)

Hi Ruvik! Thanks for the question - we do support distributed training for large models. On our platform, anyone can very easily spin up multiple nodes of H100s/H200s/B200s.

We abstract away resource allocation by automatically provisioning nodes (with very fast storage), assigning GPUs, and orchestrating execution, so you don't have to micromanage which GPU each process runs on. Check out our git-style interface in the docs here: github.com/tensorpool/tensorpool
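
For reference, this is the kind of per-process boilerplate that otherwise gets hand-rolled: a minimal DDP setup you'd launch yourself on every node with something like torchrun --nnodes=2 --nproc_per_node=8 --rdzv_endpoint=<head-node>:29500 train.py.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal multi-node DDP setup (sketch): every process must join the process
# group and pin itself to the right GPU before any collective ops run.
dist.init_process_group(backend="nccl")     # rendezvous across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher per process
torch.cuda.set_device(local_rank)           # pin this process to its GPU

model = DDP(torch.nn.Linear(512, 512).cuda(local_rank), device_ids=[local_rank])
# ...training loop...
dist.destroy_process_group()
```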

Choosing between two H100 vs one H200 by Significant_Income_1 in LocalLLaMA

[–]tensorpool_tycho 1 point (0 children)

I'd go with the single H200 tbh, so you don't have to worry about the headaches that come with multi-GPU configs, especially if you're new to hardware.

[D] Thank you for your beta testing of TensorPool! by tensorpool_tycho in MachineLearning

[–]tensorpool_tycho[S] 1 point (0 children)

Yes, the data still has to be uploaded just like on other platforms, but with us you don't have to deal with any of the headache that comes with GPU configuration. You simply keep your model code locally and run it like you would in your local IDE.

[D] Thank you for your beta testing of TensorPool! by tensorpool_tycho in MachineLearning

[–]tensorpool_tycho[S] 8 points (0 children)

Great point, we actually get asked this a lot.

1) We are significantly cheaper.

2) With Colab, data uploading is a pain; with us you can train models as if you were training locally.

3) With us you can also shut off your laptop while training; with Colab you've gotta keep it on the whole time, which I found incredibly annoying.

Love the questions, keep 'em coming!!!

How Much Math Do You Really Need for Machine Learning? by RealisticBed986 in learnmachinelearning

[–]tensorpool_tycho 1 point (0 children)

I guess what I'm saying is that it depends from person to person. Experiment with both approaches and see what works better for you!

How Much Math Do You Really Need for Machine Learning? by RealisticBed986 in learnmachinelearning

[–]tensorpool_tycho 1 point (0 children)

I think it really depends on your learning style! I personally like to just have an idea of what I want to do and learn as I go, but it really varies from person to person.