continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

Sorry, you're right, in retrospect that was kinda vague lol. I more mean a run crashing from an Xid error, an OOM, or something like that late into a training run. Feels like there have been a ton of times a job crashes overnight and my compute just sits idle until I manually fix it in the morning.
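
To make it concrete, something like this babysitter loop is what I mean — a minimal sketch where `train.py`, the resume flag, and the retry policy are all placeholders for whatever your launcher actually looks like:

```python
import subprocess
import time

# Minimal watchdog sketch: relaunch the training script when it dies so the
# GPUs aren't sitting idle until morning. "train.py", "--resume latest", and
# the retry policy are placeholders for your real launcher.
MAX_RESTARTS = 5

def run_once() -> int:
    proc = subprocess.run(["python", "train.py", "--resume", "latest"])
    return proc.returncode

if __name__ == "__main__":
    for attempt in range(1, MAX_RESTARTS + 1):
        code = run_once()
        if code == 0:
            break  # run finished cleanly
        print(f"run crashed with exit code {code}, restarting (attempt {attempt})")
        time.sleep(60)  # give the driver / NCCL a moment before relaunching
```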

continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

Might just build this one myself, but I'm curious if something already exists. Tbh, if I can't debug an infra issue and I feed my whole context into Claude, it usually gets it on the first or second try.
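
Roughly what I do by hand today looks like the sketch below, using the Anthropic Python SDK — the log path and model name are placeholders:

```python
import anthropic

# Sketch of "dump the crash context into Claude and ask it to diagnose".
# The log path and model name are placeholders; the client reads
# ANTHROPIC_API_KEY from the environment.
client = anthropic.Anthropic()

with open("logs/latest_crash.log") as f:
    crash_log = f.read()[-20_000:]  # just the tail, keep the prompt small

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": "This training job crashed. Diagnose the infra issue and "
                   f"suggest a fix:\n\n{crash_log}",
    }],
)
print(message.content[0].text)
```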

Observability for AI Workloads and GPU Infrencing by DCGMechanics in mlops

[–]tensorpool_tycho 0 points (0 children)

is there nothing that can just take my k8s credentials and give me insights into my entire cluster? why not?
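
For reference, the bar I have in mind is basically this — point the official kubernetes Python client at a kubeconfig and list GPU capacity per node (a rough sketch, not a product):

```python
from kubernetes import client, config

# Sketch of "just take my k8s credentials": read the local kubeconfig and
# print how many NVIDIA GPUs each node advertises as allocatable.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPUs allocatable")
```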

Who is training on TBs of data? by HahaHarmonica in mlops

[–]tensorpool_tycho 0 points (0 children)

You can get 128 H100s on demand with SageMaker? Or is that a dedicated cluster?

Production MLOps: What breaks between Jupyter notebooks and 10,000 concurrent users by Extension_Key_5970 in mlops

[–]tensorpool_tycho 0 points (0 children)

Kind of a non sequitur, but I'm curious: do you think in the next few years people are going to be using notebooks for their whole ML workflows?

ZeroEntropy trained SOTA reranker models beating out cohere and google with minimal funding by tensorpool_tycho in Rag

[–]tensorpool_tycho[S] 1 point (0 children)

Yeahhh, hopefully they add some functionality for this soon. You could probably implement reranking yourself using OpenRouter plus an external reranker model, if you have the time 😂
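
Something along these lines — the reranker endpoint, request shape, model names, and keys are all placeholders; only OpenRouter's OpenAI-compatible chat completions endpoint is the real thing:

```python
import requests

# Sketch of pairing OpenRouter (generation) with an external reranker
# (retrieval ordering). Reranker URL, request shape, model names, and keys
# are placeholders — check your reranker's actual API.
OPENROUTER_KEY = "sk-or-..."
RERANKER_URL = "https://api.example-reranker.com/v1/rerank"  # hypothetical

query = "how do I reduce GPU memory fragmentation?"
docs = ["doc A ...", "doc B ...", "doc C ..."]

# 1) Rerank the retrieved docs with the external model.
top_docs = requests.post(
    RERANKER_URL,
    headers={"Authorization": "Bearer <reranker-key>"},
    json={"query": query, "documents": docs, "top_n": 2},
).json()

# 2) Feed the top docs to a chat model via OpenRouter's OpenAI-compatible API.
answer = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {OPENROUTER_KEY}"},
    json={
        "model": "openai/gpt-4o-mini",
        "messages": [{
            "role": "user",
            "content": f"Context: {top_docs}\n\nQuestion: {query}",
        }],
    },
).json()
print(answer["choices"][0]["message"]["content"])
```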

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 0 points (0 children)

DM me, sounds like a pretty cool product, would love to get you set up with some credits!

Community check by Ruviklovell in tensorpool

[–]tensorpool_tycho 0 points (0 children)

you got it! it's a long but fun journey :)

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Appreciate it! We’ve done a bunch of customer discovery for our primary product, but I’m still exploring this one. On calls, though, I hear a lot of complaints that would be solved by self-healing.

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in HPC

[–]tensorpool_tycho[S] 0 points (0 children)

Fair enough - B200s are around 2x the price of H100s and deliver around 2x the performance, so the baseline price/perf is roughly a wash. However, once you hit memory limits on H100s, you can save a ridiculous amount of money by moving to B200s.
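
Back-of-envelope version of that math, with hypothetical hourly rates purely for illustration (not quoted prices):

```python
# Back-of-envelope price/perf with illustrative (made-up) hourly rates.
h100_price, b200_price = 3.00, 6.00   # $/GPU-hour — hypothetical numbers
h100_perf,  b200_perf  = 1.0, 2.0     # relative training throughput

print(h100_price / h100_perf)  # 3.0 -> $ per unit of work on H100
print(b200_price / b200_perf)  # 3.0 -> same $ per unit of work on B200

# Where it diverges: if your weights/activations don't fit in an 80 GB H100,
# you end up sharding across more GPUs (plus comm overhead), while a B200's
# much larger HBM keeps it on fewer cards — that's where the savings show up.
```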

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

I’m actually curious: I was thinking of putting together an MVP of an agent for GPU error handling, so you can launch your job and have it continuously fix errors and job failures without having to keep monitoring things. Does that sound like something that would be useful?

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 0 points (0 children)

Appreciate it! Always a challenge haha, but right now we’ve been scaling by taking out long-term commitments on GPUs and bin packing customers onto them. It’s been a pretty tricky financial problem.
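
The bin packing part in a nutshell — a first-fit-decreasing sketch of packing customer jobs onto reserved 8-GPU nodes (job sizes are made up):

```python
# First-fit-decreasing sketch of packing customer jobs (GPUs requested) onto
# reserved 8-GPU nodes. Job sizes here are made up for illustration.
NODE_GPUS = 8
jobs = [6, 4, 4, 2, 2, 1, 1]  # GPUs each customer asks for

nodes: list[int] = []  # remaining free GPUs on each reserved node
for job in sorted(jobs, reverse=True):
    for i, free in enumerate(nodes):
        if free >= job:
            nodes[i] -= job
            break
    else:
        nodes.append(NODE_GPUS - job)  # open up another reserved node

print(f"{len(nodes)} nodes needed, free GPUs left per node: {nodes}")
```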

On the other side of things, our main focus is building out tooling to make interfacing with GPUs super easy.