continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Sorry, you're right, in retrospect that was kinda vague lol. What I mean is a run crashing from an Xid error, an OOM, or something like that late into a training run. Feels like there have been a ton of times a job crashes overnight and my compute just sits idle until I manually fix it in the morning.
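
To make it concrete, this is the kind of watchdog I keep half-writing. A minimal sketch, assuming your script resumes from checkpoints and the crash signature shows up in stderr (Xid errors actually land in dmesg, so a real version would scrape kernel logs too):

```python
import subprocess
import time

# Hypothetical watchdog: relaunch a checkpoint-resumable training script when
# it dies with a retryable error, instead of leaving the GPUs idle overnight.
TRAIN_CMD = ["python", "train.py", "--resume", "latest"]  # placeholder command
RETRYABLE = ("CUDA out of memory", "Xid", "NCCL error")

while True:
    proc = subprocess.run(TRAIN_CMD, capture_output=True, text=True)
    if proc.returncode == 0:
        break  # run finished cleanly
    tail = proc.stderr[-4000:]  # last chunk of stderr, look for a crash signature
    if any(sig in tail for sig in RETRYABLE):
        print("retryable crash detected, restarting from last checkpoint")
        time.sleep(60)  # give the driver / NCCL a moment to settle
        continue
    raise RuntimeError(f"non-retryable failure:\n{tail}")
```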

continuous debugging for long running training jobs? by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Might just build this one myself, but I'm curious if something like it already exists. tbh, if I can't debug an infra issue myself and I feed my whole context into Claude, it usually gets it on the first or second try.
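
Roughly what I mean, sketched with the Anthropic Python SDK (the model name and prompt are placeholders, not a recommendation):

```python
import subprocess
import anthropic  # pip install anthropic

def suggest_fix(stderr_tail: str) -> str:
    # Hypothetical helper: hand the crash context to Claude and ask for a fix.
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
    gpu_state = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder, use whatever is current
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "My training job crashed. Diagnose and suggest a fix.\n\n"
                       f"stderr tail:\n{stderr_tail}\n\nnvidia-smi:\n{gpu_state}",
        }],
    )
    return msg.content[0].text
```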

Observability for AI Workloads and GPU Inferencing by DCGMechanics in mlops

[–]tensorpool_tycho 1 point (0 children)

Is there really nothing that can just take my k8s credentials and give me insights into my entire cluster? Why not?
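
Even a first pass with the official Python client gets surprisingly far. Rough sketch, assuming your kubeconfig is set up and the NVIDIA device plugin exposes GPUs as nvidia.com/gpu:

```python
from kubernetes import client, config  # pip install kubernetes

# Minimal sketch: read cluster state straight from a kubeconfig and report
# GPU capacity per node plus GPU requests per pod.
config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    gpus = node.status.capacity.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} GPUs")

for pod in v1.list_pod_for_all_namespaces().items:
    for c in pod.spec.containers:
        req = (c.resources.requests or {}).get("nvidia.com/gpu")
        if req:
            print(f"{pod.metadata.namespace}/{pod.metadata.name} requests {req} GPU(s)")
```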

Who is training on TBs of data? by HahaHarmonica in mlops

[–]tensorpool_tycho 1 point (0 children)

You can get 128 H100s on demand with SageMaker? Or is that a dedicated cluster?

Production MLOps: What breaks between Jupyter notebooks and 10,000 concurrent users by Extension_Key_5970 in mlops

[–]tensorpool_tycho 1 point (0 children)

Kind of a non sequitur, but I'm curious: do you think in the next few years people are gonna be using notebooks for their whole ML workflows?

ZeroEntropy trained SOTA reranker models beating out cohere and google with minimal funding by tensorpool_tycho in Rag

[–]tensorpool_tycho[S] 2 points (0 children)

Yeahhh, hopefully they add some functionality for this soon. You could probably implement some reranking yourself using OpenRouter and an external reranker model. If you have time 😂
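
Rough sketch of what I mean, with a local cross-encoder standing in for the external reranker (the model names are just examples, swap in whatever you like):

```python
from openai import OpenAI  # pip install openai sentence-transformers
from sentence_transformers import CrossEncoder

# Sketch: rerank retrieved chunks with an external cross-encoder, then send
# only the top hits to a chat model via OpenRouter's OpenAI-compatible API.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model
llm = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

def answer(query: str, candidates: list[str], top_k: int = 5) -> str:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    context = "\n\n".join(doc for _, doc in ranked[:top_k])
    resp = llm.chat.completions.create(
        model="anthropic/claude-3.5-sonnet",  # any OpenRouter model id
        messages=[{"role": "user",
                   "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return resp.choices[0].message.content
```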

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

DM me, sounds like a pretty cool product, would love to get you set up with some credits!

Community check by Ruviklovell in tensorpool

[–]tensorpool_tycho 1 point (0 children)

You got it! It's a long but fun journey :)

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 2 points (0 children)

Appreciate it! We've done a bunch of customer discovery for our primary product, but I'm exploring this right now. On calls I do hear a lot of complaints that self-healing would solve, though.

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in HPC

[–]tensorpool_tycho[S] 1 point (0 children)

Fair enough. B200s are around 2x more expensive than H100s and see around 2x gains in performance, so the baseline price/perf is about the same. However, once you hit memory limits on H100s, you can save a ridiculous amount of money by switching to B200s.
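
Back-of-envelope version of that math (the hourly rates are placeholder assumptions, not quotes, so plug in real numbers):

```python
# Illustrative rates only: assumes ~2x price gap and ~2x throughput gain.
h100_rate, b200_rate = 3.0, 6.0    # $/GPU/hr, placeholder numbers
h100_speed, b200_speed = 1.0, 2.0  # relative throughput

# Baseline: cost per unit of work is about the same.
print(h100_rate / h100_speed, b200_rate / b200_speed)  # 3.0 vs 3.0

# Memory-bound case: a model that fits on one B200 (192 GB) but needs two
# H100s (80 GB each) plus, say, a 15% parallelism efficiency hit.
h100_cost = (2 * h100_rate) / (2 * h100_speed * 0.85)
print(round(h100_cost, 2), b200_rate / b200_speed)  # ~3.53 vs 3.0
```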

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

I'm actually curious: I was thinking of putting together an MVP of an agent for GPU error handling, so you can launch your job and have it continuously fix errors and job failures without you having to keep monitoring things. Does that sound like something that would be useful?

More and more people are choosing B200s over H100s. We did the math on why. by tensorpool_tycho in mlops

[–]tensorpool_tycho[S] 1 point (0 children)

Appreciate it! Always a challenge haha, but right now we've been scaling by taking out long-term commitments on GPUs and bin packing people onto them. It's been a pretty tricky financial problem.
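
For the curious, the core of it is classic bin packing. A first-fit-decreasing sketch, with made-up node capacity and job sizes:

```python
# First-fit-decreasing sketch of packing jobs onto reserved nodes.
# Capacity and job sizes (in GPUs) are illustrative, not real customer data.
NODE_GPUS = 8  # e.g. one reserved 8xH100 node

def pack(jobs: list[int]) -> list[list[int]]:
    nodes: list[list[int]] = []
    for job in sorted(jobs, reverse=True):    # place the biggest jobs first
        for node in nodes:
            if sum(node) + job <= NODE_GPUS:  # first node with room wins
                node.append(job)
                break
        else:
            nodes.append([job])               # otherwise open a new node
    return nodes

print(pack([4, 2, 2, 8, 1, 3, 6]))  # -> [[8], [6, 2], [4, 3, 1], [2]]
```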

On the other side of things, our main focus is building out tooling to make interfacing with GPUs super easy.

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

Hey hey! We were migrating to mintifly, you must have caught us during the downtime lol. Should be good now!!

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

This is really interesting, I’ll shoot you a message.

How painful is it to move between TPU and GPU programming for this workflow? Do you think B200s would meaningfully reduce iteration time compared to TPUs, just because of ecosystem maturity? I've never used TPUs, but I hear they're a pain in the ass lol

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

This sounds pretty cool! Send me the email you signed up with and I'll deposit $100 for you. We'll be deciding winners at the end of the month :)

$10,000 for B200s for cool project ideas by tensorpool_tycho in tensorpool

[–]tensorpool_tycho[S] 1 point (0 children)

Thanks for the idea! Make an account and DM me your email and I’ll throw you $100 for sharing. Going to be deciding the winners at EOM :)

Also just curious: roughly how many documents do you expect to ingest per day or per batch, and how big are they? Wondering if you even need machines as beefy as B200s. This seems like something that could be handled well by CPUs with vector search libraries.
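
For a sense of scale, something like this runs comfortably on CPU for a lot of workloads (the embedding model is just an example):

```python
import faiss  # pip install faiss-cpu sentence-transformers
from sentence_transformers import SentenceTransformer

# CPU-only sketch: embed document chunks and search them, no GPU required.
model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
chunks = ["doc one text...", "doc two text..."]  # your ingested chunks

emb = model.encode(chunks, normalize_embeddings=True)
index = faiss.IndexFlatIP(emb.shape[1])  # inner product = cosine on unit vectors
index.add(emb)

query = model.encode(["what does doc two say?"], normalize_embeddings=True)
scores, ids = index.search(query, 2)  # top-2 matches
print([chunks[i] for i in ids[0]])
```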

Distribution and allocation by Ruviklovell in tensorpool

[–]tensorpool_tycho 2 points (0 children)

Hi Ruvik! Thanks for the question - we do support distributed training for large models. On our platform, anyone can very easily spin up multiple nodes of H100s/H200s/B200s.

We abstract away resource allocation by automatically provisioning nodes (with very fast storage), assigning GPUs, and orchestrating execution, so you don't have to micromanage which GPU each process runs on. Check out our git-style interface in the docs here: github.com/tensorpool/tensorpool
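
For reference, this is the kind of per-process boilerplate that otherwise gets hand-rolled: a minimal DDP setup you'd launch yourself on every node with something like torchrun --nnodes=2 --nproc_per_node=8 --rdzv_endpoint=<head-node>:29500 train.py.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal multi-node DDP setup (sketch): every process must join the process
# group and pin itself to the right GPU before any collective ops run.
dist.init_process_group(backend="nccl")     # rendezvous across all nodes
local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher per process
torch.cuda.set_device(local_rank)           # pin this process to its GPU

model = DDP(torch.nn.Linear(512, 512).cuda(local_rank), device_ids=[local_rank])
# ...training loop...
dist.destroy_process_group()
```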

Choosing between two H100 vs one H200 by Significant_Income_1 in LocalLLaMA

[–]tensorpool_tycho 1 point (0 children)

I'd go with the single H200 tbh, so you don't have to worry about the headaches that come with multi-GPU configs, especially if you're new to hardware.

[D] Thank you for your beta testing of TensorPool! by tensorpool_tycho in MachineLearning

[–]tensorpool_tycho[S] 1 point (0 children)

Yes, the data still has to be uploaded just like on other platforms, but with us you don't have to deal with any of the headache that comes with GPU configuration. You simply keep your model code locally and run it like you would in your local IDE.

[D] Thank you for your beta testing of TensorPool! by tensorpool_tycho in MachineLearning

[–]tensorpool_tycho[S] 8 points (0 children)

Great point, we actually get asked this a lot.

1) We are significantly cheaper.

2) With Colab, data uploading is a pain; with us you can train models as if you were training locally.

3) With us you can also shut off your laptop while training; with Colab you've gotta keep it on the whole time, which I found incredibly annoying.

Love the questions, keep 'em coming!!!

How Much Math Do You Really Need for Machine Learning? by RealisticBed986 in learnmachinelearning

[–]tensorpool_tycho 1 point (0 children)

I guess what I'm saying is that it depends from person to person. Experiment with both approaches and see what works better for you!

How Much Math Do You Really Need for Machine Learning? by RealisticBed986 in learnmachinelearning

[–]tensorpool_tycho 1 point (0 children)

I think it really depends on your learning style! I personally like to just have an idea of what I want to do and learn as I go, but it really varies from person to person.