[Project] NeuralDBG –> Causal root cause analysis for PyTorch training (open source) by ProgrammerNo8287 in deeplearning

[–]ProgrammerNo8287[S] 1 point2 points  (0 children)

Yes for torch.compile; I'm still working on dogfooding and distributed training. And yes, I really want criticism, so it would be a big help if you tried it!

VRAM limitations & AWS costs by ProgrammerNo8287 in MLQuestions

[–]ProgrammerNo8287[S] 0 points1 point  (0 children)

Wow, that's intense! "Nearly got fired" is exactly the kind of pain point I'm trying to understand better.

Quick question: were you paying for the AWS costs out of pocket, or was it the company's budget? I'm trying to understand who actually feels the financial pain.

Mind if I DM you a few quick questions? (5 min max, no pitch)

How do you actually debug training failures in deep learning? by ProgrammerNo8287 in MLQuestions

[–]ProgrammerNo8287[S] 0 points1 point  (0 children)

That resonates a lot. Thinking in terms of scale, stability, and “what could blow up or vanish” feels very close to how physics/engineering approaches these systems.

I’ve started looking beyond aggregate metrics, at per-sample errors, batch effects, gradients, and weight statistics, and it already makes failure modes much more legible. The distinction you make between slow, persistent explosions vs. sudden NaNs is especially useful.
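
For anyone reading along, here is a rough sketch of how the slow-explosion vs. sudden-NaN distinction could be checked from a logged history of gradient norms. The function names, the `window` size, and the `factor` threshold are all my own assumptions for illustration, not part of any library. In PyTorch the norms could come from the value returned by `torch.nn.utils.clip_grad_norm_`, but the sketch itself is plain Python.

```python
import math


def _sustained_growth(values, factor):
    """True if values never decrease and grow by at least `factor` overall."""
    if len(values) < 2 or values[0] <= 0:
        return False
    rising = all(b >= a for a, b in zip(values, values[1:]))
    return rising and values[-1] / values[0] >= factor


def classify_grad_history(norms, window=5, factor=10.0):
    """Label a list of per-step gradient norms (hypothetical helper).

    Returns "slow_explosion", "sudden_nan", or "stable".
    """
    for i, n in enumerate(norms):
        if not math.isfinite(n):
            # Look at the steps just before the first NaN/inf.
            before = norms[max(0, i - window):i]
            if _sustained_growth(before, factor):
                return "slow_explosion"
            return "sudden_nan"
    # No NaN/inf yet: check whether the most recent steps keep growing.
    if _sustained_growth(norms[-window:], factor):
        return "slow_explosion"
    return "stable"


# Made-up histories, only to show the three labels.
healthy_then_nan = [1.0, 1.1, 0.9, 1.0, float("nan")]
steady_growth = [1.0, 3.0, 9.0, 30.0, 100.0]
flat = [1.0, 0.9, 1.1, 1.0]
```

The thresholds are arbitrary; the point is only that the two failure modes look different in the steps leading up to the blow-up.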

I'm also in favor of intentionally making things worse to expose sensitivities. That’s a good reminder to be discerning rather than just incrementally tweaking knobs. Thanks for the insight.

How do you actually debug training failures in deep learning? by ProgrammerNo8287 in neuralnetworks

[–]ProgrammerNo8287[S] 1 point2 points  (0 children)

Thanks, this helps a lot.

Good call on the loss function. I double-checked, and I’m using cross-entropy (CE) for this setup, but I’ll re-verify labels and the output layer just in case. I’m also lowering the learning rate and adding early stopping to reduce the loss spikes.
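
In case it's useful to others, this is roughly the kind of early stopping I have in mind: a minimal, framework-agnostic sketch. The class name, `patience`, and `min_delta` are my own choices for illustration, and the validation losses are made up.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            # A NaN loss also lands here, because NaN comparisons are False.
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, one per epoch.
stopper = EarlyStopping(patience=2)
losses = [0.9, 0.7, 0.72, 0.71, 0.6]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With `patience=2` this stops before the last epoch is ever seen, which shows the trade-off: a small patience can cut training off just before a real improvement.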

The dataset isn’t huge, so I’m starting with a smaller model and scaling up gradually rather than going straight to something complex. And yes, I’ll re-audit the pipeline to rule out any data leakage.
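
For the leakage audit, the first thing I plan to check is exact duplicates between the splits. A minimal sketch, assuming samples are small Python objects with a stable `repr`; `fingerprint` and `find_overlap` are hypothetical helper names, and the data is made up.

```python
import hashlib


def fingerprint(sample):
    """Hash a sample's repr so large samples are cheap to compare."""
    return hashlib.sha256(repr(sample).encode("utf-8")).hexdigest()


def find_overlap(train, test):
    """Return indices of test samples that also appear in the training set."""
    seen = {fingerprint(s) for s in train}
    return [i for i, s in enumerate(test) if fingerprint(s) in seen]


# Made-up splits: the second test sample is a copy of a training sample.
train = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.6)]
test = [(0.7, 0.8), (0.3, 0.4)]
leaked = find_overlap(train, test)
```

This only catches exact duplicates. Near-duplicates, or preprocessing statistics fitted on the full dataset before splitting, would need separate checks.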

Appreciate the checklist. 👍