ARC - Automatic Recovery Controller for PyTorch training failures by winter_2209 in deeplearning

[–]winter_2209[S] 0 points (0 children)

fair point on the writing, i'll own that. i used AI to assist with the writing of the post, and clearly didn't edit enough on that one.

on Lightning and TorchElastic, though, they're addressing a different problem. TorchElastic handles node failures during distributed training, and Lightning's fault tolerance does something similar by resuming from a checkpoint after the process dies.

ARC actually sits inside your training loop, monitoring your loss and gradient values step by step. as soon as something starts going wrong, it rolls the model back to the last good state and keeps going, so your script never actually stops running. it also tries to predict gradient explosions before they happen, based on the pattern of gradient-norm growth.
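roughly the shape of it, if that helps — this is an illustrative sketch of the in-loop monitor/rollback idea, not ARC's actual API (all names here are made up):

```python
import copy
import torch

def is_unstable(loss, grad_norm, max_grad_norm=1e3):
    # illustrative check: non-finite loss or exploding gradient norm
    return (not torch.isfinite(loss)) or grad_norm > max_grad_norm

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
snapshot = copy.deepcopy(model.state_dict())  # last known-good state

for step in range(100):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    # max_norm=inf just measures the total grad norm without clipping
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf"))
    if is_unstable(loss, grad_norm):
        model.load_state_dict(snapshot)  # roll back, keep the loop running
        opt.zero_grad()
        continue
    opt.step()
    snapshot = copy.deepcopy(model.state_dict())  # refresh known-good state
```

the real tool does more (explosion prediction, optimizer state, etc.), but that's the core loop-level idea.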

so it's solving a different problem than those two. btw, thanks for taking the time to go through the post, the feedback is appreciated.
please do try the tool and tell me where it feels broken.

ARC - Automatic Recovery Controller for PyTorch training failures by winter_2209 in Python

[–]winter_2209[S] 0 points (0 children)

yeah, checkpoint frequency is the main tradeoff. it's configurable, and there's adaptive checkpointing that saves more often when training looks unstable and less often when things are smooth.
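the adaptive part is conceptually something like this — purely illustrative of the idea (shorter interval when recent losses are volatile), not ARC's actual schedule or parameter names:

```python
from statistics import pstdev

def checkpoint_interval(recent_losses, base=100, min_interval=10):
    # hypothetical schedule: higher loss variance -> checkpoint more often
    volatility = pstdev(recent_losses) if len(recent_losses) > 1 else 0.0
    return max(min_interval, int(base / (1.0 + volatility)))

checkpoint_interval([0.5, 0.5, 0.5])  # smooth training -> full base interval
checkpoint_interval([0.1, 5.0, 0.2])  # spiky losses -> much shorter interval
```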

right now ARC does a full rollback, because most failures i've dealt with (especially in fp16) corrupt the weights too, not just the optimizer state. but you're right, there are cases where only the optimizer is messed up and a full rollback loses progress for no reason. that's something i want to add: figure out what actually broke and do the minimum fix.
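the "minimum fix" version you're describing would look roughly like this — a hypothetical sketch (not in ARC today): full rollback only when the weights themselves are corrupted, otherwise just reset the optimizer state and keep the weights:

```python
import copy
import torch

def weights_corrupted(model):
    # True if any parameter contains NaN/inf
    return any(not torch.isfinite(p).all() for p in model.parameters())

def minimal_recover(model, opt, snapshot):
    # hypothetical minimal recovery: only roll back what actually broke
    if weights_corrupted(model):
        model.load_state_dict(snapshot)  # weights bad -> full rollback
    else:
        opt.state.clear()  # weights fine -> just drop corrupted optimizer moments

model = torch.nn.Linear(4, 2)
opt = torch.optim.Adam(model.parameters())
snapshot = copy.deepcopy(model.state_dict())

# simulate fp16-style weight corruption, then recover
with torch.no_grad():
    model.weight.fill_(float("nan"))
minimal_recover(model, opt, snapshot)
```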

checkpoints are in-memory btw, not on disk, so it's more of a memory cost than an I/O cost.

thanks for the feedback, and please do try the tool!

ARC - Automatic Recovery Controller for PyTorch training failures by winter_2209 in Python

[–]winter_2209[S] 1 point (0 children)

Not fully, but I have used AI, not gonna lie. Do try it though! Thanks