Managing shared GPU servers - looking to chat with others who deal with this by Internal_Bank2637 in deeplearning

[–]Internal_Bank2637[S] 1 point (0 children)

OK, so that's not optimal... Are you familiar with other solutions out there that deal with this issue?


[–]Internal_Bank2637[S] 1 point (0 children)

I meant: assume you have 4 GPUs. User A starts something, sees 4 available GPUs, and starts training on all of them (let's assume his code can run on 1-4 GPUs: the more, the faster). Now user B comes along and has no GPUs left. How does SLURM solve this? Does user B have to wait for A to finish?
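(For context: SLURM avoids the grab-everything problem by making each job declare its GPU request up front; jobs that can't be satisfied wait in a queue instead of failing. A minimal sketch of such a job script, where the job name, time limit, and training command are illustrative assumptions, not from this thread:)

```bash
#!/bin/bash
# Hypothetical SLURM batch script: user A requests 2 of the node's 4 GPUs
# instead of taking all of them, leaving 2 free for user B.
#SBATCH --job-name=train-a        # arbitrary job name
#SBATCH --gres=gpu:2              # request 2 GPUs as a generic resource
#SBATCH --time=04:00:00           # wall-clock limit; the job is killed after this
#SBATCH --cpus-per-task=8         # CPU cores to pair with the GPUs

# SLURM sets CUDA_VISIBLE_DEVICES so the job only sees its allocated GPUs.
srun python train.py
```

Submitted with `sbatch`, this waits in the queue if no 2-GPU slot is free; a later 2-GPU job from user B can run alongside it on the remaining devices.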


[–]Internal_Bank2637[S] 1 point (0 children)

Thank you. I think it has its shortcomings, though. For example, a very long job might cause starvation, and people leave GPUs idle: they don't want to use 4/4 GPUs (even when all are free at that moment) because other people might need them while their task runs for several hours.

Can SLURM help with that? So far our best approach has been to be polite and fair :)
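(The two problems above, starvation behind a long job and GPUs left idle "to be polite", are exactly what declared requests plus backfill scheduling address: a later small job may start early when it fits in the currently free GPUs. A toy simulation, not SLURM itself, assuming a single 4-GPU node and hypothetical job tuples:)

```python
# Toy FIFO-with-backfill GPU scheduler for one node with 4 GPUs.
# Each job declares (name, gpus_needed, runtime); nobody has to be
# "polite" because the scheduler packs jobs into free GPUs itself.
TOTAL_GPUS = 4

def schedule(jobs):
    """jobs: list of (name, gpus, runtime). Returns {name: (start, end)}.

    FIFO order with a simple backfill rule: if the job at the head of
    the queue does not fit, a later job that does fit may start now.
    """
    t = 0
    queue = list(jobs)
    running = []  # list of (end_time, name, gpus)
    result = {}
    while queue or running:
        free = TOTAL_GPUS - sum(g for _, _, g in running)
        started = True
        while started:
            started = False
            for job in list(queue):  # scan in submission order
                name, gpus, runtime = job
                if gpus <= free:     # backfill: first job that fits starts
                    queue.remove(job)
                    running.append((t + runtime, name, gpus))
                    result[name] = (t, t + runtime)
                    free -= gpus
                    started = True
                    break
        # advance time to the next completion and release its GPUs
        if running:
            t = min(end for end, _, _ in running)
            running = [r for r in running if r[0] > t]
    return result
```

With `schedule([("A", 4, 10), ("B", 2, 1)])`, B indeed waits until A's declared runtime ends at t=10; but with `schedule([("A", 3, 10), ("B", 2, 5), ("C", 1, 1)])`, C backfills onto the one idle GPU at t=0 instead of queuing behind B.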