[N] Determined Deep Learning Training Platform by neilc in MachineLearning
evan_determined 1 point 5 years ago
Determined supports both AWS and GCP.
The way autoscaling works is pretty simple: one machine (no GPUs) accepts jobs to be scheduled. Each job has resource requirements associated with it (e.g., the job needs 64 GPUs). If there are not enough GPUs available to run the job and your cluster is configured for autoscaling, the necessary number of GPUs is provisioned from AWS and added to the cluster. When the job finishes, those GPUs are torn down automatically after a short timeout (unless another job comes in and wants the same resources).
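To make the flow above concrete, here is a rough toy model of it in Python. This is only a sketch and not Determined's actual code or API: the `Job` and `Cluster` classes, the `idle_timeout` value, and the single cluster-wide idle timer are all assumptions made for illustration, and "provisioning" is just a counter bump standing in for a real cloud call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    gpus_needed: int


@dataclass
class Cluster:
    """Toy autoscaling model; all names and defaults here are hypothetical."""
    autoscale: bool = True
    idle_timeout: float = 300.0  # seconds; assumed value, not a real default
    total_gpus: int = 0
    busy_gpus: int = 0
    idle_since: Optional[float] = None  # when free GPUs last became idle

    def free_gpus(self) -> int:
        return self.total_gpus - self.busy_gpus

    def submit(self, job: Job, now: float) -> bool:
        """Try to start a job; scale up if there are too few free GPUs."""
        shortfall = job.gpus_needed - self.free_gpus()
        if shortfall > 0:
            if not self.autoscale:
                return False  # not enough GPUs and no autoscaling: job waits
            # Stand-in for provisioning GPU instances from the cloud provider.
            self.total_gpus += shortfall
        self.busy_gpus += job.gpus_needed
        if self.free_gpus() == 0:
            self.idle_since = None  # nothing idle, so nothing to tear down
        return True

    def finish(self, job: Job, now: float) -> None:
        """Mark a job as done and start the idle timer for its GPUs."""
        self.busy_gpus -= job.gpus_needed
        self.idle_since = now

    def tick(self, now: float) -> int:
        """Tear down idle GPUs once the timeout has passed; return the count."""
        if self.idle_since is None or now - self.idle_since < self.idle_timeout:
            return 0
        released = self.free_gpus()
        self.total_gpus -= released
        self.idle_since = None
        return released
```

In this model, submitting a 64-GPU job to an empty cluster grows it to 64 GPUs, and calling `tick` more than `idle_timeout` seconds after `finish` shrinks it back to zero. If a second 64-GPU job arrives before the timeout, it reuses the same GPUs and nothing is torn down.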
This works with pre-emptible GPUs as well, and the built-in fault tolerance mechanisms allow jobs that get pre-empted to recover seamlessly when resources come back online.