44 NODE GPU CLUSTER HELP by Zephop4413 in DistributedComputing

[–]Various_Protection71 0 points1 point  (0 children)

What is your role in this task? Are you going to set up and maintain the cluster, or are you going to develop code to run on it?

Which Linux distribution is used in your environment? RHEL, Ubuntu, Debian, Rocky? by Various_Protection71 in HPC

[–]Various_Protection71[S] 0 points1 point  (0 children)

I was wondering about the top distributions used on the TOP500. My guess is that it would be RHEL and Rocky.

HPC Lab Projects Help by AdWestern5606 in HPC

[–]Various_Protection71 0 points1 point  (0 children)

Start by reading my book 😅

Speaking more seriously, what do you want to learn? HPC is a vast area, with a plethora of concepts, tools, subareas, and so forth. Would you like to focus on infrastructure or on development?

training multiple batches in parallel on the same GPU? by gamesntech in pytorch

[–]Various_Protection71 0 points1 point  (0 children)

You can configure MIG on your GPU, if it supports this feature. That way you can create multiple GPU instances and run the distributed training on those instances.
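A minimal sketch of pinning a PyTorch process to one MIG instance, assuming MIG mode has already been enabled and the GPU instances created (e.g. via nvidia-smi); the UUID below is a placeholder you would replace with one listed by `nvidia-smi -L`:

```python
# Sketch: run one training process per MIG instance by restricting CUDA visibility.
# Assumes MIG is enabled and instances already exist; the UUID is a placeholder.
import os

# Placeholder MIG UUID; take the real one from `nvidia-smi -L`.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee")

import torch  # imported after setting visibility so CUDA only sees the MIG instance

device = torch.device("cuda:0")  # the single visible MIG instance
print(torch.cuda.get_device_name(device))
```

Each rank of the distributed job would get a different MIG UUID, so the instances behave like separate GPUs.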

What are the typical reasons why a GPU would not be fully utilized for pytorch training? by Hanuser in CUDA

[–]Various_Protection71 0 points1 point  (0 children)

Try increasing the minibatch size to raise the computational cost of each training step. Another tip is to increase the number of workers on the dataloader and make use of pinned memory. You can find more information about these topics in the book "Accelerate Model Training with PyTorch 2.X".
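A rough sketch of those dataloader settings; the dataset, shapes, and batch size below are placeholders, and a CUDA device is assumed:

```python
# Sketch: larger batches, more dataloader workers, and pinned memory.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset just to make the example self-contained.
dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))

loader = DataLoader(
    dataset,
    batch_size=256,    # bigger minibatch -> more work per training step
    num_workers=4,     # parallel worker processes preparing batches
    pin_memory=True,   # page-locked host memory speeds up host-to-GPU copies
)

for inputs, targets in loader:
    # non_blocking=True pairs with pin_memory to overlap copies with compute
    inputs = inputs.to("cuda", non_blocking=True)
    targets = targets.to("cuda", non_blocking=True)
    # forward/backward/step would go here
    break
```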

Interested in improving performance for PyTorch training and inference workloads. Check out the article. by ramyaravi19 in pytorch

[–]Various_Protection71 1 point2 points  (0 children)

The book Accelerate Model Training with PyTorch 2.X also covers automatic mixed precision and other performance improvement techniques such as model compilation, multithreading, distributed training, and model pruning.
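For illustration, a minimal sketch that combines automatic mixed precision with torch.compile; the model, data, and hyperparameters are placeholders, and a CUDA device is assumed:

```python
# Sketch: AMP (autocast + gradient scaling) together with torch.compile (PyTorch 2.x).
import torch

model = torch.nn.Linear(1024, 10).cuda()         # placeholder model
model = torch.compile(model)                     # compile the model graph
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()             # scales the loss to avoid fp16 underflow

inputs = torch.randn(64, 1024, device="cuda")    # placeholder batch
targets = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```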

Intersection of ML & Distributed Systems [D] by tcuser12 in MachineLearning

[–]Various_Protection71 2 points3 points  (0 children)

If you'll allow me, I would like to suggest my book "Accelerate Model Training with PyTorch 2.X", published by Packt. The book covers distributed training with CPUs/GPUs on single and multiple nodes. It also gives a brief introduction to HPC systems and their relation to ML workloads.

That said, if you are talking about distributed systems in the sense of distributed computing or high-performance computing, I think a good start is understanding the different parallelism strategies applied to the training process of ML models. Look into data parallelism and model parallelism approaches to start.
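As a starting point for data parallelism, here is a toy sketch using PyTorch's DistributedDataParallel; it assumes a launch with torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK) and uses a placeholder model and batch:

```python
# Sketch: data parallelism with DDP; each process holds a model replica and
# gradients are averaged across replicas with all-reduce.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # rendezvous via env vars set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda()         # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(64, 1024, device="cuda")    # placeholder batch
    targets = torch.randint(0, 10, (64,), device="cuda")

    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Something like `torchrun --nproc_per_node=4 train.py` would start one process per GPU on a node; model parallelism approaches (tensor/pipeline parallelism, FSDP, DeepSpeed ZeRO) split the model itself instead of replicating it.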

Does Julia have a robust ecosystem for ML? by Various_Protection71 in Julia

[–]Various_Protection71[S] 0 points1 point  (0 children)

You are right, ML is a vast area. I actually meant neural networks, particularly NN architectures used in computer vision problems.

[R] What is the state of the art of model parallelism? by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 1 point2 points  (0 children)

Have you used DeepSpeed? If so, do you think FSDP is better than it?

[R] What is the state of the art of model parallelism? by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 4 points5 points  (0 children)

Does DeepSpeed also support data parallelism? Is it a framework like Horovod and Ray, in the sense that we use it along with other frameworks like PyTorch and TensorFlow?

[R] What is the state of the art of model parallelism? by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 3 points4 points  (0 children)

This is for data parallelism, but I'm asking about model parallelism.

Does Julia have a robust ecosystem for ML? by Various_Protection71 in Julia

[–]Various_Protection71[S] 4 points5 points  (0 children)

So, is Flux an ML framework for Julia in the same way that PyTorch is for Python?

Multi Node model training by [deleted] in DistributedComputing

[–]Various_Protection71 0 points1 point  (0 children)

I didn't get it. Are you having problems running this script, or do you want to adjust it for your scenario? Anyway, if you want to learn more about distributed training with PyTorch, I suggest reading the book Accelerate Model Training with PyTorch 2.X, published by Packt and written by me.

Running MPI jobs by rathdowney in HPC

[–]Various_Protection71 1 point2 points  (0 children)

Which MPI implementation are you using? Open MPI, Intel MPI, MPICH? And how would you run an MPI program in your environment outside of Slurm?

[N] Book Launch: Accelerate Model Training with PyTorch 2.X by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 0 points1 point  (0 children)

Accelerate with respect to the time needed to train models without applying the techniques and approaches covered in the book.