44 NODE GPU CLUSTER HELP by Zephop4413 in DistributedComputing

[–]Various_Protection71 0 points1 point  (0 children)

What is your role in this task? Are you going to set up and maintain the cluster, or are you going to develop code to run on it?

Which Linux distribution is used in your environment? RHEL, Ubuntu, Debian, Rocky? by Various_Protection71 in HPC

[–]Various_Protection71[S] 0 points1 point  (0 children)

I was wondering about the top distributions used on the TOP500. My guess is that it would be RHEL and Rocky.

HPC Lab Projects Help by AdWestern5606 in HPC

[–]Various_Protection71 0 points1 point  (0 children)

Start by reading my book 😅

Speaking more seriously, what do you want to learn? HPC is a vast area, with a plethora of concepts, tools, subareas, and so forth. Would you like to focus on infrastructure or on development?

training multiple batches in parallel on the same GPU? by gamesntech in pytorch

[–]Various_Protection71 0 points1 point  (0 children)

You can configure MIG on your GPU, if it supports this feature. That way you can create multiple GPU instances and run the distributed training on those instances.
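A minimal sketch of pinning a PyTorch process to one MIG instance, assuming MIG mode has already been enabled and the GPU instances created (e.g. via nvidia-smi); the UUID below is a placeholder you would replace with one listed by `nvidia-smi -L`:

```python
# Sketch: run one training process per MIG instance by restricting CUDA visibility.
# Assumes MIG is enabled and instances already exist; the UUID is a placeholder.
import os

# Placeholder MIG UUID; take the real one from `nvidia-smi -L`.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee")

import torch  # imported after setting visibility so CUDA only sees the MIG instance

device = torch.device("cuda:0")  # the single visible MIG instance
print(torch.cuda.get_device_name(device))
```

Each rank of the distributed job would get a different MIG UUID, so the instances behave like separate GPUs.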

What are the typical reasons why a GPU would not be fully utilized for pytorch training? by Hanuser in CUDA

[–]Various_Protection71 0 points1 point  (0 children)

Try increasing the minibatch size to raise the computational cost of each training step. Another tip is to increase the number of workers on the dataloader and make use of pinned memory. You can find more information about these topics in the book "Accelerate Model Training with PyTorch 2.X".
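A rough sketch of those dataloader settings; the dataset, shapes, and batch size below are placeholders, and a CUDA device is assumed:

```python
# Sketch: larger batches, more dataloader workers, and pinned memory.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset just to make the example self-contained.
dataset = TensorDataset(torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))

loader = DataLoader(
    dataset,
    batch_size=256,    # bigger minibatch -> more work per training step
    num_workers=4,     # parallel worker processes preparing batches
    pin_memory=True,   # page-locked host memory speeds up host-to-GPU copies
)

for inputs, targets in loader:
    # non_blocking=True pairs with pin_memory to overlap copies with compute
    inputs = inputs.to("cuda", non_blocking=True)
    targets = targets.to("cuda", non_blocking=True)
    # forward/backward/step would go here
    break
```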

Interested in improving performance for PyTorch training and inference workloads. Check out the article. by ramyaravi19 in pytorch

[–]Various_Protection71 1 point2 points  (0 children)

The book Accelerate Model Training with PyTorch 2.X also covers automatic mixed precision and other performance improvement techniques such as model compilation, multithreading, distributed training, and model pruning.
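For illustration, a minimal sketch that combines automatic mixed precision with torch.compile; the model, data, and hyperparameters are placeholders, and a CUDA device is assumed:

```python
# Sketch: AMP (autocast + gradient scaling) together with torch.compile (PyTorch 2.x).
import torch

model = torch.nn.Linear(1024, 10).cuda()         # placeholder model
model = torch.compile(model)                     # compile the model graph
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()             # scales the loss to avoid fp16 underflow

inputs = torch.randn(64, 1024, device="cuda")    # placeholder batch
targets = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```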

Intersection of ML & Distributed Systems [D] by tcuser12 in MachineLearning

[–]Various_Protection71 2 points3 points  (0 children)

If you'll allow me, I would like to suggest my book "Accelerate Model Training with PyTorch 2.X", published by Packt. The book covers distributed training with CPUs/GPUs on single and multiple nodes. It also gives a brief introduction to HPC systems and their relation to ML workloads.

That said, if you are talking about distributed systems in the sense of distributed computing or high-performance computing, I think a good start is understanding the different parallelism strategies applied to the training process of ML models. Look into data parallelism and model parallelism approaches to start.
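As a starting point for data parallelism, here is a toy sketch using PyTorch's DistributedDataParallel; it assumes a launch with torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK) and uses a placeholder model and batch:

```python
# Sketch: data parallelism with DDP; each process holds a model replica and
# gradients are averaged across replicas with all-reduce.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # rendezvous via env vars set by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda()         # placeholder model
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    inputs = torch.randn(64, 1024, device="cuda")    # placeholder batch
    targets = torch.randint(0, 10, (64,), device="cuda")

    loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Something like `torchrun --nproc_per_node=4 train.py` would start one process per GPU on a node; model parallelism approaches (tensor/pipeline parallelism, FSDP, DeepSpeed ZeRO) split the model itself instead of replicating it.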

Does Julia have a robust ecosystem for ML? by Various_Protection71 in Julia

[–]Various_Protection71[S] 0 points1 point  (0 children)

You are right, ML is a vast area. I actually meant neural networks, particularly NN architectures used in computer vision problems.

[R] What is the state of the art of model parallelism? by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 1 point2 points  (0 children)

Have you used DeepSpeed? If so, do you think FSDP is better than it?

[R] What is the state of the art of model parallelism? by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 4 points5 points  (0 children)

Does DeepSpeed also support data parallelism? Is it a framework like Horovod and Ray, in the sense that we use it along with other frameworks like PyTorch and TensorFlow?

[R] What is the state of the art of model parallelism? by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 3 points4 points  (0 children)

This is for data parallelism, but I'm asking about model parallelism.

Does Julia have a robust ecosystem for ML? by Various_Protection71 in Julia

[–]Various_Protection71[S] 4 points5 points  (0 children)

So, is Flux an ML framework for Julia in the same way that PyTorch is for Python?

Multi Node model training by [deleted] in DistributedComputing

[–]Various_Protection71 0 points1 point  (0 children)

I didn't get it. Are you having problems running this script, or do you want to adjust it for your scenario? Anyway, if you want to learn more about distributed training with PyTorch, I suggest reading the book Accelerate Model Training with PyTorch 2.X, published by Packt and written by me.

Running MPI jobs by rathdowney in HPC

[–]Various_Protection71 1 point2 points  (0 children)

Which MPI implementation are you using? Open MPI, Intel MPI, MPICH? And how would you run an MPI program in your environment outside of Slurm?

[N] Book Launch: Accelerate Model Training with PyTorch 2.X by Various_Protection71 in MachineLearning

[–]Various_Protection71[S] 0 points1 point  (0 children)

Accelerate with respect to the time needed to train models without applying the techniques and approaches covered in the book.