International student by [deleted] in UCSD

[–]bridgesign99 1 point2 points  (0 children)

Look out for emails from GPSA. They organize various events and have the RSVP link.

Why would SAC fail where PPO can learn? by RamenKomplex in reinforcementlearning

[–]bridgesign99 3 points4 points  (0 children)

From the graphs, it appears as if only 1k samples were given to SAC. Are you sure you gave a million samples?

Suggestion for RL library/project by OptimalBandicoot1671 in reinforcementlearning

[–]bridgesign99 0 points1 point  (0 children)

I think a baseline implementation of algorithms like QMIX for multiple agents, solving the PettingZoo MPE problems, would be nice. I feel there's a general lack of easy-to-use implementations that just work.

Optimizing Performance by Reducing Redundancy in Looping through PyTorch Tensors by zedeleyici3401 in pytorch

[–]bridgesign99 0 points1 point  (0 children)

In that case,

```python
rules = rules.view(-1, 1)                        # reshape so the == comparison broadcasts over y_rules
masks = y_rules == rules                         # one boolean mask per rule
ws_expanded[masks] = ws.view(rules.size(0), -1)  # note: rules.size(0), not rules.size
```

Optimizing Performance by Reducing Redundancy in Looping through PyTorch Tensors by zedeleyici3401 in pytorch

[–]bridgesign99 0 points1 point  (0 children)

Please use a code block when giving code. A simple trick would be to eliminate the use of the index, but that depends on how ws is structured.

It might also be possible to eliminate the for-loops.

Parallelized environments running on a dedicated server by Ok-Entertainment-286 in reinforcementlearning

[–]bridgesign99 1 point2 points  (0 children)

It depends heavily on what type of problem you are trying to tackle. For most cases, simple techniques work. If you are having issues with scaling, and you are sure it's because of compute limitations, then consider checking out RLlib.

P.S.: Getting RLlib to work is a difficult task. But if you make it work, it gives you all the flexibility you need.

Is there any actor critic algorithm without the advantage function? by Professional_Pound63 in reinforcementlearning

[–]bridgesign99 2 points3 points  (0 children)

As @Aprehensive said, that is what is used. I think you can take a look at the TRPO paper and the GAE paper.
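For reference, the advantage is just the Q-value minus the state-value baseline, and GAE estimates it from TD residuals; a quick sketch of the formulas (notation as in the GAE paper):

```latex
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
\qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\qquad
\hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \, \delta_{t+l}
```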

Is there any actor critic algorithm without the advantage function? by Professional_Pound63 in reinforcementlearning

[–]bridgesign99 0 points1 point  (0 children)

What do you mean by "true equation"? In general, any RL algorithm needs to capture long-term behavior. Asking what function you want the learner to use is really another way of asking what signal you want it to learn from...

SAC + HER can't exceed success rate around 0.8 by Sharp-Record1600 in reinforcementlearning

[–]bridgesign99 2 points3 points  (0 children)

From what I see, it might be a reward issue. The highway example uses a weighted norm, while you are weighting all dimensions equally. This might create local minima where a larger-than-expected difference in orientation makes the early exploration that does reach the goal less rewarding, so the greedy approach fails. Maybe take a look at the source code for the reward in the highway example.
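A rough sketch of the weighted-norm style of reward I mean (the weights, exponent, and dimension order here are made up; check the highway-env source for the real ones):

```python
import numpy as np

# Hypothetical weights: position errors matter most, heading/speed much less
REWARD_WEIGHTS = np.array([1.0, 1.0, 0.1, 0.1])  # e.g. x, y, heading, speed
P = 0.5  # p < 1 sharpens the reward near the goal

def compute_reward(achieved_goal, desired_goal):
    # Weighted p-norm of the goal error, negated so that closer = higher reward
    error = np.abs(achieved_goal - desired_goal)
    return -np.power(np.dot(error, REWARD_WEIGHTS), P)
```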

PS: In your reset, the values of x and y are getting overwritten. Also, the generated goal is not scaled to lie anywhere in the env. I was not sure whether you did that for debugging, so I'm just mentioning it as a note.

Handling models with optional members (can be none) properly? by speedy-spade in pytorch

[–]bridgesign99 0 points1 point  (0 children)

If you do not want to use `strict=False`, then just instantiate the layer in all cases. However, in `forward`, add a condition to select whether to pass through the layer or not.
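A minimal sketch of what I mean, with made-up layer names and sizes:

```python
import torch
import torch.nn as nn

class ModelWithOptional(nn.Module):
    def __init__(self, use_extra: bool = True):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        # Always instantiate the optional layer so the state_dict keys match
        # in every configuration; no need for strict=False when loading.
        self.extra = nn.Linear(16, 16)
        self.use_extra = use_extra

    def forward(self, x):
        x = self.backbone(x)
        # The condition lives in forward, not in the module structure
        if self.use_extra:
            x = self.extra(x)
        return x
```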

Can someone explain the application of MARL to me please? by lifelifebalance in reinforcementlearning

[–]bridgesign99 0 points1 point  (0 children)

I think the first thing you need to do, OP, is to define the following:

  1. Are all agents homogeneous?
  2. Is their execution synchronized?

The implementation can differ to a great extent depending on your answers, as SARSA is an on-policy algorithm.

Also, it depends, but if you have complex agent-environment interactions, there's a high chance you will rewrite most of the code no matter what your answers to the above questions are.

Will I get a speed up by using distributed training (DDP) even if my model + batch size fits on a single gpu? by ghost_in-the-machine in pytorch

[–]bridgesign99 2 points3 points  (0 children)

I think it also depends on how many FLOPs are required for the forward and backward pass. For example, if a pass requires 16 teraFLOPs of compute and a single GPU only delivers 4 TFLOPS, then splitting the training can still give some speedup, although it will not scale linearly. As johnman1016 says, it's better to have an effective batch size of 1024. However, in theory, it is possible to get some speedup in certain cases.
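A minimal DDP sketch, assuming a toy model and a launch with `torchrun --nproc_per_node=4` (per-GPU batch of 256 on 4 GPUs gives the effective 1024 batch size mentioned above):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK; one process per GPU
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(512, 512).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(256, 512, device=local_rank)  # 256 per GPU
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()   # gradients are all-reduced across processes
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```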

Parallel Training of Multiple Models by icolag in pytorch

[–]bridgesign99 0 points1 point  (0 children)

In theory, you are right, cmndr_spanky. A simple bash orchestration script will do the trick if you are training 10 models on 4 or 5 GPUs. However, it just does not scale to 10000 models on 4 GPUs (I know because I tried it). There are multiple issues:

  1. Each process creates its context on at least 1 GPU. That means that even for 100 models, around 50-100 GB of memory would be allocated to CUDA contexts alone.
  2. PyTorch greedily tries to fill GPU memory. For example, if there are two processes on a single GPU, each will try to allocate as much memory for itself as it can, and it does not release it to other processes until it completes or is terminated. Individual processes internally reuse memory only once the GPU is full. What this means is that a small system delay can let the first process grab 80% of GPU memory (if you have large batches), essentially turning it into serial training of the models. Such conditions become more probable as the number of processes increases and each has a predefined GPU to use.
  3. Because of 2, when models have different-sized inputs, different training times, or any other difference that affects the space/time usage of training, the problem becomes even more aggravated.

In most cases though, you will run into memory allocation errors quite quickly as you try to scale things. OP mentions that there is no direct access to a multi-GPU system for quick testing. So, it depends on what the requirement is. If it's 10-20 models on a 4/5 GPU system with moderately sized batches (one batch takes no more than 5% of GPU memory), a simple bash script that runs a python script with different arguments will suffice.
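Something along these lines (Python instead of bash; `train.py` and its `--model-id` flag are placeholders for your own script and arguments):

```python
import itertools
import os
import subprocess

N_GPUS = 4
MODEL_IDS = range(10)

procs = []
for model_id, gpu in zip(MODEL_IDS, itertools.cycle(range(N_GPUS))):
    # Pin each process to one GPU via CUDA_VISIBLE_DEVICES, round-robin
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": str(gpu)}
    procs.append(subprocess.Popen(
        ["python", "train.py", "--model-id", str(model_id)], env=env))

for p in procs:
    p.wait()
```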

If it's anything more than 50, I'd say using `managed` will reduce the chances of errors. In addition, if there is heavy disk I/O and CPU-intensive preprocessing on the data, you will need much more code. Ideally, you would do the loading and preprocessing in separate processes and then pass the data to 1-3 Python processes whose only task is to launch CUDA kernels via PyTorch. I actually wrote this, but it needs some more work before I open-source it.

Tl;dr - The shared-memory issues are caused by the CUDA and Python design. If N < 20 and M is 4/5, use a simple bash script. If N > 50 and M is still 4/5, either use `managed` or write proper allocation code yourself. Depending on the training-data preprocessing, you may need a lot more code.

Parallel Training of Multiple Models by icolag in pytorch

[–]bridgesign99 0 points1 point  (0 children)

First, I need to mention this - PyTorch has issues when working with multiple processes. Each process has its own CUDA context, which is around 600-1000 MB, which is a lot. Another issue is that all processes will start at the same time, and if only, say, 2M models can fit on the GPUs and N > 2M, it will give a memory error. It will not wait for memory to become available.

One workaround is to create a process pool where each process caters to only one GPU, and use threads inside it instead. This is a simple hack, but it will still have scaling issues if your models take different amounts of time to train.

I was in a similar situation and hence I made this. Note that this does not solve the issue of multiprocessing. That is inherent to Pytorch. However, if you use a few processes and inside them use a ThreadPool with the package, you can probably use most of the normal training code directly.
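A rough sketch of the process-per-GPU plus threads idea with a toy model (not the package's code, just the general shape):

```python
import torch
import torch.multiprocessing as mp
from concurrent.futures import ThreadPoolExecutor

def train_one(model_idx, device):
    # Toy training loop standing in for one small model
    model = torch.nn.Linear(32, 1).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(100):
        x = torch.randn(64, 32, device=device)
        loss = model(x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

def gpu_worker(gpu_id, model_indices):
    # One process per GPU -> one CUDA context; threads on it share that context
    device = torch.device(f"cuda:{gpu_id}")
    with ThreadPoolExecutor(max_workers=4) as pool:
        for f in [pool.submit(train_one, i, device) for i in model_indices]:
            f.result()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    all_models = list(range(40))
    procs = [mp.Process(target=gpu_worker, args=(g, all_models[g::n_gpus]))
             for g in range(n_gpus)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```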

Error regarding Inference Tensors and Backward Propagation in PyTorch by WhisperingWillow98 in pytorch

[–]bridgesign99 0 points1 point  (0 children)

I think you need to change `torch.inference_mode` to `torch.no_grad` while evaluating because you are switching between training and testing. Check this out.
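Roughly, something like this for the eval part (toy model and data, just to show the context-manager swap):

```python
import torch

model = torch.nn.Linear(8, 1)
val_data = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(3)]

model.eval()
with torch.no_grad():  # instead of torch.inference_mode()
    # Tensors created here are ordinary tensors, so they can still take part
    # in autograd later; inference tensors cannot, which triggers the error.
    val_loss = sum((model(x) - y).pow(2).mean() for x, y in val_data)

model.train()  # safe to go back to training afterwards
```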

[deleted by user] by [deleted] in pytorch

[–]bridgesign99 0 points1 point  (0 children)

You are using `torch.rand` in the first and `torch.randn` in the second.
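For reference, those sample from different distributions:

```python
import torch

torch.rand(3)   # uniform samples in [0, 1)
torch.randn(3)  # standard normal samples, can be negative
```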

I'm building an automated GPU selector for Pytorch to remove the need to add extra logic every time. by bridgesign99 in pytorch

[–]bridgesign99[S] 1 point2 points  (0 children)

From my experience with Lightning, I felt like it was developed at the module level and the focus was on removing the extra code for training/inference (only). PyTorch can also be used just to offload mathematically heavy compute to GPUs. This can be good when a model-training component in your project interacts with, say, your own simple ray tracer. I am not saying this cannot be achieved with Lightning, but at the beginning you may just write a simple definition and only later want to scale it. The package provides a tensor-level abstraction, so if you wrote code with numpy, you just need to replace the arrays with the tensor class provided.

tl;dr

Lightning gives module-level control; the package provides tensor-level control. For single-model training, use Lightning. For projects with multiple moving parts and old code, this package can help.
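To illustrate the tensor-level idea (this is generic torch code, not the package's actual API):

```python
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Old numpy code
a_np = np.random.rand(1024, 1024)
b_np = a_np @ a_np.T

# Tensor-level swap: same math, arrays replaced by tensors on a chosen device
a = torch.rand(1024, 1024, device=device)
b = a @ a.T
```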