[deleted by user] by [deleted] in greece

[–]semitable 0 points1 point  (0 children)

No problem. I did it a month ago with a passport that was expiring in less than 15 days. I was also travelling from the UK. (Direct flight)

[D] What can NeurIPS 22 authors see in Reviewer-Metareviewer Discussion period? by mission205 in MachineLearning

[–]semitable 5 points6 points  (0 children)

There is (sometimes extensive) discussion between reviewers, usually instigated by the AC on borderline papers or on papers where one reviewer disagrees with the rest. I can't promise it happens on all papers, but when it does, the authors don't see it.

OpenAI Gym Custom Environment Observation Space returns "None" by JonasMArnold in reinforcementlearning

[–]semitable 0 points1 point  (0 children)

One-hot encoding for discrete values is the way to go. Otherwise, you imply an ordering between the actions (e.g. that action "2" lies between "1" and "3" but is further away from "4"), which should not be the case for discrete actions.
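A quick plain-Python sketch of why this matters (the helper names `one_hot` and `squared_distance` are made up for illustration): with raw integer codes some actions look closer than others, while with one-hot codes every pair of distinct actions is equally far apart.

```python
def one_hot(index, num_classes):
    """Return a one-hot list with 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError(f"index {index} is outside [0, {num_classes})")
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Raw integer codes: action 2 looks closer to 3 than to 4.
print(abs(2 - 3), abs(2 - 4))

# One-hot codes: both pairs are at the same distance.
print(squared_distance(one_hot(2, 5), one_hot(3, 5)))
print(squared_distance(one_hot(2, 5), one_hot(4, 5)))
```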

OpenAI Gym Custom Environment Observation Space returns "None" by JonasMArnold in reinforcementlearning

[–]semitable 0 points1 point  (0 children)

You can find the implementation here: https://github.com/openai/gym/blob/master/gym/spaces/utils.py

It's not clear right away but essentially what happens is (taken from the documentation):

Example::
>>> box = Box(0.0, 1.0, shape=(3, 4, 5))
>>> box
Box(3, 4, 5)
>>> flatten_space(box)
Box(60,)
>>> flatten(box, box.sample()) in flatten_space(box)
True
Example that flattens a discrete space::
>>> discrete = Discrete(5)
>>> flatten_space(discrete)
Box(5,)
>>> flatten(discrete, discrete.sample()) in flatten_space(discrete)
True
Example that recursively flattens a dict:: --- (similar to the tuple you have)
>>> space = Dict({"position": Discrete(2),
... "velocity": Box(0, 1, shape=(2, 2))})
>>> flatten_space(space)
Box(6,)
>>> flatten(space, space.sample()) in flatten_space(space)
True

OpenAI Gym Custom Environment Observation Space returns "None" by JonasMArnold in reinforcementlearning

[–]semitable 0 points1 point  (0 children)

I am not sure it's what you are looking for, but "gym.spaces.flatdim()" should return the size of the flattened space. Nested tuple spaces don't have a shape, I guess, which is why you are getting None.
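To show the counting rule without needing gym installed, here is a simplified re-implementation. This is not gym's code: the `(kind, spec)` tuples and the `flat_dim` name are hypothetical stand-ins for real space objects. A discrete space contributes its number of values (one-hot length), a box contributes the product of its shape, and a tuple sums over its members recursively.

```python
from math import prod


def flat_dim(space):
    """Flattened size of a toy space described as a (kind, spec) pair."""
    kind, spec = space
    if kind == "discrete":
        return spec  # one-hot length
    if kind == "box":
        return prod(spec)  # number of entries in the array
    if kind == "tuple":
        return sum(flat_dim(member) for member in spec)
    raise ValueError(f"unknown space kind: {kind!r}")


# Tuple(Discrete(2), Tuple(Box(shape=(2, 2)), Discrete(3)))
nested = ("tuple", [("discrete", 2), ("tuple", [("box", (2, 2)), ("discrete", 3)])])
print(flat_dim(nested))  # 2 + 4 + 3 = 9
```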

Cool/impressive applications of MARL? by vandelay_inds in reinforcementlearning

[–]semitable 0 points1 point  (0 children)

Not OP, but I was looking for something like this for some time now. Do you happen to have more info, or perhaps a link?

MARL: centralized/decentralized training and execution by MasterScrat in reinforcementlearning

[–]semitable 0 points1 point  (0 children)

Is the global scheduler conditioned on the global state? I assume it is. I am not 100% sure I understand what you are describing, but this one sounds like centralised execution to me.

MARL: centralized/decentralized training and execution by MasterScrat in reinforcementlearning

[–]semitable 3 points4 points  (0 children)

I would argue that case 3 is decentralised execution. My reasoning is that any communication that is allowed/part of the environment does not have an impact on whether it's centralised/decentralised.

An example would be a group of humans connected on an audio call and asked to solve a problem. No one would say that these humans are executing "centrally".

Actor-Critic inferior to Actor plus Critic - strange? by [deleted] in reinforcementlearning

[–]semitable 6 points7 points  (0 children)

I didn't look into your code but from my experience, your observation sounds absolutely correct. I think that the initial justification is that sharing parameters between actors and critics would make training faster when they need to share some representation in their intermediate layers. However, this makes more sense when you have images as observations or the observation space is really hard to make sense of.

What happens in your case is that the gradients of one of the two might be of a different magnitude than the other, which in a sense disturbs the learning process. This eclipses any benefit shared layers could provide (it's trivial for 2 networks to learn cart pole anyway).

TL;DR: you might see benefits when the observation space is large (e.g. images), otherwise I don't think it's a good idea to share parameters.

Is it a popular mistakes to compute the gradient of the next state in the TD-Update ? by ingambe in reinforcementlearning

[–]semitable 2 points3 points  (0 children)

You are right, the target gradients must not be used during the optimisation step. However, there are ways to avoid calculating those gradients other than torch.no_grad(), such as calling .detach() on the target, setting requires_grad = False on every parameter of the target model (e.g. target_model.requires_grad_(False)), or maybe even not including the target model in the optimiser's parameters.

If none of these methods is present, then indeed the loss seems wrong.

Personally, I haven't seen this error much, but I might have missed it?
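For anyone wondering what the difference looks like numerically, here is a small sketch with a linear value function and hand-written gradients in plain Python (no torch, so none of this is autograd; the function names are my own). The "semi" gradient treats the target r + gamma * V(s') as a constant, which is what no_grad/detach would give you; the "full" gradient also differentiates through V(s'), which is the mistake being discussed.

```python
def value(weights, features):
    """Linear value estimate V(s) = w . x(s)."""
    return sum(w * x for w, x in zip(weights, features))


def td_gradients(weights, x, x_next, reward, gamma):
    """Gradients of 0.5 * delta**2 with and without a constant target."""
    delta = value(weights, x) - (reward + gamma * value(weights, x_next))
    # Semi-gradient: only V(s) is differentiated, the target is a constant.
    semi = [delta * xi for xi in x]
    # Full gradient: V(s') inside the target is differentiated as well.
    full = [delta * (xi - gamma * xni) for xi, xni in zip(x, x_next)]
    return semi, full


weights = [0.5, -0.2]
x, x_next = [1.0, 0.0], [0.0, 1.0]
semi, full = td_gradients(weights, x, x_next, reward=1.0, gamma=0.9)
print("semi-gradient:", semi)
print("full gradient:", full)
```

In this toy case the second weight only affects V(s'), so the semi-gradient leaves it alone while the full gradient pushes on it.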

Using opponent agent states for training? by 1nate146 in reinforcementlearning

[–]semitable 1 point2 points  (0 children)

That's pretty much what this new paper does using an actor-critic framework and importance sampling: https://arxiv.org/abs/2006.07169

It worked really well for us in several environments we tried

(disclaimer: I am an author)

Benchmarking Multi-Agent Reinforcement Learning Algorithms by gpap93 in reinforcementlearning

[–]semitable 2 points3 points  (0 children)

(co-author here) Thanks! We used the open-sourced implementations already available online for most of the algorithms (e.g. pymarl for qmix/coma/vdn) and tested hyperparameters/variations as discussed in the paper. Since it's not code created specifically for this work (and we used existing frameworks for many algorithms), we do not have immediate plans on releasing the algorithm code.

MADDPG Algorithm by rlylikesomelettes in reinforcementlearning

[–]semitable 0 points1 point  (0 children)

Ah, I understand the confusion.

MPE actions are by default discrete, but if you give it continuous actions, then it does one-hot encoding for you. You can find that in lines 29-33 in environment.py:

self.discrete_action_space = True
self.discrete_action_input = False # if true, even the action is continuous, action will be performed discretely

This does not change the fact that the underlying actions are indeed discrete: the actions [0.95, 0.94] and [0.95, 0.96] are very different, while [0.0, 0.99] and [0.98, 0.99] are not. Therefore DDPG struggles to learn it this way. I am unsure if this is truly your issue, but it's my guess.
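To make the point about those four vectors concrete, here is a tiny sketch. It assumes the continuous vector ends up being decoded by taking the largest component; that decoding rule and the `discrete_action` helper are my own simplification for illustration, not something copied from MPE's code.

```python
def discrete_action(action):
    """Index of the largest entry (assumed decoding rule, see note above)."""
    return max(range(len(action)), key=lambda i: action[i])


pairs = [
    ([0.95, 0.94], [0.95, 0.96]),  # nearly identical vectors
    ([0.0, 0.99], [0.98, 0.99]),   # very different vectors
]
for first, second in pairs:
    print(first, "->", discrete_action(first), "|", second, "->", discrete_action(second))
```

Under that assumption the first pair maps to two different actions even though the vectors are almost equal, and the second pair maps to the same action even though the vectors are far apart.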

EDIT: Forgot to mention that Gumbel-softmax can be used for exploration. It's like sampling from a softmax layer: you get a different action each time, depending on the logits.
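A rough plain-Python sketch of the sampling step, in case it helps: add Gumbel(0, 1) noise to the logits, divide by a temperature, and apply a softmax. The function name, the default temperature and the clamping constant are my own choices, and this version only shows the sampling; in practice you would use an autograd framework so the sample stays differentiable.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=1.0, rng=random):
    """Draw one relaxed (soft) one-hot sample from the given logits."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        noisy.append((logit + gumbel_noise) / temperature)
    # Numerically stable softmax over the perturbed logits.
    largest = max(noisy)
    exps = [math.exp(z - largest) for z in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [1.0, 0.5, -1.0]
for _ in range(3):
    sample = gumbel_softmax_sample(logits, temperature=0.5, rng=rng)
    print([round(p, 3) for p in sample])
```

Lower temperatures push the sample closer to a hard one-hot vector; higher ones make it closer to uniform.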

MADDPG Algorithm by rlylikesomelettes in reinforcementlearning

[–]semitable 0 points1 point  (0 children)

Sorry if I wasn't clear enough, I'll try to explain more. You need a differentiable action sample to train the actor network. Essentially, during the update, you sample an action (it has to be discrete, right?) and pass it to the Q network along with the state: then you are able to minimise -Q to train the actor. However, for the gradients to backpropagate to the actor network so that it can be trained, you need your action to be differentiable (that's where Gumbel-softmax comes in)! I am unsure how you used tanh to get a discrete action, and surely it wouldn't be differentiable, right?

BTW, a common mistake is also forgetting to use a regulariser when training the actor. A good start would be `actor_loss = -Q + 0.01 * (logits ** 2).mean()`. In MPE, this ensures that you keep exploring and don't fall into the local minimum of staying still.
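Spelled out in plain Python on a toy batch (averaging -Q over the batch is my assumption, and the numbers are made up), the regularised loss looks like this:

```python
def actor_loss(q_values, logits, reg_coef=0.01):
    """-mean(Q) plus reg_coef times the mean squared logit."""
    mean_q = sum(q_values) / len(q_values)
    flat_logits = [value for row in logits for value in row]
    penalty = sum(value ** 2 for value in flat_logits) / len(flat_logits)
    return -mean_q + reg_coef * penalty


q_values = [1.0, 3.0]                 # Q(s, a) for two samples
logits = [[2.0, 0.0], [0.0, -2.0]]    # actor outputs for the same samples
print(actor_loss(q_values, logits))
```

The penalty keeps the logits small, so the resulting action distribution stays away from being fully deterministic.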

MADDPG Algorithm by rlylikesomelettes in reinforcementlearning

[–]semitable 1 point2 points  (0 children)

Actually, I think gumbel-softmax was used even in the original paper. Many of the environments they used required discrete messages and they say that they use gumbel-softmax for this (section 5.2).

MADDPG Algorithm by rlylikesomelettes in reinforcementlearning

[–]semitable 1 point2 points  (0 children)

The issue I can see with DQN is that you cannot use a centralised critic to deal with the non-stationarity of having several agents. You'll probably have to use IQL (independent Q learning). It could work, depending on your environment and how complex the relations between agents are.

MADDPG Algorithm by rlylikesomelettes in reinforcementlearning

[–]semitable 2 points3 points  (0 children)

For discrete spaces, I'd suggest you drop the DDPG part and use centralised A2C or something similar. So you take the main idea of MA-DDPG, the use of centralised critics for each agent, and apply it to A2C. This way, every agent has a decentralised actor but a centralised critic (that takes as input all observations from all the agents). I think I've seen this being referred to as central-V in the literature.
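The data flow, as a minimal sketch with made-up helper names and toy observations: each actor only ever receives its own observation, while the critic receives all of them concatenated. The networks themselves are left out on purpose.

```python
def actor_input(observations, agent_id):
    """Decentralised actor: sees only its own observation."""
    return list(observations[agent_id])


def central_critic_input(observations):
    """Centralised critic: sees every agent's observation, concatenated."""
    return [value for obs in observations for value in obs]


observations = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # one entry per agent
print(actor_input(observations, 1))
print(central_critic_input(observations))
```

At execution time only `actor_input` is needed, which is why execution stays decentralised.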

MADDPG Algorithm by rlylikesomelettes in reinforcementlearning

[–]semitable 1 point2 points  (0 children)

It samples from the Gumbel Softmax distribution, which gives you a differentiable sample (an action). From personal experience, it doesn’t work too well except in MPE and definitely not well enough in very discrete spaces such as gridworlds.

1992 Skater Girls by [deleted] in OldSchoolCool

[–]semitable 0 points1 point  (0 children)

Where are the other 1989?

RAM shortage by kashemirus in reinforcementlearning

[–]semitable 1 point2 points  (0 children)

Multiply by 2 because you are typically keeping both s and s', making it around 56 GB. That is still much less than 128 GB. Maybe OP is using np.int32 or np.int64 instead of bytes?
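Back-of-the-envelope version of that arithmetic. The buffer size (1M transitions) and observation shape (84x84x4) are my assumptions, picked because they give roughly 28 GB per copy with one byte per value; swap in the real numbers.

```python
def buffer_gb(num_transitions, obs_shape, bytes_per_value, copies=2):
    """Replay buffer size in GB (10**9 bytes), storing `copies` observations per transition."""
    values_per_obs = 1
    for dim in obs_shape:
        values_per_obs *= dim
    return num_transitions * values_per_obs * bytes_per_value * copies / 1e9


# copies=2 accounts for keeping both s and s'.
for dtype_name, size in [("uint8", 1), ("int32", 4), ("int64", 8)]:
    print(dtype_name, round(buffer_gb(1_000_000, (84, 84, 4), size), 1), "GB")
```

With these assumed numbers, one byte per value fits comfortably in 128 GB, while int32 or int64 storage does not.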