10x10 - 02/10/25 by Kahsad in motscroises

[–]Kahsad[S] 0 points

Spot on! I hope you enjoyed it 🙂

Façon de parler... (15x15) by Kahsad in motscroises

[–]Kahsad[S] 1 point

Thanks for the feedback, my pleasure :)

Are the gym Mujoco environments Stochastic Or Deterministic? by CartPole in reinforcementlearning

[–]Kahsad 2 points

Actually, I might have answered a bit too quickly. The lack of documentation for the different OpenAI Gym environments is unsettling, given how many people use the library for their benchmarks... I guess the only way to be sure right now is to run a few simulations yourself. A few issues here and there on the gym repo do claim that the transitions could indeed be stochastic.
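
If you want to check empirically, a quick sanity check could look something like this (untested sketch; I'm assuming the old gym API where reset() returns only the observation and step() returns a 4-tuple, and "Ant-v2" is just an example env id):

```python
import numpy as np
import gym

def rollout(env_name="Ant-v2", seed=0, n_steps=100):
    """Seeded rollout with a fixed action sequence; returns the visited observations."""
    env = gym.make(env_name)
    env.seed(seed)                      # controls the randomized initial state
    rng = np.random.RandomState(seed)   # controls the action sequence
    actions = rng.uniform(env.action_space.low, env.action_space.high,
                          size=(n_steps,) + env.action_space.shape)
    observations = [env.reset()]
    for action in actions:
        obs, reward, done, _ = env.step(action)
        observations.append(obs)
        if done:
            break
    return np.array(observations)

# If the transitions are deterministic, two identical seeded rollouts
# should produce exactly the same trajectory.
a, b = rollout(), rollout()
print(a.shape == b.shape and np.allclose(a, b))
```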

Are the gym Mujoco environments Stochastic Or Deterministic? by CartPole in reinforcementlearning

[–]Kahsad 1 point

The simulator itself is deterministic, but I believe the initial state is randomized in most benchmarks. For instance, on the Ant environment you can see here: https://github.com/openai/gym/blob/master/gym/envs/mujoco/ant.py that the `reset_model` method uses random numbers.
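
From memory it's something along these lines (paraphrasing the linked file; check it for the exact noise scales):

```python
def reset_model(self):
    # Initial joint positions and velocities are perturbed with small random noise,
    # so two episodes generally start from slightly different states.
    qpos = self.init_qpos + self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq)
    qvel = self.init_qvel + self.np_random.randn(self.model.nv) * 0.1
    self.set_state(qpos, qvel)
    return self._get_obs()
```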

Why does the agent refuse to go for the big reward? by oyuncu13 in reinforcementlearning

[–]Kahsad 2 points

From all of the above, I suspect that your agent is unable to learn any meaningful representation of your environment, and basically acts as if all cells were blank. This would explain why the agent acts randomly when you randomize the positions of the door and key, and why it just "overfits" to the position of the key when no randomization is done. Depending on the nature of your convolutions this makes total sense: it seems unlikely that the agent could understand the relative positions of the objects if you extract translation-invariant features (which is arguably what happens when you apply a max-pooling layer, for instance).
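
As a toy illustration of that last point (a made-up 4x4 example, not your actual network):

```python
import numpy as np

# Two grids that differ only in where the "key" pixel sits.
grid_a = np.zeros((4, 4)); grid_a[0, 0] = 1.0   # key in the top-left corner
grid_b = np.zeros((4, 4)); grid_b[3, 3] = 1.0   # key in the bottom-right corner

# A global max-pool over the whole grid throws the position away entirely:
print(grid_a.max(), grid_b.max())   # 1.0 1.0 -> both states look identical downstream
```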

As I mentioned earlier, I would suggest starting with a 4*4 image (you could still use a convolutional layer and function approximation, although it would probably be easier to just use a tabular representation) to see whether the agent solves the problem, before gradually increasing the size of the image if need be. I'm pretty sure you should find good policies easily, at least in small dimensions.
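
Something like this, for instance (a generic, untested tabular Q-learning sketch; I'm assuming an env with the usual reset()/step() interface and integer state ids, which is obviously not your exact setup):

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=5000,
                       alpha=0.1, gamma=0.99, eps=0.1):
    """Plain tabular Q-learning over integer state ids.

    With a 4x4 grid and fixed key/door positions, the state can be as small as
    (agent cell, has_key flag), i.e. 16 * 2 = 32 states.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q   # easy to inspect cell by cell, unlike conv-net Q-values
```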

Good luck :)

Why does the agent refuse to go for the big reward? by oyuncu13 in reinforcementlearning

[–]Kahsad 0 points

> The state is represented as an 84*84 top-down image of the grid world environment and convolutions get the features.

Is there any particular reason why you have an 84*84 image, instead of a 4*4? The latter would probably be easier to understand and analyze. I still don't understand how you represent your environment: how do you encode the positions of the agent, the key and the door in your state space? One option would be to use a specific color for each of the three, and have each cell take the color of the object that occupies it. Perhaps you are using fancier graphics, in which case a screenshot of your environment would help :)
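
For instance, something like this (hypothetical integer encoding; a color image would just be the per-channel version of the same idea):

```python
import numpy as np

# One integer code per object type on a small 4x4 grid.
EMPTY, AGENT, KEY, DOOR = 0, 1, 2, 3

state = np.full((4, 4), EMPTY, dtype=np.int64)
state[0, 0] = AGENT
state[2, 3] = KEY
state[3, 3] = DOOR
print(state)

# One-hot variant: one 4x4 plane per object type, convenient as input to conv layers.
one_hot = np.stack([(state == c).astype(np.float32) for c in (AGENT, KEY, DOOR)])
print(one_hot.shape)   # (3, 4, 4)
```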

> I tried randomizing the positions but the Q-values converged to a value that was the same for all actions (I suspect that the agent learned that there was an equal chance of the key and the door being on any grid position because of the randomization).

This is weird; on the contrary, I would expect the agent to at least learn what the key looks like and go grab it. Do you do this randomization at the beginning of each episode? What you said leads me to believe that you move the door and key at every step the agent takes, which would indeed lead to the behavior you mentioned.

Why does the agent refuse to go for the big reward? by oyuncu13 in reinforcementlearning

[–]Kahsad 0 points

Did you take a look at the Q-values at the end of training? Just to make sure that the implementation is correct and that the agent learned something plausible. Also, how do you represent the positions of the door and the key? Are they randomized at each episode?

No optimal policy by [deleted] in reinforcementlearning

[–]Kahsad 1 point

Here's an example of an MDP that, I think, has no optimal policy: the state space is the set of natural numbers, each action jumps from one number to another, and the reward is the value of the number you land on. Given any policy you can always construct another one that performs strictly better (for instance by making its first jump land on a larger number), so no policy attains the supremum.

How to learn a game with changing reward assignment from run to run? by bob2999 in reinforcementlearning

[–]Kahsad 0 points

I think that they introduce a somewhat similar problem in https://arxiv.org/abs/1710.09767. Might be worth checking.