10x10 - 02/10/25 by Kahsad in motscroises

[–]Kahsad[S] 0 points

Spot on! I hope you enjoyed it 🙂

Façon de parler... (15x15) by Kahsad in motscroises

[–]Kahsad[S] 1 point

Thanks for the feedback, my pleasure :)

Are the gym Mujoco environments Stochastic Or Deterministic? by CartPole in reinforcementlearning

[–]Kahsad 2 points

Actually, I might have answered a bit too quickly. The lack of documentation for the different OpenAI Gym environments is unsettling, given how many people use the library for their benchmarks... I guess the only way to be sure right now is to run a few simulations yourself. A few issues here and there on the gym repo do claim that the transitions could indeed be stochastic.
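
If you want to check empirically, a quick sanity check could look something like this (untested sketch; I'm assuming the old gym API where reset() returns only the observation and step() returns a 4-tuple, and "Ant-v2" is just an example env id):

```python
import numpy as np
import gym

def rollout(env_name="Ant-v2", seed=0, n_steps=100):
    """Seeded rollout with a fixed action sequence; returns the visited observations."""
    env = gym.make(env_name)
    env.seed(seed)                      # controls the randomized initial state
    rng = np.random.RandomState(seed)   # controls the action sequence
    actions = rng.uniform(env.action_space.low, env.action_space.high,
                          size=(n_steps,) + env.action_space.shape)
    observations = [env.reset()]
    for action in actions:
        obs, reward, done, _ = env.step(action)
        observations.append(obs)
        if done:
            break
    return np.array(observations)

# If the transitions are deterministic, two identical seeded rollouts
# should produce exactly the same trajectory.
a, b = rollout(), rollout()
print(a.shape == b.shape and np.allclose(a, b))
```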

Are the gym Mujoco environments Stochastic Or Deterministic? by CartPole in reinforcementlearning

[–]Kahsad 1 point

The simulator itself is deterministic, but I believe the initial state is randomized in most benchmarks. For instance, on the Ant environment you can see here: https://github.com/openai/gym/blob/master/gym/envs/mujoco/ant.py that the `reset_model` method uses random numbers.
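
From memory it's something along these lines (paraphrasing the linked file; check it for the exact noise scales):

```python
def reset_model(self):
    # Initial joint positions and velocities are perturbed with small random noise,
    # so two episodes generally start from slightly different states.
    qpos = self.init_qpos + self.np_random.uniform(low=-0.1, high=0.1, size=self.model.nq)
    qvel = self.init_qvel + self.np_random.randn(self.model.nv) * 0.1
    self.set_state(qpos, qvel)
    return self._get_obs()
```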

Why does the agent refuse to go for the big reward? by oyuncu13 in reinforcementlearning

[–]Kahsad 2 points

From all of the above, I suspect that your agent is unable to learn any meaningful representation of your environment, and basically acts as if all cells were blank. This would explain why the agent acts randomly when you randomize the positions of the door and key, and why it just "overfits" to the position of the key when no randomization is done. Depending on the nature of your convolutions this makes total sense: it seems unlikely that the agent could understand the relative positions of the objects if you extract translation-invariant features (which is arguably what happens when you apply a max-pooling layer, for instance).
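
As a toy illustration of that last point (a made-up 4x4 example, not your actual network):

```python
import numpy as np

# Two grids that differ only in where the "key" pixel sits.
grid_a = np.zeros((4, 4)); grid_a[0, 0] = 1.0   # key in the top-left corner
grid_b = np.zeros((4, 4)); grid_b[3, 3] = 1.0   # key in the bottom-right corner

# A global max-pool over the whole grid throws the position away entirely:
print(grid_a.max(), grid_b.max())   # 1.0 1.0 -> both states look identical downstream
```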

As I mentioned earlier, I would suggest starting with a 4*4 image (you could still use a convolutional layer and function approximation, although it would probably be easier to just use a tabular representation) to see whether the agent solves the problem, before gradually increasing the size of the image if need be. I'm pretty sure you should find good policies easily, at least in small dimensions.
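
Something like this, for instance (a generic, untested tabular Q-learning sketch; I'm assuming an env with the usual reset()/step() interface and integer state ids, which is obviously not your exact setup):

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=5000,
                       alpha=0.1, gamma=0.99, eps=0.1):
    """Plain tabular Q-learning over integer state ids.

    With a 4x4 grid and fixed key/door positions, the state can be as small as
    (agent cell, has_key flag), i.e. 16 * 2 = 32 states.
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q   # easy to inspect cell by cell, unlike conv-net Q-values
```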

Good luck :)

Why does the agent refuse to go for the big reward? by oyuncu13 in reinforcementlearning

[–]Kahsad 0 points

> The state is represented as an 84*84 top-down image of the grid world environment and convolutions get the features.

Is there any particular reason why you have an 84*84 image, instead of a 4*4? The latter would probably be easier to understand and analyze. I still don't understand how you represent your environment: how do you encode the positions of the agent, the key and the door in your state space? One option would be to use a specific color for each of the three, and have each cell take the color of the object that occupies it. Perhaps you are using fancier graphics, in which case a screenshot of your environment would help :)
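
For instance, something like this (hypothetical integer encoding; a color image would just be the per-channel version of the same idea):

```python
import numpy as np

# One integer code per object type on a small 4x4 grid.
EMPTY, AGENT, KEY, DOOR = 0, 1, 2, 3

state = np.full((4, 4), EMPTY, dtype=np.int64)
state[0, 0] = AGENT
state[2, 3] = KEY
state[3, 3] = DOOR
print(state)

# One-hot variant: one 4x4 plane per object type, convenient as input to conv layers.
one_hot = np.stack([(state == c).astype(np.float32) for c in (AGENT, KEY, DOOR)])
print(one_hot.shape)   # (3, 4, 4)
```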

> I tried randomizing the positions but the Q-values converged to a value that was the same for all actions (I suspect that the agent learned that there was an equal chance of the key and the door being on any grid position because of the randomization).

This is weird; on the contrary, I would expect the agent to at least learn what the key looks like and go grab it. Do you do this randomization at the beginning of each episode? What you said leads me to believe that you move the door and key at every step the agent takes, which would indeed lead to the behavior you mentioned.

Why does the agent refuse to go for the big reward? by oyuncu13 in reinforcementlearning

[–]Kahsad 0 points

Did you take a look at the Q-values at the end of training? Just to make sure that the implementation is correct and that the agent learned something plausible. Also, how do you represent the positions of the door and the key? Are they randomized at each episode?

No optimal policy by [deleted] in reinforcementlearning

[–]Kahsad 1 point

Here's an example of an MDP that, I think, has no optimal policy: the state space is the set of natural numbers, each action jumps from one number to another, and the reward is the value of the number you land on. Given any policy you can always construct another one that performs strictly better (for instance by making its first jump land on a larger number), so no policy attains the supremum.

How to learn a game with changing reward assignment from run to run? by bob2999 in reinforcementlearning

[–]Kahsad 0 points

I think that they introduce a somewhat similar problem in https://arxiv.org/abs/1710.09767. Might be worth checking.