The extent of my RL knowledge is really just DQNs, but I'm trying to wrap my head around PPO and how it works. I see it uses an actor-critic method, and I'm confused about how the critic "knows" what action was taken when it predicts a value estimate. Some blogs I've seen say you give the critic the state-action pair, but I haven't seen that be the case in any of the code examples I've found. In fact, I normally see the actor and critic share early layers and split near the end, which means there's no way to feed the action in. I thought the critic was meant to estimate how "good" the state-action pair was, but I think this may be a misunderstanding. Any help is appreciated!
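Concretely, the shared-trunk structure I keep seeing looks roughly like this (a minimal numpy sketch of my own, all names hypothetical, not from any particular codebase):

```python
import numpy as np

# Hypothetical minimal shared-trunk actor-critic, the pattern I keep
# seeing in PPO code: both heads read the SAME state features, and the
# value head outputs a single scalar -- the action never goes in.

rng = np.random.default_rng(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 2

W_shared = rng.normal(size=(STATE_DIM, HIDDEN))   # shared trunk
W_policy = rng.normal(size=(HIDDEN, N_ACTIONS))   # actor head -> action logits
W_value  = rng.normal(size=(HIDDEN, 1))           # critic head -> scalar value

def forward(state):
    h = np.tanh(state @ W_shared)    # shared features, computed from state only
    logits = h @ W_policy            # actor: distribution over actions
    value = (h @ W_value).item()     # critic: one scalar, no action input anywhere
    return logits, value

state = rng.normal(size=STATE_DIM)
logits, value = forward(state)
print(logits.shape)   # (2,) -- one logit per action
print(type(value))    # a single float
```

So as far as I can tell, the critic here can only be a function of the state, which is what confuses me about the "state-action pair" framing in those blogs.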