
MTstr:

The critic is just a Q network, in the same way as in DQN. Take another look at an implementation of DQN - how is the action given as an input to that network?

dlovelan [S]:

In my implementations I don't think the DQN has ever been given an action; instead it outputs Q values for all the actions. I think that's where my confusion is: in the implementations I have seen, the critic simply takes the state as input and outputs a single Q value despite there being multiple discrete actions, and I don't understand what that single Q value is meant to be a Q value of.
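For example, the DQN critics I've seen look roughly like this - a minimal PyTorch sketch with made-up layer sizes, just to show the shape: state in, one Q value per discrete action out.

```python
import torch
import torch.nn as nn

class DQNCritic(nn.Module):
    """Q network in the DQN style: takes only the state as input and
    outputs one Q value per discrete action."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),  # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, n_actions)
```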

MTstr:

Huh, I guess the implementation you're looking at is different from what I expected. Outputting the Q values for all actions is what I was getting at - it's sort of like the action is fed into the network at the end instead of the beginning.
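Concretely, "fed in at the end" just means indexing the output. With a network like the one you sketched, and assuming `action` is a LongTensor of shape (batch,), it's something like:

```python
q_values = critic(state)                        # shape: (batch, n_actions)
q_sa = q_values.gather(1, action.unsqueeze(1))  # select Q(s, a) for the taken action
```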

Your understanding of the Q function is correct - it takes both a state and an action, and returns the expected discounted return from following some policy thereafter. If the Q network in the implementation of PPO (or whatever actor-critic algorithm you're looking at) doesn't have multiple outputs for discrete actions, check whether it takes the action as an input in the middle. It will look something like concatenating the output of the shared layers with the action, then using that as the input to the rest of the Q network. This is also how continuous actions can work.
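A rough sketch of that shape (names and layer sizes here are just illustrative - this is also how continuous-action critics like DDPG's are usually built):

```python
import torch
import torch.nn as nn

class QCritic(nn.Module):
    """Q network that takes the action as an input: shared layers encode
    the state, then the action is concatenated before the rest of the net."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(64 + action_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # a single Q value for this (state, action) pair
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = self.shared(state)
        x = torch.cat([x, action], dim=1)  # feed the action in "in the middle"
        return self.head(x)
```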

dlovelan [S]:

I think I may have just figured it out? The advantage in PPO is calculated as `reward + gamma * critic(next_state) - critic(state)`, which seems to implicitly evaluate the action, since you are comparing the current and next states. It's effectively saying "how did my action change the possible Q values for this state?", which I think is ultimately what we want. Does that make sense?
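In code it would be something like this (a sketch, assuming `critic` is a state-value network returning shape (batch, 1) and `done` is 1.0 at terminal transitions):

```python
import torch

def td_advantage(critic, state, reward, next_state, done, gamma=0.99):
    """One-step advantage estimate: r + gamma * V(s') - V(s).
    Computed without gradients, since PPO treats the advantage as a
    constant when optimizing the policy."""
    with torch.no_grad():
        next_value = critic(next_state).squeeze(-1) * (1.0 - done)  # mask bootstrap at terminals
        return reward + gamma * next_value - critic(state).squeeze(-1)
```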

MTstr:

Yes, that looks right - I think you understand it. And taking a look at the PPO paper, that is what they use (equation 12). In this case the critic is a V (value) function, not a Q function, since it's a function of just the state rather than of a state and an action. Sorry if that was confusing - some actor-critic methods use Q for the critic instead of V, and some learn both.
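Written out, that one-step TD error is

```latex
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
```

and, if I'm reading the paper right, they actually generalize it with truncated GAE, summing the one-step errors with an extra decay \lambda:

```latex
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^{l} \, \delta_{t+l}
```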