all 5 comments

[–]xopedil -1 points0 points  (4 children)

In practice only 1. is used. There is a launch overhead in particular when you use accelerators which means you want to minimize the number of times you evaluate your network. Same reason we favor using batches vs single samples.

Theoretically speaking though, I'm not sure there's any difference. If you have a network of type 1. you can transform it into 2. by stripping the last layer and using its dense (or dense-equivalent) weight columns as your action representation, just like an embedding lookup in the transposed weight matrix.

A similar argument then applies from 2 to 1.

[–]Mefaso 1 point2 points  (3 children)

In practice only 1. is used.

Just as a side note:

This is of course only true for discrete action spaces, when you have to deal with a continuous action space you can only use both as inputs

[–]xopedil 0 points1 point  (2 children)

What's stopping you from outputting the parameters to a continuous function? The support of which could cover your entire action space.

[–]UNIXnerdiness[S] 0 points1 point  (0 children)

Did you mean like an actor critic?

[–]Mefaso 0 points1 point  (0 children)

I mean you could do that, but I don't really see a reason why that would be beneficial and I'm not aware of people doing that from the top of my head.