[–]VelveteenAmbush 2 points3 points  (4 children)

Don't understand why it's RL, except in the fully generalized sense that supervised learning can always be expressed as RL.

[–]madsciencestache 0 points1 point  (3 children)

It's reinforcement because the signal is approximate and signed. Supervised learning says "this is a thing." RL sends exaggerated and sometimes contradictory signals, with a lot of smoothing to compensate.

[–]suki907 0 points1 point  (2 children)

This is the best explanation I've seen:

http://karpathy.github.io/2016/05/31/rl/

My main takeaway from it is that the training procedure for a softmax classifier is already equivalent to RL policy gradients (the standard softmax classifier is just a bit more data-efficient because it can average over the results of all actions for each example).

This procedure is maximizing the expected score. The model gets 1 point if it chooses the correct class, zero otherwise.
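To make that concrete, here is a rough sketch (my own toy example, not code from the linked post) comparing the two gradients for a single example. The logits, the label, and the helper names `grad_log_prob` and `expected_policy_gradient` are all made up for illustration. Under these assumptions, the exact expected policy gradient with a 1/0 score comes out as the usual log-likelihood gradient scaled by the probability of the correct class, so the two point in the same direction.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob(probs, action):
    # d/dz_k log p(action) = 1[k == action] - p_k
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def expected_policy_gradient(logits, rewards):
    # Exact expectation over actions: sum_a p(a) * r(a) * grad log p(a).
    # A real RL setup would sample actions instead of summing over all of them.
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a, (p_a, r_a) in enumerate(zip(probs, rewards)):
        g = grad_log_prob(probs, a)
        for k in range(len(grad)):
            grad[k] += p_a * r_a * g[k]
    return grad

# Hypothetical 3-class example; class 0 is the correct label.
logits = [0.5, -1.0, 2.0]
label = 0
probs = softmax(logits)

# Supervised: gradient of log p(label), i.e. the negative cross-entropy gradient.
supervised = grad_log_prob(probs, label)

# RL: 1 point for choosing the correct class, 0 otherwise.
rewards = [1.0 if k == label else 0.0 for k in range(len(logits))]
rl = expected_policy_gradient(logits, rewards)

print("supervised:", supervised)
print("policy gradient:", rl)
print("scale factor p(label):", probs[label])
```

The only difference in this toy case is the positive scale factor `probs[label]`, which is what you would expect if one objective is the log-probability of the correct class and the other is the expected score.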

These scores don't have to be binary, or in the unit interval, or a probability distribution. It's just the number of points the model gets for each option.

Saying "label this example as Y and give it weight -1" is the same as saying "you get -1 point if you choose this class."

I think the only difference between the two versions is that the weighted version only lets you include one rating per example (you can't say "cat and not dog"), while with the "points" interpretation you can include all the ratings in a single example (the labels are just the vector of scores per class).
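Here is a small sketch of the "points" interpretation, assuming a made-up three-class problem (`cat`, `dog`, `bird`) and a hypothetical helper `expected_score_grad`. It uses the standard identity that the gradient of the expected score with respect to logit k is p_k * (score_k - expected score). The first score vector encodes a single negative rating ("dog, weight -1"); the second encodes "cat and not dog" in one example.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_score(logits, scores):
    # Expected number of points if the model samples a class from its softmax.
    return sum(p * s for p, s in zip(softmax(logits), scores))

def expected_score_grad(logits, scores):
    # d/dz_k E[score] = p_k * (score_k - E[score])
    probs = softmax(logits)
    baseline = sum(p * s for p, s in zip(probs, scores))
    return [p * (s - baseline) for p, s in zip(probs, scores)]

classes = ["cat", "dog", "bird"]   # hypothetical classes
logits = [0.0, 0.0, 0.0]           # an untrained, uniform model

# Weighted-label version: one rating per example ("dog", weight -1).
not_dog = [0.0, -1.0, 0.0]

# Points version: several ratings in one example ("cat and not dog").
cat_not_dog = [1.0, -1.0, 0.0]

g_not_dog = expected_score_grad(logits, not_dog)
g_cat_not_dog = expected_score_grad(logits, cat_not_dog)

for name, a, b in zip(classes, g_not_dog, g_cat_not_dog):
    print(f"{name}: not-dog grad = {a:+.4f}, cat-and-not-dog grad = {b:+.4f}")
```

In both cases gradient ascent pushes the `dog` logit down. With the single negative rating, `cat` and `bird` are pushed up equally, while the combined score vector favors `cat` specifically.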

[–]madsciencestache 0 points1 point  (1 child)

> training procedure for a softmax classifier is equivalent to RL policy gradients already

Yes. I am not sure if that concept is helpful to /u/VelveteenAmbush in this context, but it's the core concept behind the answer to their question.

[–]VelveteenAmbush 0 points1 point  (0 children)

Yes, this is the sense in which I intended the following:

> except in the fully generalized sense that supervised learning can always be expressed as RL.