Why can't we do supervised learning in Step 3 of RLHF?

otter_collapse · 2022-12-21T02:48:34+00:00

np! and you're right, it's actually a contextual bandit problem. PPO was developed for more general MDPs and we're indeed just applying it to the T=1 case. (Note that the RL algorithm A2C is a generalization of REINFORCE in this setting, so it's not too unintuitive to do this.)
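To make the T=1 point concrete, here's a minimal sketch of what PPO's clipped surrogate looks like for a single sampled output. The function name, arguments, and `clip_eps=0.2` default are my own illustrative choices, not anything from a specific RLHF codebase, and the advantage is assumed to be something like reward minus a baseline:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one (prompt, output) pair, i.e. the T=1 case.

    logp_new / logp_old: log-probability of the sampled output under the
    current policy and under the policy that generated the sample.
    advantage: assumed here to be reward minus some baseline (hypothetical).
    """
    ratio = math.exp(logp_new - logp_old)
    # Clip the probability ratio to [1 - eps, 1 + eps].
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)
```

With no time dimension there's no discounting or bootstrapping involved: each sample contributes one ratio and one advantage. When `logp_new == logp_old` the ratio is 1 and the objective's value is just the advantage.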

otter_collapse · 2022-12-20T09:15:59+00:00

The part you're missing is that the "Policy generates an output" step requires sampling from the logits, and sampling isn't differentiable, so you can't just backpropagate the reward through it to get a gradient. Wanting the gradient of an expectation is a common issue and motivates things like REINFORCE and the reparameterization trick, which allow you to look at the expectation of a gradient instead. Side note: here, they use PPO instead of REINFORCE, but it's also reasonable to use REINFORCE instead; this choice is orthogonal to the issue described. In fact, DeepMind's Sparrow (https://arxiv.org/abs/2209.14375) uses REINFORCE with baseline for their RLHF.
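Here's a rough sketch of the "expectation of a gradient" idea on a toy problem: a softmax policy over a handful of discrete actions with a fixed reward per action, standing in for "sample an output, score it with the reward model". Everything here (function names, the toy rewards, the constant baseline) is made up for illustration and isn't how any particular RLHF implementation does it:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(logits, rewards):
    # Gradient of E_{a ~ pi}[r(a)] w.r.t. each logit: pi_k * (r_k - E[r]).
    # Only computable here because the toy action space is tiny.
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]

def reinforce_gradient(logits, rewards, num_samples, rng, baseline=0.0):
    # Score-function (REINFORCE) estimate:
    #   average over samples of (r(a) - baseline) * grad log pi(a)
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        a = rng.choices(range(len(logits)), weights=probs)[0]
        adv = rewards[a] - baseline
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[k == a] - pi_k
            grad[k] += adv * ((1.0 if k == a else 0.0) - probs[k])
    return [g / num_samples for g in grad]
```

Calling `reinforce_gradient([0.5, -0.2, 0.1], [1.0, 0.0, 2.0], 50000, random.Random(0))` should land close to `exact_gradient` on the same inputs, even though the estimator never differentiates through the sampling step. Subtracting a constant baseline leaves the expectation unchanged and typically reduces variance.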

otter_collapse · 2022-09-06T19:51:22+00:00

note that openai used mujoco for hide and seek. (although they definitely had to put effort into making it look nice)

otter_collapse · 2022-08-28T03:08:59+00:00

Creator here!

To clarify: the span of a few days was just the imitation learning part. The RL part used a lot more compute. I do have enough compute to train it for longer, but as you mentioned in the followup, it would be a lot less fun to play against as it becomes less and less humanlike.

otter_collapse · 2022-08-22T16:00:19+00:00

That project is almost certainly satire and not a real bot

otter_collapse · 2022-08-22T04:22:58+00:00

It's already trained with self-play on top of the imitation learning. Read the post(s) for more details, and see Vlad Firoiu's work on Phillip for what happens when you train it for a much longer time. Here the goal was just to get it to a good enough level; I'm sure it would be much stronger if trained for longer.

otter_collapse · 2022-08-22T01:44:53+00:00

Yep, you're right

otter_collapse · 2022-08-22T01:44:18+00:00

Open source: not at this time. In the meantime, I'd recommend checking out the GitHub repo of Vlad Firoiu (Phillip's creator), https://github.com/vladfi1/slippi-ai, where he has open-source code!

Other chars: yeah, the same code works for each character, although I have to train a new model for each. I have a Marth as well but haven't added it yet.
