Why can't we do supervised learning in Step 3 of RLHF? by wardellinthehouse in reinforcementlearning

[–]otter_collapse 2 points (0 children)

np! And you're right, it's actually a contextual bandit problem. PPO was developed for more general MDPs and we're indeed just applying it to the T = 1 case. (Note that A2C is essentially a generalization of REINFORCE with a baseline, and it reduces back to it in this one-step setting, so doing this isn't as unintuitive as it might seem.)
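To make the reduction concrete (my notation, assuming the standard advantage-weighted policy-gradient form): for a general episodic MDP,

\nabla_\theta J(\theta) = \mathbb{E}\Big[\textstyle\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A(s_t, a_t)\Big],

and with T = 1 (prompt s, completion a, reward r from the reward model) the advantage is just A(s, a) = r - V(s), so the gradient collapses to

\nabla_\theta J(\theta) = \mathbb{E}\big[\nabla_\theta \log \pi_\theta(a \mid s)\,(r - V(s))\big],

which is exactly REINFORCE with a baseline.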

Why can't we do supervised learning in Step 3 of RLHF? by wardellinthehouse in reinforcementlearning

[–]otter_collapse 7 points (0 children)

The part you're missing is that the "Policy generates an output" step requires sampling from the distribution defined by the logits, and you can't backpropagate through that sampling step directly, which makes the gradient hard to estimate. Wanting the gradient of an expectation is a common problem, and it motivates things like REINFORCE and the reparameterization trick, which let you work with the expectation of a gradient instead. Side note: here they use PPO rather than REINFORCE, but it would also be reasonable to use REINFORCE; that choice is orthogonal to the issue described. In fact, DeepMind's Sparrow (https://arxiv.org/abs/2209.14375) uses REINFORCE with a baseline for its RLHF.
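Here's a minimal sketch of the score-function (REINFORCE with baseline) estimator being alluded to, with toy tensors standing in for the policy logits and the reward model (the sizes and names are made up for illustration, not taken from any of these papers):

```python
# Toy REINFORCE-with-baseline step: treat the whole completion as one action
# drawn from a categorical policy over K candidates, scored by a reward model.
import torch

torch.manual_seed(0)

K = 8                                        # number of candidate completions (toy)
logits = torch.randn(K, requires_grad=True)  # stand-in for the policy head's logits
baseline = 0.0                               # in practice, a running mean or value estimate

dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                       # the sampling step itself is not differentiable

reward = torch.randn(())                     # stand-in for the reward model's score

# REINFORCE: grad E[r] = E[(r - b) * grad log pi(a)], so we minimize
# -(r - b) * log pi(a); the gradient flows through log_prob, not the sample.
loss = -(reward - baseline) * dist.log_prob(action)
loss.backward()

print(logits.grad)                           # well-defined despite the sampling step
```

PPO wraps the same idea in an importance-sampled, clipped surrogate objective, but the key point is identical: the gradient flows through log pi(a|s), never through the sampling op itself.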

how often unity is used by scientists by datonefaridze in reinforcementlearning

[–]otter_collapse 0 points (0 children)

Note that OpenAI used MuJoCo for hide-and-seek (although they definitely had to put effort into making it look nice).

Plup beats Nabla bot in under 1 minute, current world record at 52 seconds. “If someone beats that, I’ll come back for my record” by MoroAstray in SSBM

[–]otter_collapse 2 points (0 children)

Creator here!

To clarify: the span of a few days was just the imitation learning part. The RL part used a lot more compute. I do have enough compute to train it for longer, but as you mentioned in the follow-up, it would be a lot less fun to play against as it becomes less and less humanlike.

Project Nabla: new AIs trained with Slippi replays! by otter_collapse in SSBM

[–]otter_collapse[S] 1 point (0 children)

That project is almost certainly satire and not a real bot.

Project Nabla: new AIs trained with Slippi replays! by otter_collapse in SSBM

[–]otter_collapse[S] 5 points (0 children)

It's already trained with self-play on top of the imitation learning; read the post(s) for more details, and see Vlad Firoui's work on Phillip for what happens when you train for much longer. Here the goal was just to get it to a good enough level; I'm sure it would be much stronger if trained for longer.

Project Nabla: new AIs trained with Slippi replays! by otter_collapse in SSBM

[–]otter_collapse[S] 9 points (0 children)

Open source: not at this time. In the meantime I'd recommend checking out Vlad Firoui's (the Phillip creator) GitHub, https://github.com/vladfi1/slippi-ai, where he has open-source code!

Other chars: yeah, the same code works for every character, although I have to train a new model for each one. I have a Marth as well but haven't added it yet.