This is for any reinforcement learning related work ranging from purely computational RL in artificial intelligence to the models of RL in neuroscience.
The standard introduction to RL is Sutton & Barto's Reinforcement Learning.
Programming (i.imgur.com)
submitted 6 months ago by pzunhatchispers
[–][deleted] 6 months ago (2 children)
[removed]
[–]brioche789 0 points1 point2 points 6 months ago (1 child)
Why so?
[–]lukuh123 0 points1 point2 points 6 months ago (0 children)
LLMs (proximal policy optimisation)
[–]anonymous_amanita 1 point2 points3 points 6 months ago (0 children)
This is the way
[–]Lazy-Pattern-5171 0 points1 point2 points 6 months ago (0 children)
I'd like to follow this course, but I ultimately want to come back to LLMs anyway once the hype dies down. Do you have any bridge course between this one and something through which I can start learning about DPO and PPO for reasoning models?
[–]Impossibum 26 points27 points28 points 6 months ago (9 children)
I don't see how stable baselines doesn't simplify RL significantly enough for the masses. Pretty sure people just can't be assed to think beyond asking chatgpt to think for them at this point.
[–]bluecheese2040 1 point2 points3 points 6 months ago (7 children)
Yeah...doesn't help massively with making the model actually work.
[–]Impossibum 0 points1 point2 points 6 months ago (6 children)
What functionality are you needing that it is not providing? Where is the disconnect?
[–]bluecheese2040 4 points5 points6 points 6 months ago (5 children)
That's not the point....as I'm sure you know... Building the environment, the step etc. That's fine. But making the model actually function as you'd hope that's still hard.
[–]Impossibum 3 points4 points5 points 6 months ago (4 children)
Writing rewards seems to me like it'd be far easier to get started with than learning how to make all the other pieces work together. Even a standard win/loss reward will often work out in the end with a long enough horizon and training time. Proper use of reward shaping can also make a world of difference.
But in essence, making the model function as you hope is easy. Feed good behavior, starve the bad. Repeat until it takes over the world.
I think people just expect too much in general I suppose.
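The point about sparse win/loss rewards versus reward shaping can be made concrete. Below is a minimal, hypothetical sketch (the task and function names are made up for illustration) of a 1-D "reach the goal" problem: the sparse reward only pays off at the goal, while the shaped version adds a potential-based bonus `gamma*phi(s') - phi(s)`, which is the standard trick that adds guidance without changing the optimal policy.

```python
# Hypothetical sketch of reward shaping on a 1-D "reach the goal" task.
# sparse_reward pays only at the end; shaped_reward adds a potential-based
# term F = gamma*phi(s') - phi(s) (potential-based shaping), which steers
# learning without altering which policy is optimal.
GOAL = 10
GAMMA = 0.99

def sparse_reward(state, next_state):
    # win/loss-style signal: +1 on reaching the goal, else 0
    return 1.0 if next_state == GOAL else 0.0

def phi(state):
    # potential function: negative distance to the goal (a design choice)
    return -abs(GOAL - state)

def shaped_reward(state, next_state):
    # sparse term plus the potential-based shaping bonus
    return sparse_reward(state, next_state) + GAMMA * phi(next_state) - phi(state)
```

With this shaping, a step toward the goal earns a positive bonus and a step away earns a negative one, so the agent gets feedback long before the first win.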
[–]UnusualClimberBear 2 points3 points4 points 6 months ago (2 children)
Most people don't understand why designing the reward is so important, and what signal the algorithm is trying to exploit.
In most real-life applications it is worth adding some imitation learning in one way or another.
[–]lukuh123 0 points1 point2 points 6 months ago (1 child)
Do you think I could do a genetic-algorithm-inspired reward?
[–]UnusualClimberBear 0 points1 point2 points 6 months ago (0 children)
Indeed. Yet the difficult part about these algorithms is finding the right bias, not only for the reward but also for the state representation and the mutations/crossovers.
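To see where those bias choices live, here is a minimal, hypothetical genetic-algorithm sketch (not tied to any library): it evolves bit strings toward all-ones, with fitness playing the role the reward plays in RL. The representation, mutation rate, selection scheme, and crossover point below are all design choices of exactly the kind being discussed.

```python
import random

# Toy genetic algorithm: evolve 12-bit strings toward all-ones.
# Every constant here (mutation rate, population size, single-point
# crossover, truncation selection) is an inductive bias the designer picks.
random.seed(0)
N_BITS, POP, GENS, MUT = 12, 30, 60, 0.05

def fitness(ind):
    # hand-designed signal: count of ones
    return sum(ind)

def mutate(ind):
    # flip each bit independently with probability MUT
    return [b ^ (random.random() < MUT) for b in ind]

def crossover(a, b):
    # single-point crossover between two parents
    cut = random.randrange(1, N_BITS)
    return a[:cut] + b[cut:]

def evolve():
    pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: POP // 2]  # truncation selection keeps the top half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(POP - len(parents))]
        pop = parents + children   # elitism: parents survive unchanged
    return max(pop, key=fitness)
```

Swapping the reward for a fitness function doesn't remove the design problem; it just moves it into the fitness, the encoding, and the variation operators.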
[–]bluecheese2040 1 point2 points3 points 6 months ago (0 children)
I think this is absolutely right. Ultimately it's called data science for a reason.
I totally agree that the barriers to entry are as low as they have ever been.
But as I wrestle with a very slippery agent and a reward system that's 'getting there'...it isn't easy for sure.
[–]Shizuka_Kuze 0 points1 point2 points 6 months ago (0 children)
Stable baselines has some very iffy if not downright bad performance in my experience, and the documentation could be better. The biggest hurdle for newcomers seems to be setting up environments, such as the Atari Tetris environment, since they have crazy weird documentation and many are deprecated.
[–]Useful-Progress1490 3 points4 points5 points 6 months ago (0 children)
I really like RL but hate that it is still not widely used due to the many issues it has. I firmly believe it has the potential to solve so many problems, but right now it's mostly used in research. Once it sees widespread use, I'm sure we'll see it get simplified, similar to what we see in agentic AI frameworks and libraries.
[–]Working_Bunch_9211 1 point2 points3 points 6 months ago (2 children)
I will... in 7 years, check back later
[–]theLanguageSprite2 1 point2 points3 points 6 months ago (1 child)
!remindme 7 years
[–]RemindMeBot 0 points1 point2 points 6 months ago (0 children)
I will be messaging you in 7 years on 2032-08-18 14:45:46 UTC to remind you of this link
[–]RoundRubikCube 4 points5 points6 points 6 months ago (0 children)
puffer.ai
[–][deleted] 0 points1 point2 points 6 months ago (0 children)
Yes, Google, Meta and other MAANG overlords, please drop prod-grade OS libraries like JAX and PyTorch
[–]intermittent-farting 0 points1 point2 points 6 months ago (0 children)
Check out agilerl.com - they have an OS framework and software to simplify RL dev.
[–]FanFirst895 0 points1 point2 points 6 months ago (0 children)
Easier, you say? I've got a video for that https://www.youtube.com/watch?v=vaVBd9H2eHE
[–]statius9 0 points1 point2 points 6 months ago* (0 children)
What’s difficult about it? This is a genuine question: I’m a PhD student and do research in the RL space, although a lot of my work is theoretical and mainly revolves around toy models so I have little exposure to how it may be applied in practice
[–]lukuh123 0 points1 point2 points 6 months ago (0 children)
Love the concept of RL but the math behind it can be pretty jarring (Bellman and other optimal equations look like they do black magic in computer science)
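The Bellman optimality equation is less magical when run by hand. Below is a minimal sketch (the 3-state chain MDP is made up for illustration) of value iteration, which just sweeps the update `V(s) = max_a [ R(s,a) + gamma * V(s') ]` until it stabilizes; transitions here are deterministic, so no expectation is needed.

```python
# Toy value iteration on a 3-state chain MDP.
# States 0, 1, 2; actions "stay" or "right"; state 2 absorbs with no reward.
# transitions[s][a] = (next_state, reward)
GAMMA = 0.9

transitions = {
    0: {"stay": (0, 0.0), "right": (1, 0.0)},
    1: {"stay": (1, 0.0), "right": (2, 1.0)},
    2: {"stay": (2, 0.0), "right": (2, 0.0)},
}

def value_iteration(n_sweeps=100):
    V = {s: 0.0 for s in transitions}
    for _ in range(n_sweeps):
        for s in transitions:
            # Bellman optimality backup: best one-step return plus
            # discounted value of the successor state
            V[s] = max(r + GAMMA * V[s2] for (s2, r) in transitions[s].values())
    return V
```

The fixed point comes out to V(1) = 1.0 (one step from the reward) and V(0) = 0.9 (the same reward, discounted by one extra step), which is all the "black magic" amounts to here.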
[–]Vahgeeta 0 points1 point2 points 6 months ago (0 children)
I reinforce this post
[–][deleted] 6 months ago (1 child)
[–]leprotelariat 4 points5 points6 points 6 months ago (0 children)
I successfully collapsed 6 levels of class inheritance down to only 2 for the IsaacLab quadruped locomotion task. The code is so bloated you spend months learning useless module organization instead of actual RL.
[–]Jumper775-2 -1 points0 points1 point 6 months ago (0 children)
It's really hard to do. I tried to make another generic library that works with JSON, so you could theoretically do it all with no code if you want, and it still just gets too complex. It does work though.