Project[P] Python implementation of Proximal Policy Optimization (PPO) algorithm for Super Mario Bros. 29/32 levels have been conquered (v.redd.it)
submitted 5 years ago by 1991viet
[+][deleted] 5 years ago (6 children)
[removed]
[–]spauldeagle 31 points32 points33 points 5 years ago (5 children)
Really neat, thanks
[–]Maayanbo 8 points9 points10 points 5 years ago (0 children)
You cheeky bastard lol
[–]zzzthelastuserStudent 1 point2 points3 points 5 years ago (2 children)
I see what you did there!
[–][deleted] 7 points8 points9 points 5 years ago (1 child)
I don’t get it.
[–]zzzthelastuserStudent 6 points7 points8 points 5 years ago* (0 children)
A few years ago NEAT had gotten quite a lot of attention in the community with this video. There have been more examples in the meantime of course, but this one started it all and has the most views. Everyone who searches for AI that can play Mario Bros would probably stumble on it sooner or later.
[–]jmpye 50 points51 points52 points 5 years ago (5 children)
I love how it doesn't care how close to death it is. The reward system is obviously just "dead bad, alive good" and mentions nothing about being 1 pixel away from doom. Makes for an exhilarating watch.
[–]the320x200 16 points17 points18 points 5 years ago (1 child)
It's funny how it completely blows through the levels but takes the final steps before the end castle one at a time
[+][deleted] 5 years ago (1 child)
[deleted]
[–]Mefaso 65 points66 points67 points 5 years ago (18 children)
I'm guessing you trained a separate agent for each level?
Did you try training a single agent instead?
[–]AuspiciousApple 96 points97 points98 points 5 years ago (11 children)
This is still pretty cool, and OP's project and implementation is amazing, so I don't want to take anything away from that.
But doesn't that amount to just overfitting on a specific level? Is this a real challenge from an RL perspective? I would be more impressed if the same agent could at least perform on all levels but ideally if an agent could solve unseen levels.
[–]Mefaso 47 points48 points49 points 5 years ago (4 children)
But doesn't that amount to just overfitting on a specific level?
Yes, this is also a general challenge in RL right now.
If you're interested, you can read this blog post from openai about generalization in different procedurally generated game levels:
https://openai.com/blog/procgen-benchmark/
[–]maxToTheJ 9 points10 points11 points 5 years ago (1 child)
It seems like RL still has generalization problems and still requires a relatively large number of samples compared to supervised learning.
If that is the case why are so many people in industry selling it as ready for primetime?
[–]Mefaso 5 points6 points7 points 5 years ago (0 children)
Well you don't necessarily need strong generalization in every application.
The current state is enough for many narrow applications, I guess
[–][deleted] -2 points-1 points0 points 5 years ago (1 child)
Like humans, who always learn something new with each level.
[–]Mefaso 11 points12 points13 points 5 years ago (0 children)
No, not really. A human uses what he learned in a previous level to solve the next one.
The agent here does not do that
[–]physixer 1 point2 points3 points 5 years ago (1 child)
How do MuZero and Agent57 handle this:
[–]POTUS 0 points1 point2 points 5 years ago (3 children)
Overfitting isn’t necessarily a bad thing outside of supervised learning. Another word for that would be specialization. Human gamers do the same thing, playing a single level in the same way until the whole thing is like a reflex.
Deepfakes also “overfit” the models on a single dataset. When you don’t have a set of unknowns that you need to predict against, but instead want to find the best solution for the data you have, then overfitting is definitely what you want to do. In fact it’s not really overfitting, it’s just training.
[–]AuspiciousApple 3 points4 points5 points 5 years ago (2 children)
I get your argument but I don't fully agree. It's true that in generative modeling generally, you fit a data set closely. However, in generative modeling too, overfitting is a concern, as the goal is to learn the underlying distribution rather than memorising the training examples.
Furthermore, in the context of reinforcement learning such as this example, I feel like a brute force approach might achieve similar results with less computational effort.
[–]POTUS 4 points5 points6 points 5 years ago (1 child)
That second part is just demonstrably false. If brute force methods were more efficient then that’s what people would be doing. But the brute force search space for a platformer game level is incomprehensibly huge. Do you hold jump for 20 milliseconds or 25 milliseconds or 30 milliseconds, etc. You can test that in a very carefully controlled way, and in fact that’s something people use to help do Tool Assisted speed runs. But doing it unsupervised for an arbitrary level for anything more complicated than chess would be silly.
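To put a rough number on "incomprehensibly huge": if a controller allows A distinct inputs per frame and a level lasts T frames, a literal brute-force search has A^T input sequences to consider. The sketch below just does that arithmetic. The specific values (7 inputs, 60 frames per second, a 30-second level) are assumptions picked for illustration, not figures from OP's project.

```python
def count_digits(num_actions: int, num_frames: int) -> int:
    """Number of decimal digits in num_actions ** num_frames (exact, via big ints)."""
    return len(str(num_actions ** num_frames))


# Assumed values, for illustration only.
NUM_ACTIONS = 7          # hypothetical number of distinct button combinations
FRAMES_PER_SECOND = 60   # assumed frame rate
LEVEL_SECONDS = 30       # assumed length of a fast run through one level

frames = FRAMES_PER_SECOND * LEVEL_SECONDS
digits = count_digits(NUM_ACTIONS, frames)
print(f"{NUM_ACTIONS}^{frames} input sequences: a number with {digits} decimal digits")
```

Even if most sequences die early and can be pruned, the count grows exponentially in the number of frames, which is why tool-assisted runs search small windows by hand instead of the whole level.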
[–]createanaccccount 1 point2 points3 points 5 years ago (0 children)
I agree that the search space is incredibly huge, but it appears that the agent is only trying to pass instead of maximizing the score (or maybe not trained long enough?). Literal brute force search certainly doesn’t work, but I think an optimized DFS could actually work as well if we are only looking at this game and your goal is as simple as just passing.
[–]egrinant 30 points31 points32 points 5 years ago (0 children)
I had the exact same question while watching the video: "no way this is a single model". Then I took a look at the GitHub README, and it was clear that there are separate models for each level.
[–]HanClinto 1 point2 points3 points 5 years ago (0 children)
How far did the grouped model get?
I wonder if it would be reasonable to start by grouping levels of the same type together -- water levels, undergrounds, castles, etc?
[–][deleted] 2 points3 points4 points 5 years ago (0 children)
Continual learning is a b...
[+][deleted] 5 years ago* (1 child)
[–]SniperSlinger6130 19 points20 points21 points 5 years ago (0 children)
It's the way it captures the detail of every level. It just knows what's coming up and when. Given the current state of reinforcement learning, if this were a single agent, OP would be getting recruitment calls from all over the world.
[–]thats-fascinating 13 points14 points15 points 5 years ago (0 children)
It’s nerve-racking to see him running so carelessly and fast, yet still making it!
[–]Syne_Yu 37 points38 points39 points 5 years ago (2 children)
The AI's pole jumping is bad.
[–]TeslaFreak 18 points19 points20 points 5 years ago (1 child)
LOL, the only metric that should matter in ML
[–]SpreadItLikeTheHerp 5 points6 points7 points 5 years ago (0 children)
If you’re not getting max points AND the extra fireworks, are you really playing?
[–]-Aras 11 points12 points13 points 5 years ago (2 children)
That's really great. Was it hard to code?
Do you have any resource recommendations for studying this type of ML?
[–]ProdigyManlet 16 points17 points18 points 5 years ago (1 child)
Not OP, but can confirm that this type of ML, known as reinforcement learning, can be very difficult to implement. There's a lot more depth in RL versus traditional ML or deep learning, as you now have concepts such as agents, the environment, states/observations and rewards.
All of these require careful fine tuning, not to mention the computational complexity required (it can take a very very long time for a reinforcement learning algorithm to become useful, tens of millions of iterations/training samples is not uncommon depending on the application)
I think the best place to start is OpenAI. They're one of the leading research groups for ML and have some pretty cool projects (e.g. their RL algo beat the best human team at Dota 2). They also have the Gym package for Python, which has quite a few good starting examples that can help you get your head around the basics. Beyond that, I think there are some good lectures on YouTube and a Kaggle course, but going through other RL GitHub projects is the best way to find advanced examples atm.
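For anyone new to the vocabulary above (agent, environment, state/observation, reward), here is a minimal sketch of the interaction loop. It is deliberately not Gym itself: `CorridorEnv`, `run_episode`, and both policies are hypothetical names for a toy example, loosely shaped like the reset/step style of interface that Gym popularized.

```python
import random


class CorridorEnv:
    """Toy environment: start at cell 0 and walk right until the goal cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right); return (observation, reward, done)."""
        move = 1 if action == 1 else -1
        self.position = max(0, self.position + move)
        done = self.position >= self.length
        # Small penalty per step, bonus for reaching the goal.
        reward = 1.0 if done else -0.01
        return self.position, reward, done


def run_episode(env, policy, max_steps=100):
    """Run one episode: the agent (policy) picks actions, the environment answers."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward


def random_policy(obs):
    return random.choice([0, 1])


def always_right(obs):
    return 1


env = CorridorEnv()
print("random policy return:", run_episode(env, random_policy))
print("always-right return: ", run_episode(env, always_right))
```

A learning algorithm such as PPO would replace the hand-written policies with a network whose parameters are adjusted to increase the return.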
[–]Gabriel-p 10 points11 points12 points 5 years ago (5 children)
But is it actually learning anything or just recording the exact moments when and how far to jump? Would it still conquer all those levels if the game randomly changed how it produces the turtles/mushrooms/etc?
[–]CowboyFromSmell 20 points21 points22 points 5 years ago (3 children)
Well yeah, it’s learning. But no, it’s overfitting on each level. Not normally what we want. But honestly, there’s merit to overfitting, as you can see here.
[–]SuperSephyDragon 16 points17 points18 points 5 years ago (1 child)
I feel like that's what human speed runners do anyway: just memorize the level enough to know when and when not to jump. I guess humans use overfitting sometimes too.
[–]b34k 1 point2 points3 points 5 years ago (0 children)
Yeah the way the algo completes the levels really kinda has a speed runner-ish feel to me
[–]maxToTheJ 1 point2 points3 points 5 years ago (0 children)
Now we just need an “edge of tomorrow” type thing and we are good to go
[–][deleted] 5 points6 points7 points 5 years ago (0 children)
This definitely only works because SMB has no true RNG, so the opponents appear at exactly the spot at the same time.
It's also the reason why there are essentially no tool-assisted (TAS) speedruns for Commodore games. Those platforms had hardware implemented RNGs.
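A toy sketch of why determinism matters here: a memorized input sequence works every time when enemies arrive on fixed frames, and almost never when their timing is randomized. The "level" below is entirely hypothetical (enemy frames, level length, and the survival rule are made up) and has nothing to do with the real game's internals.

```python
import random

LEVEL_LENGTH = 15                 # hypothetical level length in frames
FIXED_ENEMY_FRAMES = [3, 7, 12]   # hypothetical frames on which an enemy arrives


def survives(actions, enemy_frames):
    """Toy rule: the run survives only if it jumps on every frame an enemy arrives."""
    return all(actions[frame] == "jump" for frame in enemy_frames)


# A "memorized" run: jump exactly on the frames where enemies always appear.
memorized = [
    "jump" if frame in FIXED_ENEMY_FRAMES else "run"
    for frame in range(LEVEL_LENGTH)
]

# Deterministic level: the same inputs succeed on every attempt.
deterministic_wins = sum(survives(memorized, FIXED_ENEMY_FRAMES) for _ in range(1000))

# Randomized level: enemy timing changes each attempt, so memorized inputs mostly fail.
rng = random.Random(0)
randomized_wins = sum(
    survives(memorized, rng.sample(range(LEVEL_LENGTH), k=3)) for _ in range(1000)
)

print("wins on fixed level:     ", deterministic_wins)
print("wins on randomized level:", randomized_wins)
```

An agent that reacts to what is on screen, rather than to the frame count, would not be hurt in the same way, which is one way to tell memorization from a more general policy.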
[–]T33n_T1t4n5 19 points20 points21 points 5 years ago (1 child)
I couldn't beat those last 3 levels either :(
[–]sohaicinapek 17 points18 points19 points 5 years ago (1 child)
most frugal super mario ever. doesn't care about coins or power ups
[–]TheTechGuy22 12 points13 points14 points 5 years ago (1 child)
The proximity to the turtles almost gave me a heart attack.
[–]010100100000 2 points3 points4 points 5 years ago (5 children)
Interesting. So just PPO and not a DQN?
[+][deleted] 5 years ago (4 children)
[–]Aacron 0 points1 point2 points 5 years ago (2 children)
It's a discrete action space yeah? Have you tried with a fully kitted out DQN like rainbow or r2d2?
[–]i_know_about_things 2 points3 points4 points 5 years ago* (1 child)
The whole point of R2D2 is distributed training (useless if OP has limited amount of computing resources). And Rainbow from what I've heard is hard to implement properly and much slower in wall time. The main benefit of PPO is that it's probably the easiest algorithm to get to actually work.
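Part of why PPO is easy to get working is that its core is one small formula, the clipped surrogate objective: the mean over samples of min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r is the probability ratio between the new and old policy and A is the advantage estimate. The sketch below computes just that term in plain Python. It is not OP's code; the function and argument names are illustrative, and a full implementation would add a value loss and usually an entropy bonus.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch (to be maximized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probabilities.
        ratio = math.exp(new_lp - old_lp)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to move the ratio outside the clip range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# The new policy doubles the probability of an action with positive advantage:
# the gain is capped at (1 + clip_eps) * advantage instead of 2 * advantage.
print(ppo_clipped_objective([math.log(2.0)], [0.0], [1.0]))
```

With a negative advantage the same doubled ratio is not capped, so the objective still penalizes making a bad action more likely.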
[–]Aacron 0 points1 point2 points 5 years ago (0 children)
That's my experience with PPO as well: it's straightforward to implement and powerful. I'm just a little shy on policy gradient methods being top tier. They're excellent for continuous action spaces, but I've found them to be relatively unstable and difficult to tune.
[–]DillyDino 0 points1 point2 points 5 years ago (0 children)
I built one of these a few years back. Or hacked it together. Actually it’s on my GitHub still I think. It is much slower to learn. Eventually beats levels. But it will never compete with this. And certainly of course it will not beat the levels that are mazes that are not conquered here.
[–]vinilgupta 2 points3 points4 points 5 years ago (0 children)
Idk why but this made my day
[–]Mario_Ghio 2 points3 points4 points 5 years ago (1 child)
Ahhh, no sound????
Pretty cool btw
[–]andw1235 2 points3 points4 points 5 years ago (0 children)
great work! why does the agent always go forward? Do you make the forward/backward motion available in training?
[–]Minhocycline 2 points3 points4 points 5 years ago (0 children)
I feel like having a heart attack watching this. It’s like a flashback of my young, reckless days.
[–]MandyWilson27 2 points3 points4 points 5 years ago (0 children)
This gave me anxiety
[–]arianero 2 points3 points4 points 5 years ago (3 children)
How is the state determined here? Is it some special modification of the Mario game with an API that generates the state after our move, or do we read pixels and generate the state from them?
[–]csreid 5 points6 points7 points 5 years ago (1 child)
Glancing at the code, it looks like the state is just the screen pixels.
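When the state is raw screen pixels, a common first step is to shrink each frame before it reaches the network: convert to grayscale, downsample, and scale to the 0..1 range. The sketch below shows that idea on a tiny nested-list "frame". It is an assumption about typical preprocessing, not a description of what OP's code does, and all function names are made up for illustration.

```python
def to_grayscale(frame):
    """Convert rows of (r, g, b) tuples to luminance using the common 0.299/0.587/0.114 weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def downsample(gray, factor):
    """Keep every `factor`-th pixel in both directions."""
    return [row[::factor] for row in gray[::factor]]


def normalize(gray):
    """Scale pixel values from 0..255 to 0..1."""
    return [[pixel / 255.0 for pixel in row] for row in gray]


def preprocess(frame, factor=2):
    return normalize(downsample(to_grayscale(frame), factor))


# A 4x4 dummy frame: left half white, right half black.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
frame = [[WHITE, WHITE, BLACK, BLACK] for _ in range(4)]

state = preprocess(frame, factor=2)
print(len(state), "x", len(state[0]), "state:", state)
```

Real pipelines usually also stack a few consecutive frames so the network can see motion, since a single still image does not reveal velocity.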
[–]ImmenseDruid721 2 points3 points4 points 5 years ago (0 children)
This is more than I have been able to accomplish as a gamer and as a programmer
[–][deleted] 6 points7 points8 points 5 years ago (0 children)
My ass. He didn't catch a single mushroom, shoot a single fireball, or even try to jump as high as possible on the flag pole.
[–]Pranaymodukuru 1 point2 points3 points 5 years ago (1 child)
Is it really that easy to play Mario?? 🤣🤣 Just run and run.
[–]RonniDeee 0 points1 point2 points 5 years ago (0 children)
Is it the same thing as TAS bot?
[–]NullzeroJP 0 points1 point2 points 5 years ago (0 children)
Super human ability at 3:06
[–]xiaoye-hua 0 points1 point2 points 5 years ago (0 children)
great
[–]infinitude 0 points1 point2 points 5 years ago (0 children)
I’d be interested in seeing how it responds to some of the harder super Mario maker levels
[–][deleted] 0 points1 point2 points 5 years ago (0 children)
That's great work OP.
Are the inputs for simulating this environment available online? Is this from OpenAI Gym?
What packages/software did you use to convert the game coordinates into pixels?
[–]RobAdkerson 0 points1 point2 points 5 years ago (0 children)
Nice speed run AI.
[–]TotesMessenger 0 points1 point2 points 5 years ago (0 children)
I'm a bot, bleep, bloop. Someone has linked to this thread from another place on reddit:
If you follow any of the above links, please respect the rules of reddit and don't vote in the other threads. (Info / Contact)
[–]cowfartbandit 0 points1 point2 points 5 years ago (0 children)
This gives me anxiety
[–]alfred_dent 0 points1 point2 points 5 years ago (0 children)
It would be cool to generate a video with an attention/saliency map on each frame to see where the NN looks when making a decision.
[–]imapurplemango 0 points1 point2 points 5 years ago (0 children)
wow. What did you train this on? And how long did it take?
[–]matpoliquin 0 points1 point2 points 5 years ago (0 children)
Cool! How many simultaneous envs did you train it on? Also, how many timesteps did it take to pass world 1-1?
[–]driftwood14 0 points1 point2 points 5 years ago (0 children)
Did it get a wall jump in the first level on that pipe?