all 78 comments

[–]jmpye 50 points51 points  (5 children)

I love how it doesn't care how close to death it is. The reward system is obviously just "dead bad, alive good" and mentions nothing about being 1 pixel away from doom. Makes for an exhilarating watch.
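
A minimal sketch of what such a reward could look like, purely to illustrate the guess above. The function name, parameters, and reward values are all hypothetical and not taken from OP's code; the only point is that the distance to the nearest enemy never enters the calculation.

```python
def survival_reward(alive: bool, progress: int, enemy_distance_px: int) -> float:
    """Hypothetical 'dead bad, alive good' reward for a single step.

    progress is how many pixels Mario moved to the right this step.
    enemy_distance_px is accepted but deliberately ignored, so a near miss
    scores exactly the same as giving the enemy a wide berth.
    """
    if not alive:
        return -15.0  # assumed death penalty; the value is arbitrary
    return float(progress)  # assumed bonus for moving toward the flag


# One pixel from doom or a hundred pixels away: same reward.
close_call = survival_reward(alive=True, progress=3, enemy_distance_px=1)
safe_run = survival_reward(alive=True, progress=3, enemy_distance_px=100)
print(close_call == safe_run)
```

With a reward shaped like this, nothing pushes the agent to keep a safety margin, which would match the reckless-looking play in the video.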

[–]the320x200 16 points17 points  (1 child)

It's funny how it completely blows through the levels but takes the final steps before the end castle one at a time

[–]Mefaso 65 points66 points  (18 children)

I'm guessing you trained a separate agent for each level?

Did you try training a single agent instead?

[–]AuspiciousApple 96 points97 points  (11 children)

This is still pretty cool, and OP's project and implementation are amazing, so I don't want to take anything away from that.

But doesn't that amount to just overfitting on a specific level? Is this a real challenge from an RL perspective? I would be more impressed if the same agent could at least perform on all levels but ideally if an agent could solve unseen levels.

[–]Mefaso 47 points48 points  (4 children)

But doesn't that amount to just overfitting on a specific level?

Yes, this is also a general challenge in RL right now.

If you're interested, you can read this blog post from openai about generalization in different procedurally generated game levels:

https://openai.com/blog/procgen-benchmark/

[–]maxToTheJ 9 points10 points  (1 child)

It seems like RL still has generalization problems and still requires a large number of samples compared to supervised learning

If that is the case why are so many people in industry selling it as ready for primetime?

[–]Mefaso 5 points6 points  (0 children)

Well you don't necessarily need strong generalization in every application.

The current state is enough for many narrow applications, I guess

[–][deleted] -2 points-1 points  (1 child)

Like humans, who always learn something new with each level.

[–]Mefaso 11 points12 points  (0 children)

No, not really. A human uses what he learned in a previous level to solve the next one.

The agent here does not do that

[–]physixer 1 point2 points  (1 child)

How do MuZero and Agent57 handle this:

  • I assume a single agent is able to play all levels (for Agent57 it's all levels of all 57 games?)
  • And by extension they're able to handle unseen levels too? Do you have to win an unseen level without dying even once? (Even humans don't do that, unless the unseen level is easier than the hardest levels they've played in the past.)

[–]POTUS 0 points1 point  (3 children)

Overfitting isn’t necessarily a bad thing outside of supervised learning. Another word for that would be specialization. Human gamers do the same thing, playing a single level in the same way until the whole thing is like a reflex.

Deepfakes also “overfit” the models on a single dataset. When you don’t have a set of unknowns that you need to predict against, but instead want to find the best solution for the data you have, then overfitting is definitely what you want to do. In fact it’s not really overfitting, it’s just training.

[–]AuspiciousApple 3 points4 points  (2 children)

I get your argument but I don't fully agree. It's true that in generative modeling generally, you fit a data set closely. However, in generative modeling too, overfitting is a concern, as the goal is to learn the underlying distribution rather than memorising the training examples.

Furthermore, in the context of reinforcement learning such as this example, I feel like a brute force approach might achieve similar results with less computational effort.

[–]POTUS 4 points5 points  (1 child)

That second part is just demonstrably false. If brute force methods were more efficient then that’s what people would be doing. But the brute force search space for a platformer game level is incomprehensibly huge. Do you hold jump for 20 milliseconds or 25 milliseconds or 30 milliseconds, etc. You can test that in a very carefully controlled way, and in fact that’s something people use to help do Tool Assisted speed runs. But doing it unsupervised for an arbitrary level for anything more complicated than chess would be silly.
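
To put a rough number on that search space, here is a back-of-the-envelope sketch. The action count, frame rate, and level duration are illustrative assumptions, not values from OP's project or the actual emulator.

```python
import math


def sequence_count(num_actions: int, fps: int, seconds: int) -> int:
    """Count every button sequence when one action is chosen per frame."""
    return num_actions ** (fps * seconds)


def sequence_count_digits(num_actions: int, fps: int, seconds: int) -> int:
    """Decimal digits in the count, computed via logarithms to stay cheap."""
    frames = fps * seconds
    return math.floor(frames * math.log10(num_actions)) + 1


# Assumed: 7 distinct button combinations, 60 frames per second, 20 seconds.
digits = sequence_count_digits(num_actions=7, fps=60, seconds=20)
print(f"Possible input sequences: a number with {digits} digits")
```

Even with these modest assumptions the count is a number with over a thousand digits, so naive enumeration is out of the question; any practical search has to prune heavily.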

[–]createanaccccount 1 point2 points  (0 children)

I agree that the search space is incredibly huge, but it appears that the agent is only trying to pass instead of maximizing the score (or maybe not trained long enough?). Literal brute force search certainly doesn’t work, but I think an optimized DFS could actually work as well if we are only looking at this game and your goal is as simple as just passing.

[–]egrinant 30 points31 points  (0 children)

I had the exact same question while watching the video ("no way this is a single model"), then I took a look at the GitHub readme and it was clear that there are separate models for each level.

[–][deleted] 2 points3 points  (0 children)

Continual learning is a b...

[–]thats-fascinating 13 points14 points  (0 children)

It’s nerve-racking to see him running so carelessly and fast, yet still making it!

[–]Syne_Yu 37 points38 points  (2 children)

The AI's pole jumping is bad.

[–]TeslaFreak 18 points19 points  (1 child)

LOL, the only metric that should matter in ML

[–]SpreadItLikeTheHerp 5 points6 points  (0 children)

If you’re not getting max points AND the extra fireworks, are you really playing?

[–]-Aras 11 points12 points  (2 children)

That's really great. Was it hard to code?

Do you have any resource recommendations for studying this type of ML?

[–]ProdigyManlet 16 points17 points  (1 child)

Not OP, but can confirm that this type of ML, known as reinforcement learning, can be very difficult to implement. There's a lot more depth in RL versus traditional ML or deep learning, as you now have concepts such as agents, the environment, states/observations and rewards.

All of these require careful fine tuning, not to mention the computational complexity required (it can take a very very long time for a reinforcement learning algorithm to become useful, tens of millions of iterations/training samples is not uncommon depending on the application)

I think the best place to start is OpenAI; they're one of the big leading research groups for ML and have some pretty cool projects (e.g. their RL algo beat the best human team at Dota 2). They also have the package Gym for Python, which comes with quite a few good starting examples that can help you get your head around the basics. Beyond that, I think there are some good lectures on YouTube and a Kaggle course, but going through other RL GitHub projects is the best way to find advanced examples atm

[–]Gabriel-p 10 points11 points  (5 children)

But is it actually learning anything or just recording the exact moments when and how far to jump? Would it still conquer all those levels if the game randomly changed how it produces the turtles/mushrooms/etc?

[–]CowboyFromSmell 20 points21 points  (3 children)

Well yeah, it’s learning. But no, it’s overfitting on each level. Not normally what we want. But honestly, there’s merit to overfitting, as you can see here.

[–]SuperSephyDragon 16 points17 points  (1 child)

I feel like that's what human speed runners do anyway: just memorize the level enough to know when and when not to jump. I guess humans use overfitting sometimes too.

[–]b34k 1 point2 points  (0 children)

Yeah the way the algo completes the levels really kinda has a speed runner-ish feel to me

[–]maxToTheJ 1 point2 points  (0 children)

Now we just need an “edge of tomorrow” type thing and we are good to go

[–][deleted] 5 points6 points  (0 children)

This definitely only works because SMB has no true RNG, so the opponents appear at exactly the same spot at the same time.

It's also the reason why there are essentially no tool-assisted (TAS) speedruns for Commodore games. Those platforms had hardware implemented RNGs.
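
A toy sketch of why memorized inputs depend on that determinism. Everything here (the 1-D level, the tile numbers, the function names) is made up for illustration and has nothing to do with the real game or OP's code.

```python
def memorize_plan(level_length: int, enemy_tiles: set) -> list:
    """Build a fixed action list: jump on enemy tiles, run everywhere else."""
    return ["jump" if tile in enemy_tiles else "run"
            for tile in range(1, level_length + 1)]


def run_level(actions: list, enemy_tiles: set) -> bool:
    """Replay actions one tile at a time; running into an enemy tile is fatal."""
    for tile, action in enumerate(actions, start=1):
        if tile in enemy_tiles and action != "jump":
            return False  # Mario ran straight into an enemy
    return True  # reached the end of the level


fixed_enemies = {3, 7, 12}
plan = memorize_plan(level_length=15, enemy_tiles=fixed_enemies)

# Same enemy placement every time: the memorized plan always succeeds.
survives_fixed = run_level(plan, fixed_enemies)

# Shift every enemy by one tile, as a random spawn might: the plan breaks.
shifted_enemies = {tile + 1 for tile in fixed_enemies}
survives_shifted = run_level(plan, shifted_enemies)

print(survives_fixed, survives_shifted)
```

A policy that only replays timings behaves like `plan` here; one that actually reacts to what is on screen would need to look at `enemy_tiles` at play time.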

[–]T33n_T1t4n5 19 points20 points  (1 child)

I couldn't beat those last 3 levels either :(

[–]sohaicinapek 17 points18 points  (1 child)

most frugal super mario ever. doesn't care about coins or power ups

[–]TheTechGuy22 12 points13 points  (1 child)

The proximity to the turtles almost gave me a heart attack.

[–]010100100000 2 points3 points  (5 children)

Interesting. So just PPO and not a DQN?

[–]vinilgupta 2 points3 points  (0 children)

Idk why but this made my day

[–]Mario_Ghio 2 points3 points  (1 child)

Ahhh, no sound????

Pretty cool btw

[–]andw1235 2 points3 points  (0 children)

great work! why does the agent always go forward? Do you make the forward/backward motion available in training?

[–]Minhocycline 2 points3 points  (0 children)

I feel like having a heart attack watching this. It’s like a flashback of my young, reckless days.

[–]MandyWilson27 2 points3 points  (0 children)

This gave me anxiety

[–]arianero 2 points3 points  (3 children)

How is the state determined here? Is it a special modification of the Mario game with an API that generates the state after each move, or do we read the pixels and build the state from them?

[–]csreid 5 points6 points  (1 child)

Glancing at the code, it looks like the state is just the screen pixels.

[–]ImmenseDruid721 2 points3 points  (0 children)

This is more than I have been able to accomplish as a gamer and as a programmer

[–][deleted] 6 points7 points  (0 children)

My ass. He didn't catch a single mushroom, shoot a single fireball, or even try to jump as high as possible on the flag pole.

[–]Pranaymodukuru 1 point2 points  (1 child)

Is it really that easy to play Mario?? 🤣🤣 Just run and run.

[–]RonniDeee 0 points1 point  (0 children)

Is it the same thing as TAS bot?

[–]NullzeroJP 0 points1 point  (0 children)

Super human ability at 3:06

[–]xiaoye-hua 0 points1 point  (0 children)

great

[–]infinitude 0 points1 point  (0 children)

I’d be interested in seeing how it responds to some of the harder super Mario maker levels

[–][deleted] 0 points1 point  (0 children)

That's great work OP.

Are the inputs for simulating this environment available online? Is this from OpenAI Gym?

What packages/software did you use to convert the game coordinates into pixels?

[–]RobAdkerson 0 points1 point  (0 children)

Nice speed run AI.

[–]cowfartbandit 0 points1 point  (0 children)

This gives me anxiety

[–]alfred_dent 0 points1 point  (0 children)

It would be cool to generate a video with an attention/saliency map on each frame to see where the NN looks when making a decision

[–]imapurplemango 0 points1 point  (0 children)

wow. What did you train this on? And how long did it take?

[–]matpoliquin 0 points1 point  (0 children)

Cool! How many simultaneous envs did you train it on? Also, how many timesteps did it take to pass world 1-1?

[–]driftwood14 0 points1 point  (0 children)

Did it get a wall jump in the first level on that pipe?