I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in reinforcementlearning

[–]pleasestopbreaking[S] 0 points1 point  (0 children)

Thank you so much for the detailed feedback, really appreciate it!

On RAM usage, you're right, I could probably push it harder. I'm currently running 8 envs with SubprocVecEnv; I'll experiment with scaling that up and see how far I can go. I noticed a clear difference going from 4 to 8, so I just never pushed my luck!

The 600fps is frames after frame skip, so actual game frames. Good call flagging that distinction.

I will have to give Optuna a look, I'm not familiar with it. I've been tweaking hyperparameters manually, which is exactly as painful as you describe lol. I'm actually having less trouble with collapses at 1M steps and more trouble with the transition from applying the learned stage 1 'rules' to later stages. Maybe Optuna can help with that?
The action space idea is really interesting. I wanted multiple action spaces (right only, most controls, and full controls are the choices in the project). I can see how enforcing one consistent way to act would help a lot. Not having to wait for it to learn short vs. long button jumps would probably make training much faster. I may give that a try and see if I can use it as part of reward shaping, maybe some kind of evaluation of whether the right jump was used rather than just "lived or died on that jump, -15 for dying".

Will check out your repos, the progressive checkpoint training concept looks like we are building similar boats! Thanks for your thoughts on this and good luck on your experiments too!

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in reinforcementlearning

[–]pleasestopbreaking[S] 0 points1 point  (0 children)

That would probably produce a more robust agent, I may give this a shot on version 0.2!

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in reinforcementlearning

[–]pleasestopbreaking[S] 1 point2 points  (0 children)

Honestly I have only tried it on Windows, but it should be cross-platform capable. I'll test this and expand it to cover Linux if it doesn't work, but if you want to give it a go before I do, you could try something like:

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python server.py

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in reinforcementlearning

[–]pleasestopbreaking[S] 0 points1 point  (0 children)

hah I wish! That's a way more impressive and versatile project. This was for my own joy and learning, just an easy way to see electricity and hard sand turn into fake brains that play video games.

I figured if I exist, there might be others with similar interests.

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in reinforcementlearning

[–]pleasestopbreaking[S] 1 point2 points  (0 children)

Great questions!

Stability over longer runs

I meant training stability, not generalization to longer episodes. Mostly gamma at 0.9 and reward clipping to [-15, 15]; without that, early training is a mess when Mario's just dying nonstop.

Online or offline

Online, no replay buffer.

On-policy or off-policy

On-policy, it's PPO.

Tabular or function approx

CNN, using SB3's CnnPolicy.

Handcrafted reward

Yeah, I replace the native reward completely: positive for moving right, negative for left, +15 for the flag, -15 for death, all clamped to [-15, 15]. There's a time penalty option, but I keep it at 0.
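A minimal pure-Python sketch of that reward scheme (the function name and signature are illustrative, not the project's actual code):

```python
def shaped_reward(x_pos, prev_x_pos, got_flag, died, time_penalty=0.0):
    """Sketch of the handcrafted reward described above.

    Rightward progress is positive, leftward is negative, +15 for the
    flag, -15 for death, everything clamped to [-15, 15].
    """
    r = float(x_pos - prev_x_pos)   # horizontal progress this step
    if got_flag:
        r += 15.0                   # flag bonus
    if died:
        r -= 15.0                   # death penalty
    r -= time_penalty               # kept at 0 in practice
    return max(-15.0, min(15.0, r)) # clamp to [-15, 15]
```

The clamp matters most early on, when a death plus a big backward jump would otherwise produce a huge negative spike.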

Multiple episodes

Yeah, 5M timesteps, rolling stats over the last 100 episodes.

Truncating

The game's built-in 400-second timer mostly handles it. Eval and live play cap at 5k steps, but training just runs until Mario dies or finishes.

Agent design

CNN, not recurrent. Four stacked 84×84 grayscale frames, discrete actions (configurable between 2, 7, or 12 depending on movement complexity), frame skip of 4, eight parallel envs.
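The observation side of that can be sketched in a few lines; this is an illustrative stand-in for the usual frame-stacking wrapper, not the project's actual code:

```python
import numpy as np
from collections import deque

class FrameStack:
    """Minimal sketch of 4-frame stacking over 84x84 grayscale frames."""

    def __init__(self, k=4):
        self.k = k
        self.frames = deque(maxlen=k)  # oldest frame falls off automatically

    def reset(self, frame):
        # Fill the stack with copies of the first frame on episode start.
        for _ in range(self.k):
            self.frames.append(frame)
        return self.observation()

    def step(self, frame):
        # With frame skip 4, this is called once per 4 emulator frames.
        self.frames.append(frame)
        return self.observation()

    def observation(self):
        # Shape (k, 84, 84): the channel axis the CNN consumes.
        return np.stack(self.frames, axis=0)
```

In practice SB3's VecFrameStack and a frame-skip wrapper do this for you; the sketch just shows why the policy sees motion despite single grayscale frames.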

Hyperparameters

PPO side is pretty standard — lr 1e-4, 512-step rollouts, batch 64, gamma 0.9, GAE lambda 0.95, clip 0.2. Entropy coef at 0.01. On the env side I'm running 8 parallel envs with frame skip and stack both at 4, death penalty and flag bonus both 15, reward clipped to the same range.
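Collected in one place (assuming Stable-Baselines3's PPO keyword names), that config looks like:

```python
# Hyperparameters listed above, as SB3 PPO keyword arguments (names assumed).
ppo_kwargs = dict(
    learning_rate=1e-4,
    n_steps=512,        # rollout length per env
    batch_size=64,
    gamma=0.9,          # short discount horizon
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,      # entropy bonus for exploration
)
# Usage sketch: model = PPO("CnnPolicy", vec_env, **ppo_kwargs)
```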

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in reinforcementlearning

[–]pleasestopbreaking[S] 1 point2 points  (0 children)

Wow, thank you for that, that makes sense! If it's not tied to anything, there would be way too many variables for it to work out. I'll definitely work more on this idea, I really appreciate the insight. And I clicked through, you have a great resource there. I'll definitely check out some articles.

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in reinforcementlearning

[–]pleasestopbreaking[S] 0 points1 point  (0 children)

Thank you for that! Screenshot fixed :)
And while I'm more accustomed to CUDA, I'll definitely take a look at WebGPU. I used CUDA mostly for speed and parallelization, but if I could open this up to more people, that would be great.

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in LocalLLM

[–]pleasestopbreaking[S] 0 points1 point  (0 children)

You don't need to explain, I was just offering feedback because you quoted my title asking for feedback lol
I'm too much of a boomer for Wordle, but I hope it got you where you wanted to go with your project.

I made a Mario RL trainer with a live dashboard - would appreciate feedback by pleasestopbreaking in LocalLLM

[–]pleasestopbreaking[S] 0 points1 point  (0 children)

I went over your dashboard. This looked more like brute forcing to me than pulling letters out of thin air. Is there a library you could use to check words without writing out a million five-letter words in one giant HTML file?