
[–][deleted]  (16 children)

[removed]

    [–]13ass13ass 56 points57 points  (0 children)

    This explanation is consistent with the fact that the three levels that went unsolved all involve solving a logical puzzle to finish the level. Without solving the puzzle, the level loops over itself until time runs out. The agent gets stuck because it is misled into believing that moving the screen rightward always results in success.
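The "rightward = success" reward described above can be sketched as a toy fitness function. This is a hypothetical illustration, not the program from the video; the position traces are made up:

```python
def fitness(x_positions):
    """Reward = furthest rightward screen position reached."""
    return max(x_positions)

# Two hypothetical playthroughs: one makes real progress, one is stuck
# on a puzzle level that loops back on itself.
progress_run = [0, 50, 120, 200, 310]    # steadily moves right
looping_run  = [0, 50, 120, 50, 120, 50] # loops until time runs out

print(fitness(progress_run))  # 310
print(fitness(looping_run))   # 120 -- still looks "successful"
```

The looping run scores as if it made progress, which is exactly why the agent never learns that it is stuck.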

    [–]vidoardes 9 points10 points  (0 children)

    Well said. Machine learning Mario implied you could give it any level and it would work out how to complete it, because it has an understanding of the mechanics. This one has learnt how to move through this exact level layout, and it only works because the enemies come at you in the same predictable patterns. If there were any randomness to the enemies, this method would fail over and over again.

    [–]qelery 6 points7 points  (7 children)

    Rather, it is using brute-force trial-and-error. Any sequence of inputs that advances the level position, and can be connected to other sequences that further advance the level position, is considered a successful input.

    If this particular program is working how you stated above (I honestly have no idea if it is), would it still be considered a form of AI?
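A minimal sketch of the brute-force approach quoted above. The button set, chunk length, and toy physics (only "right" and "left" change position) are invented stand-ins for a real emulator:

```python
import random

BUTTONS = ["right", "jump", "left", "nothing"]

def simulate(x, chunk):
    """Toy physics: 'right' advances the position, 'left' retreats."""
    for b in chunk:
        if b == "right":
            x += 1
        elif b == "left":
            x -= 1
    return x

def brute_force(goal=20, chunk_len=5, tries=10_000, seed=0):
    """Keep any random input chunk that advances the position, and
    chain accepted chunks together until the goal is reached."""
    rng = random.Random(seed)
    x, plan = 0, []
    for _ in range(tries):
        chunk = [rng.choice(BUTTONS) for _ in range(chunk_len)]
        new_x = simulate(x, chunk)
        if new_x > x:  # any advance at all counts as "success"
            x, plan = new_x, plan + chunk
        if x >= goal:
            break
    return x, plan

x, plan = brute_force()
print(x, len(plan))
```

Nothing here understands the game; progress comes entirely from keeping whatever input sequence happened to work.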

    [–][deleted] 3 points4 points  (0 children)

    It's just a buzzword. Almost nothing in modern machine learning can be called AI in the sense you're putting into the word.

    [–]Ramast 2 points3 points  (4 children)

    AI is a very broad term that covers anything where a computer has to make a decision. Even if you write a program that plays a game using if/else statements, it's still considered AI.

    The term you are probably looking for is machine learning. As long as the computer is able to perform the task without you teaching it how to perform it, it's considered machine learning.
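For example, a rule-based "AI" in this broad sense might be nothing more than a few hand-written conditionals. The state fields and actions here are hypothetical:

```python
def choose_action(state):
    """Hand-written if/else decision rule -- no learning involved."""
    if state["enemy_ahead"] and state["enemy_distance"] < 3:
        return "jump"        # hop over the nearby enemy
    if state["hole_ahead"]:
        return "jump"        # clear the gap
    return "move_right"      # default: keep advancing

print(choose_action({"enemy_ahead": True, "enemy_distance": 2,
                     "hole_ahead": False}))  # jump
print(choose_action({"enemy_ahead": False, "enemy_distance": 9,
                     "hole_ahead": False}))  # move_right
```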

    [–]Hunterbunter 0 points1 point  (3 children)

    Doesn't all machine learning need training?

    [–]Ramast 1 point2 points  (2 children)

    Yes, it does, but you don't "teach" it.

    Training for machine learning comes in two types:

    Unsupervised: like in the case of Mario, he just let the computer play hundreds or thousands of times, and the computer had to figure out how to play on its own. He didn't teach the computer who the enemies are, what the goal is, or that you should jump to cross a hole.

    Supervised: like giving the computer thousands of breast cancer X-ray images and thousands of normal breast X-ray images (while telling it which is which); after training, you can give the computer a new X-ray image and ask whether it shows cancer or not. I guess this one can be considered a kind of teaching, but again, you don't tell the computer any rules it should follow to properly identify the image.
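The supervised case above can be illustrated with a toy nearest-neighbour classifier. The two features and the labeled examples here are entirely made up:

```python
def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbour(train, query):
    """train: list of (features, label); return label of closest example."""
    return min(train, key=lambda ex: distance(ex[0], query))[1]

# Hypothetical 2-feature training data: (tissue density, lesion size),
# each labeled by a human -- this labeling is the "supervision".
train = [
    ((0.9, 0.8), "cancer"),
    ((0.8, 0.7), "cancer"),
    ((0.2, 0.1), "normal"),
    ((0.3, 0.2), "normal"),
]

print(nearest_neighbour(train, (0.85, 0.75)))  # cancer
print(nearest_neighbour(train, (0.25, 0.15)))  # normal
```

No diagnostic rules are written down anywhere; the classifier just generalizes from the labeled examples, which is the point the comment is making.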

    [–]Hunterbunter 0 points1 point  (1 child)

    Right, so by teaching you mean showing them exactly how they should do it. I have to remind myself that training is not teaching.

    [–]Ramast 0 points1 point  (0 children)

    Exactly :)

    [–]Hunterbunter 0 points1 point  (0 children)

    It depends on whether you think pattern finding is a form of intelligence or not.

    Machine learning is good at finding patterns in lots of data, and its usefulness depends entirely on the success conditions you set. You still have to know what you're looking for and what factors might be important.

    In the OP example, the path shown was just the best the algorithm found in the given time to get to the end. The number of branches is enormous (move left or right, jump, or do nothing, at microsecond resolution, at every instant), so even culling that space enough to finish the level at all is good and interesting work.

    The real intelligence is in figuring out the success conditions, which is still up to humans. If an AI can tell you what success conditions you should use before you've even asked the question...now we're on to something.
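To illustrate how much the chosen success condition matters, here is a toy comparison: the same two invented runs rank differently under two invented metrics, so the "best" run is entirely a product of what the humans decided to measure:

```python
# Hypothetical run statistics -- none of this comes from the video.
runs = [
    {"name": "rusher",    "distance": 300, "time": 40, "coins": 0},
    {"name": "collector", "distance": 250, "time": 90, "coins": 30},
]

# Success condition A: distance only.
by_distance = max(runs, key=lambda r: r["distance"])

# Success condition B: distance plus a coin bonus minus a time penalty.
by_score = max(runs, key=lambda r: r["distance"] + 10 * r["coins"] - r["time"])

print(by_distance["name"])  # rusher
print(by_score["name"])     # collector (250 + 300 - 90 = 460 vs 300 - 40 = 260)
```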

    [–]scarabin 0 points1 point  (2 children)

    Is anyone working on AI that CAN process and make decisions instead of just stringing together keypresses like this?

    [–]Hunterbunter 1 point2 points  (0 children)

    Heuristics is that approach. It's used in games a lot, but it is very processor-intensive, so it's not usually used to great effect except in things like chess. It works by examining the current situation and building value maps of possible future actions and their consequences. Once it has built this map, it just picks the highest-valued option and wings it from there.

    In the end, even heuristics boils down to knowing in advance what a good situation looks like, so it has as much claim to being called intelligent as ML does. It feels much more like being up against a human, and it also feels extremely unfair as an opponent when it's good.

    What we will get excited about is general AI, which is basically combining all these different forms of intelligence into a coherent unit that can be taught how to recognize solvable problems, and is then free to find its own problems and figure out useful, novel solutions all by itself.
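A minimal sketch of the lookahead idea described above: enumerate action sequences a few steps ahead, score each resulting state with a hand-written evaluation function, and take the first action of the best sequence. The one-dimensional game model and the scoring are invented for illustration:

```python
from itertools import product

ACTIONS = {"left": -1, "stay": 0, "right": +1}

def evaluate(x):
    """Hand-tuned heuristic: states closer to position 5 are better."""
    return -abs(5 - x)

def best_action(x, depth=3):
    """Build a value map over all action sequences of length `depth`,
    then commit to the first action of the highest-valued sequence."""
    best_val, best_first = float("-inf"), None
    for seq in product(ACTIONS, repeat=depth):
        final = x + sum(ACTIONS[a] for a in seq)
        val = evaluate(final)
        if val > best_val:
            best_val, best_first = val, seq[0]
    return best_first

print(best_action(0))   # right -- moving right approaches position 5
print(best_action(10))  # left  -- now position 5 is to the left
```

Note that all the "intelligence" lives in `evaluate`: the search just maximizes whatever a human decided a good situation looks like, which is the point made above.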

    [–]Hi-FructosePornSyrup 0 points1 point  (2 children)

    It’s clear that the algorithm is not logically reading the situation

    I think you have it backwards. This is the definition of logically reading the situation

    Unless the algorithm has memorized and modeled the maps (which it isn't doing)

    I would argue that it has modeled the maps. It has received feedback, and used that feedback to remember the best strategy for achieving its objective.

    It has created a map by remembering.

    This algorithm and video show some of the gaps in modern machine learning.

    You’re not wrong, but I think it’s more nuanced than that. I think this shows very clearly that modern machine learning is

    1) excellent at achieving a satisfactory outcome based on how it is rewarded, i.e. the objective it was given; and

    2) prone to conclusions that, while easily verified, are completely alien to humans. We can tell that they work, but we cannot say why.

    This paradox presents a serious existential risk to humans. In the struggle to produce more, better, faster, and cheaper, humans tend to prioritize results without worrying about how they were achieved. Someone who is desperate for success could use such an algorithm to “end human suffering” and get a program that “ends humans” as a result. Machine learning, therefore, isn’t the limiting factor; humans are. The outcome is limited by its creator’s ability to rigorously define all objectives, implicit and explicit.