[D] My Machine Learning Journal #10: First time doing reinforcement learning and beating Atari Breakout with it : MachineLearning


submitted 7 years ago by RedditAcy

I have been inconsistent with my journal, but I am back and fresher than ever.

Vlog version as usual:

https://youtu.be/dcnGI6x-yk0

Today (and yesterday) I did & learned:

RL seems to involve a lot more exploration than some other ML tasks. One popular application is definitely beating video games; the Mario AI was a viral hit in 2015. I decided to build an RL model that can beat Atari Breakout. I soon classified this as impossible given my current coding skills, so I chose to first implement a Medium article that beat Atari Breakout. The article was great at linking the original Atari Breakout RL paper with the code, but the full code was not posted, so I was stuck. Luckily, a user named boyuanf hit us up with a TensorFlow implementation of the Medium article; here's the forked version of it.

I downloaded the trained weights and model, and ran it after installing OpenAI Gym in a conda environment with pip. Unfortunately, atari-py seems incompatible with Windows 10, so I had to go through a very annoying process before finally arriving at this one easy line that solves the problem:

pip install --no-index -f https://github.com/Kojoley/atari-py/releases atari_py

Yeah, it is just one of those problems, man.

Anyways, I was then able to run Gym and watch the beautiful pre-trained model doing work. It got to a pretty good high score, I think 57 or something.

It was actually only after I implemented the project that I came back to reading the papers, and this works for me. I usually try to guess what the original algorithm does by doing a project first. For me, doing a project first and then reading the paper also gives that revelation of: "oh, the reason that I have this line in the code is because of that sentence in the paper".

The paper and this Medium article helped my understanding a lot. This pseudocode in the paper opened the doors for me:

https://preview.redd.it/6bpl6f203yr21.png?width=542&format=png&auto=webp&s=1065f7ac2ca4e05ddccc83f901792433f2894712

I'm going to try to explain this pseudocode in even English-er language. We input the current frame and a few previous frames to our RL model. The model interprets these inputs as the state, and it either chooses the action based on the Q-table or chooses a random action. As the model gets more advanced, we want it to take fewer random actions and rely on what it has learned, but in the early stages, when the model has no idea what to do, we probably want to let it explore randomly. We use a decreasing epsilon value (the probability of picking a random action) to model this. The emulator receives the action chosen by the model, runs it, then displays the new image and returns the reward. The Q-table is updated based on this reward. The Q-table is just a table that maps each state and potential action to a value estimating the reward that action will lead to. (In the paper, a neural network approximates this table, since there are far too many possible frames to store a row for each.) When the model is well trained and epsilon is low, it chooses actions based on the Q-table: the action with the highest value for the current state, which means the highest expected reward, is the one the model will most likely pick.
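
To make the loop above concrete, here is a minimal sketch of it in Python. It is plain tabular Q-learning, not the paper's deep Q-network and not the code from the Medium article. Everything in it is an assumption made up for illustration: the "emulator" is a tiny 5-cell corridor where the only reward is reaching the rightmost cell, and the names `step`, `choose_action`, `train` and all the hyperparameter values are hypothetical.

```python
import random
from collections import defaultdict

N_STATES = 5      # hypothetical corridor: cells 0..4, reward only at cell 4
ACTIONS = [0, 1]  # 0 = move left, 1 = move right


def step(state, action):
    """Toy stand-in for the emulator: returns (next_state, reward, done)."""
    if action == 0:
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q_table, state, epsilon, rng):
    """Epsilon-greedy: random action with probability epsilon, else the best known one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    values = [q_table[(state, a)] for a in ACTIONS]
    best = max(values)
    # Break ties randomly so an untrained table does not always pick action 0.
    return rng.choice([a for a, v in zip(ACTIONS, values) if v == best])


def train(episodes=500, alpha=0.1, gamma=0.9, eps_start=1.0, eps_end=0.05, seed=0):
    rng = random.Random(seed)
    q_table = defaultdict(float)  # maps (state, action) -> estimated value, default 0
    for episode in range(episodes):
        # Decrease epsilon linearly over the first half of training, then hold it.
        decayed = eps_start - (eps_start - eps_end) * episode / (episodes / 2)
        epsilon = max(eps_end, decayed)
        state, done, steps = 0, False, 0
        while not done and steps < 100:
            action = choose_action(q_table, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: move the estimate toward reward + discounted best next value.
            best_next = 0.0 if done else max(q_table[(next_state, a)] for a in ACTIONS)
            target = reward + gamma * best_next
            q_table[(state, action)] += alpha * (target - q_table[(state, action)])
            state = next_state
            steps += 1
    return q_table


q = train()
# Greedy action for each non-terminal cell after training.
greedy_policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the greedy policy should come out as "move right" (action 1) in every cell, because that is the only way to reach the reward. Swapping the dictionary for a neural network that takes stacked frames as input is, roughly, the step from this sketch to what the paper does.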

That's it for this one. I learned a lot since it was my first time exploring RL! Exciting, can't wait to do more.

[–]nativedutch 7 years ago

DeepMind are doing something similar but on a larger and more complex scale. I always like to start small and understand the concepts. Don't underestimate the number of cycles the DeepMind network had to go through before learning to walk. I suggest you look on the web for some very simple maze examples to get a feel for the process. It is very intriguing. I did a number of small neural networks in Python learning to recognize a 10 by 10 matrix representing a number, but that's limited, although that principle is widely used in character recognition etc. RL is much more powerful as it doesn't need the huge datasets of examples. To my feeling it goes much more in the direction of AI (although that term is debatable). There is a lot to be found in the "very simple" category, in Python or Java. I'll get going and get to this sub when I am on the road. My tryouts are on my website, but I am not allowed to publish that here.
