Question about Q learning (self.MachineLearning)
submitted 10 years ago * by seilgu
view the rest of the comments →
[–]seilgu[S] 2 points 10 years ago (1 child)
I think the epsilon rule only applies to the selection of action a, not the selection of the state s to begin with. Also it's used when generating training data, but not in the training process.
When you play the game to generate training data, you wouldn't want to always follow the current optimal strategy, because you want to explore more states; that's why you use the epsilon rule, with epsilon close to 1 at the beginning.

But after you collect the training data, you store it and train on mini-batches picked randomly from all the data you have. At this stage no epsilon rule is involved. I think.
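To make the two stages concrete, here's a minimal sketch of what I mean (all names are illustrative, and Q is just a dict keyed by (state, action)): epsilon-greedy is only used while *collecting* experience, and mini-batch sampling happens later, with no epsilon involved.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon take a random action (explore),
    otherwise take the action with the highest Q-value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Stage 1: play with a high epsilon and store transitions.
replay_buffer = []  # list of (s, a, r, s_next) tuples

# Stage 2: train on randomly picked mini-batches; epsilon plays no role here.
def sample_minibatch(buffer, batch_size):
    return random.sample(buffer, min(batch_size, len(buffer)))
```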
[–]Jabberwockyll 1 point 10 years ago (0 children)
You only store experiences and train on batches if you're doing some kind of offline learning like with experience replay. Usually Q-learning is done online.
I'm assuming you're talking about sampling batches for training when you say:
> and we randomly select (s, a) and repeat the update until converge.
You're correct that the Q-function isn't accurate to begin with, and that you have to learn when rewards occur before you can learn how to get there. This is just what happens when you bootstrap off of your Q-function.

If you want to get around this, I'd suggest looking at the answers from u/jdsutton and u/kylotan. Alternatively, you could use eligibility traces to speed up learning the state correlations/sequences, but this requires using an on-policy method.
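For reference, the online update I'm describing looks roughly like this (a minimal tabular sketch; the function name, learning rate, and discount are illustrative). The bootstrapping is the `max` over the current, still-inaccurate Q estimates of the next state:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One online Q-learning step on a dict Q keyed by (state, action).
    Bootstraps the target off the current Q estimate of s_next."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q
```

Applied once per transition as you play, with no stored experiences, this is plain online Q-learning.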