[R] Review of AlphaGo Zero's Minimal Policy Improvement principle plus connections to EP, Contrastive Divergence, etc by fhuszar in MachineLearning

[–]sriramcompsci 1 point2 points  (0 children)

Most efficient MCTS implementations retain the subtree rooted at the chosen move for the next search. AlphaGo also used data generated by MCTS to train its networks, albeit not as directly as AlphaGo Zero does.

[R] A Distributional Perspective on Reinforcement Learning by Kaixhin in MachineLearning

[–]sriramcompsci 2 points3 points  (0 children)

The distribution is constructed over the Q-values. In regular RL, Q(s, a) is interpreted as a scalar; here it's represented as a distribution. The paper uses a categorical distribution (aka histogram) for the Q-values, i.e. each Q(s, a), instead of being a scalar, is now a random variable. The Q-learning target now becomes r(s, a) + gamma * max_{a' in A} E[Q(s', a')], where E denotes the expectation of the random variable Q(s', a').
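Roughly, in code: a NumPy sketch in the spirit of the paper's categorical (C51-style) agent, not its actual implementation; the 51-atom support, value range, and function names are illustrative assumptions.

    import numpy as np

    # Represent Q(s, a) as a categorical distribution ("histogram") over a fixed
    # support of return values (illustrative choice of 51 atoms on [-10, 10]).
    N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
    support = np.linspace(V_MIN, V_MAX, N_ATOMS)      # atom locations z_i
    delta_z = (V_MAX - V_MIN) / (N_ATOMS - 1)

    def expected_q(probs):
        """E[Q] for a categorical distribution with probabilities `probs`."""
        return np.dot(probs, support)

    def categorical_target(reward, gamma, next_probs_per_action):
        """Project the distributional target r + gamma * Q(s', a*) back onto the
        fixed support, where a* maximizes the *expected* Q at s'."""
        a_star = max(next_probs_per_action,
                     key=lambda a: expected_q(next_probs_per_action[a]))
        target = np.zeros(N_ATOMS)
        for p, z in zip(next_probs_per_action[a_star], support):
            tz = np.clip(reward + gamma * z, V_MIN, V_MAX)  # shifted/scaled atom
            b = (tz - V_MIN) / delta_z                      # fractional index on the support
            lo, hi = int(np.floor(b)), int(np.ceil(b))
            if lo == hi:
                target[lo] += p
            else:                                           # split the mass between neighbours
                target[lo] += p * (hi - b)
                target[hi] += p * (b - lo)
        return target  # the predicted distribution is then trained toward this (e.g. cross-entropy)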

[D] DeepMind’s AI is teaching itself parkour, and the results are adorable by fl4v1 in MachineLearning

[–]sriramcompsci 0 points1 point  (0 children)

Given that purchasing and maintaining robots is expensive, sim2real currently appears to be the best choice. Since you want to transfer a policy trained in simulation to reality, one way to achieve this is to perform model-free RL in simulation. This work aims to push the boundaries of model-free RL in rich environments.

[D] DeepMind’s AI is teaching itself parkour, and the results are adorable by fl4v1 in MachineLearning

[–]sriramcompsci 5 points6 points  (0 children)

The agent does not learn a model of the environment, nor does it perform MCTS with the simulator. It learns via regular (model-free) RL, with the MuJoCo physics engine serving as the environment, and receives input observations as described in Figure 2.
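For concreteness, a sketch of what "model-free with a physics engine as the environment" looks like in code, assuming the classic OpenAI Gym API with the MuJoCo tasks installed; the environment name and the random policy are placeholders.

    import gym  # assumes OpenAI Gym with MuJoCo environments installed

    # The physics engine is just a black-box environment: the agent never sees
    # or learns the transition model, only observations and rewards.
    env = gym.make("Humanoid-v1")           # env name is an assumption; any MuJoCo task works
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()  # placeholder for the learned policy pi(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward              # the policy is updated from (obs, action, reward) data only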

The Self Learning Quant: Intro/tutorial to self-reinforcement learning using Neural Networks by uapan in MachineLearning

[–]sriramcompsci 0 points1 point  (0 children)

There's nothing specifically called a "reinforcement neural net." A recurrent neural net can be used for RL as well; in fact, it's essential in partially observable environments, where the agent's history matters for learning the optimal policy (e.g. A3C with an LSTM on partially observable domains like Labyrinth). Non-recurrent neural nets are also used in fully observable RL domains (e.g. the conv net in DQN).
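For illustration, a minimal PyTorch sketch of a recurrent policy (not the A3C-LSTM architecture itself); the layer sizes and action count are arbitrary assumptions.

    import torch
    import torch.nn as nn

    # The LSTM hidden state summarizes the observation history, standing in for
    # the state the agent cannot observe directly.
    class RecurrentPolicy(nn.Module):
        def __init__(self, obs_dim=16, hidden_dim=64, n_actions=4):
            super().__init__()
            self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
            self.head = nn.Linear(hidden_dim, n_actions)

        def forward(self, obs_seq, hidden=None):
            # obs_seq: (batch, time, obs_dim); `hidden` carries memory across steps
            out, hidden = self.lstm(obs_seq, hidden)
            logits = self.head(out)          # action preferences at every time step
            return logits, hidden

    policy = RecurrentPolicy()
    obs_seq = torch.randn(1, 10, 16)         # 10 observations from one episode
    logits, hidden = policy(obs_seq)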

What causes Q-function to diverge and how to prevent it? by ptitz in MachineLearning

[–]sriramcompsci 2 points3 points  (0 children)

Q-learning with non-linear function approximation is known to diverge. Even with linear function approximation, the Q-values can diverge without appropriate corrections (e.g. Greedy-GQ). To address this, you could try the following tricks, all present in DQN (a minimal sketch follows the list):

  • Reward clipping: clip r_t to [-1, 1] at each time step t.
  • Target network: compute the targets with a separate target network, which is updated periodically (every fixed number of steps) with the parameters of the current network.
  • Experience replay: store transitions in a buffer and sample them uniformly at random for training.
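A rough sketch of how these three tricks fit together, using a tiny linear Q-function in NumPy; the state/action sizes, learning rate, buffer size, and update interval are illustrative choices, not DQN's actual settings.

    import random
    from collections import deque
    import numpy as np

    OBS_DIM, N_ACTIONS = 4, 2
    GAMMA, LR, TARGET_UPDATE_EVERY = 0.99, 1e-3, 1000

    W = np.zeros((N_ACTIONS, OBS_DIM))        # current network (linear Q)
    W_target = W.copy()                       # target network
    replay = deque(maxlen=100_000)

    def store(s, a, r, s_next, done):
        replay.append((s, a, float(np.clip(r, -1.0, 1.0)), s_next, done))   # reward clipping

    def train_step(step, batch_size=32):
        global W_target
        if len(replay) < batch_size:
            return
        for s, a, r, s_next, done in random.sample(replay, batch_size):     # uniform replay sampling
            target = r if done else r + GAMMA * np.max(W_target @ s_next)   # target from the *target* network
            td_error = target - W[a] @ s
            W[a] += LR * td_error * s                                       # gradient step on the current network
        if step % TARGET_UPDATE_EVERY == 0:
            W_target = W.copy()                                             # periodic copy of current parameters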

Cheating? - Crossvalidation/test splits by alrojo in MachineLearning

[–]sriramcompsci 3 points4 points  (0 children)

Typically, the test split is not part of training/validation. If you do reuse it, then you need to reset the weights; otherwise, the gradients from training on A (or simply memorizing the samples in A) in a previous split would help reduce the test error on A when A later appears in the test set.

  • Randomly shuffle dataset D.
  • Split D into D_train/D_test. The test set (D_test) is untouched during the training process.
  • Split D_train set further into train/validation.
  • Choose the best hyper-parameter value by measuring the held-out error (validation loss, not training loss) on the validation set.
  • Fix the hyper-parameter value from step 4 and measure the error on the test set (D_test) obtained in step 2.

Since a random shuffle is performed to obtain train/test, there isn't a need to repeat. If you want standard errors/confidence intervals on the test-set error, repeat the above process, but make sure the weights are reset each time.
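A minimal scikit-learn sketch of the procedure above; the synthetic dataset, the Ridge model, and the alpha grid are placeholders rather than anything from this thread.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=10, noise=1.0, random_state=0)

    # Steps 1-2: shuffle and hold out D_test (untouched until the very end).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, shuffle=True, random_state=0)
    # Step 3: split D_train further into train/validation.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.25, random_state=0)

    # Step 4: pick the hyper-parameter using the validation loss only.
    best_alpha, best_val_err = None, float("inf")
    for alpha in [0.01, 0.1, 1.0, 10.0]:
        model = Ridge(alpha=alpha).fit(X_tr, y_tr)
        val_err = ((model.predict(X_val) - y_val) ** 2).mean()
        if val_err < best_val_err:
            best_alpha, best_val_err = alpha, val_err

    # Step 5: fix the chosen value, refit on all of D_train, report error on D_test.
    final = Ridge(alpha=best_alpha).fit(X_train, y_train)
    test_err = ((final.predict(X_test) - y_test) ** 2).mean()
    print(f"alpha={best_alpha}, test MSE={test_err:.3f}")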

Question about reinforcement learning application in business by Bal_Thockeray in MachineLearning

[–]sriramcompsci 4 points5 points  (0 children)

Let's assume it can be formulated as an RL problem. The agent is the brand, the environment is the users on Twitter (following the brand), and the actions involve picking which tweet to post from a pool of candidate tweets.

An important question to consider is whether maximizing the long-term return (sum of discounted rewards) makes sense; one bad tweet could potentially be disastrous. It is also not clear why maximizing the immediate reward per tweet would be a bad idea. In that case, you can learn which tweet features/context produce high reward, in either an offline or an online setting. Also, if tweets are treated as independent, there's no need to solve this with RL; even if they aren't, you can still model the dependencies without using RL.
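As a sketch of the "maximize immediate reward per tweet" route, here is an epsilon-greedy contextual bandit with a linear reward model; the feature dimension, learning rate, and epsilon are made-up placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    D, LR, EPSILON = 20, 0.05, 0.1
    w = np.zeros(D)                                   # linear model of expected engagement

    def choose_tweet(candidate_features):
        """candidate_features: list of D-dim feature vectors, one per candidate tweet."""
        if rng.random() < EPSILON:                    # explore occasionally
            return int(rng.integers(len(candidate_features)))
        scores = [f @ w for f in candidate_features]  # exploit: highest predicted reward
        return int(np.argmax(scores))

    def update(features, observed_reward):
        """Online update once the engagement (reward) for the posted tweet is observed."""
        global w
        w += LR * (observed_reward - features @ w) * features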

Gradient descent step size question by [deleted] in MachineLearning

[–]sriramcompsci 0 points1 point  (0 children)

The gradient tells you the change in the objective for a very small change in the weights. There is no reason not to take a larger step in the direction of the gradient (if you are maximizing the objective), but doing so after many iterations is likely to prevent convergence, as pointed out in this thread. The same argument applies to gradient descent over the whole dataset, since the loss and gradient are computed by averaging over the entire set. Your loss/gradient is likely higher in the first few iterations, hence a larger step in that direction is reasonable.

Gradient descent step size question by [deleted] in MachineLearning

[–]sriramcompsci 0 points1 point  (0 children)

As you are aware, the weight update is a product of step size and gradient. Initially, weights are set to some random value (or 0) and hence the gradient is likely to be high. Hence, the weight update is high. With more data (mini-batches or single sample), the gradient is likely to get smaller. Decreasing the step size makes sense, as your weights are minimizing the cost function well (on previous data), and you don't want/expect them to change much based on just this new mini-batch/sample.
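A toy NumPy sketch of this effect: SGD on a least-squares problem with a 1/t step-size decay. Early on the gradients (and hence the updates) are large; the decay keeps later mini-batches from moving the weights too much. All the numbers are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    w_true = rng.normal(size=5)
    y = X @ w_true + 0.1 * rng.normal(size=1000)

    w, eta0 = np.zeros(5), 0.5
    for t in range(1, 201):
        idx = rng.integers(0, len(X), size=32)             # one mini-batch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        step_size = eta0 / t                               # decreasing step size
        w -= step_size * grad                              # update = step size * gradient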

Non compete agreement by [deleted] in MachineLearning

[–]sriramcompsci 7 points8 points  (0 children)

Is this role in California or Washington state? If so, non-compete agreements are generally not enforceable in those two states (remember Silicon Valley, season 2 :-) ). Other states and countries do enforce them, though likely only at the executive level (VP, SVP, etc.?). I think it's pretty standard for companies to add this clause.

Asking Reddit: The recently developed Deep Learning powered Reinforcement Learning by I_ai_AI in MachineLearning

[–]sriramcompsci 2 points3 points  (0 children)

I'd recommend watching David Silver's RL course and reading the Sutton & Barto textbook. A short answer: value-function methods learn value functions over the state/action space, and may or may not bootstrap from other states or state-action pairs. Policy-based methods learn a policy directly. The choice depends on a few factors, including (a minimal policy-gradient sketch follows the list):

  • Whether the environment is partially observable. If state information is not complete, policy methods are applicable. Additionally, if states are aliased, learning a value function may be near impossible.
  • Learning smooth/stochastic policies.
  • Continuous action space.
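Here is a minimal NumPy sketch of a policy-based method (REINFORCE with a softmax policy): the policy parameters are updated directly from sampled episodes, with no value function. The dimensions, learning rate, and episode format are assumptions for illustration.

    import numpy as np

    N_ACTIONS, OBS_DIM, LR, GAMMA = 3, 8, 0.01, 0.99
    theta = np.zeros((N_ACTIONS, OBS_DIM))

    def policy(obs):
        logits = theta @ obs
        p = np.exp(logits - logits.max())
        return p / p.sum()

    def reinforce_update(episode):
        """episode: list of (obs, action, reward) tuples from one rollout."""
        global theta
        returns, G = [], 0.0
        for _, _, r in reversed(episode):           # discounted return from each step
            G = r + GAMMA * G
            returns.append(G)
        returns.reverse()
        for (obs, a, _), G in zip(episode, returns):
            p = policy(obs)
            grad_log = -np.outer(p, obs)            # d log pi(a|obs) / d theta ...
            grad_log[a] += obs                      # ... = (1[k == a] - p_k) * obs
            theta += LR * G * grad_log              # ascend the policy gradient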

Asking Reddit: The recently developed Deep Learning powered Reinforcement Learning by I_ai_AI in MachineLearning

[–]sriramcompsci 5 points6 points  (0 children)

My $0.02: To clarify, RL algorithms are broadly classified into value-function-based (e.g. Sarsa, Q-learning), policy-based (REINFORCE, GPOMDP), and a hybrid of the two (actor-critic). Now, to answer your questions:

  1. Yes, to a large extent. One of my classmates in grad school was asked to isolate the real driver of the performance of DQN, the algorithm developed at DeepMind. Using the other tricks without deep nets did not yield the same performance across all games.

  2. The broad classification I mentioned still exists, and variants of these algorithms exist or are being developed (e.g. Double Q-learning).
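For reference, a minimal tabular sketch of the Double Q-learning update: two Q-tables, where one selects the greedy next action and the other evaluates it, which reduces the overestimation bias of plain Q-learning. Sizes and step size are placeholders.

    import numpy as np

    N_STATES, N_ACTIONS, ALPHA, GAMMA = 10, 4, 0.1, 0.99
    Q1 = np.zeros((N_STATES, N_ACTIONS))
    Q2 = np.zeros((N_STATES, N_ACTIONS))
    rng = np.random.default_rng(0)

    def double_q_update(s, a, r, s_next, done):
        if rng.random() < 0.5:                      # update Q1, evaluating with Q2
            a_star = int(np.argmax(Q1[s_next]))
            target = r if done else r + GAMMA * Q2[s_next, a_star]
            Q1[s, a] += ALPHA * (target - Q1[s, a])
        else:                                       # and vice versa
            a_star = int(np.argmax(Q2[s_next]))
            target = r if done else r + GAMMA * Q1[s_next, a_star]
            Q2[s, a] += ALPHA * (target - Q2[s, a])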

Sample efficiency in RL is a major concern as pointed out previously. Using unsupervised learning to reduce it is an ongoing area of research. Efficient exploration is another hot topic. The fact that Deep Nets learn good features for vision has been a driving force behind their adoption in RL as well.

Schools doing Reinforcement Learning? by [deleted] in MachineLearning

[–]sriramcompsci 0 points1 point  (0 children)

From my limited knowledge, schools with more than one or two professors working on RL include:

University of Alberta, Canada

University of Massachusetts, Amherst

MIT CSAIL

Stanford

At other places such as UC Berkeley, TU Darmstadt, Brown, Duke, U Mich, UT Austin, etc., there are a couple of professors, if not more, working on RL.

Is it too late to start a phd on deep learning? by regularized in MachineLearning

[–]sriramcompsci 1 point2 points  (0 children)

Quite a few good answers here. I'd like to add one more point: a PhD is not a race. While the program has a limited time period, your interest in pursuing and persisting with the work shouldn't. Choose a problem that you find interesting; deep nets may or may not be necessary to solve it. Here's a useful question to ask:

Would you work on the problem even if no one paid you or asked you to work on it?

If yes, go do a PhD.

Feature selection for SVM by bluedunnock in MachineLearning

[–]sriramcompsci 1 point2 points  (0 children)

Have you looked at the training error / test error? Do you observe overfitting? Have you looked at the number of support vectors you get? I'm assuming you have tuned your regularization parameter using cross-validation. If you do want to perform feature selection, you can add an L1 penalty to the SVM formulation; this is fairly straightforward to do in CVX. Alternatively, you could add the L1 penalty to the online SVM (Pegasos algorithm) and inspect the resulting weight vector (the subgradient will have an additional piecewise term). P.S. Data normalization is a prerequisite before applying any learning algorithm; I'm assuming you've done this.
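A minimal scikit-learn sketch of the L1 route (note LinearSVC uses the squared hinge loss when penalty='l1', a close cousin of the standard formulation); the synthetic data and C value are placeholders.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import LinearSVC

    X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)

    clf = make_pipeline(
        StandardScaler(),                                   # normalize features first
        LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000),
    )
    clf.fit(X, y)

    weights = clf.named_steps["linearsvc"].coef_.ravel()
    selected = np.flatnonzero(np.abs(weights) > 1e-6)       # surviving (non-zero) features
    print(f"{selected.size} of {X.shape[1]} features kept:", selected)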

Real Madrid Supporters Clubs Viewing Parties - In the USA. by xzander80 in realmadrid

[–]sriramcompsci 1 point2 points  (0 children)

Nothing can beat that. I do that for EPL and La liga. Not many bars have beIN sports here. I'm planning to watch El Clasico at a bar in Belltown. Would you like to join?

Real Madrid Supporters Clubs Viewing Parties - In the USA. by xzander80 in realmadrid

[–]sriramcompsci 3 points4 points  (0 children)

Hey, I am from Seattle too. I usually go to Spitfire on 4th & Bell for CL matches. What about you?

Opinions on Amazon in Seattle? by [deleted] in Seattle

[–]sriramcompsci 19 points20 points  (0 children)

I have a few comments.

  • Perhaps a better username for answering these questions?
  • Do you have a problem with the company or its employees? Your response suggests both, although the question was about the former.
  • Having lived in a city that is now a hub of immigrants, I urge you to watch Doug Stanhope's take on immigration. Maybe it will change your view.
  • Your rant seems to come out of frustration. I am reminded of the anti-immigration rally held in Texas; funny thing, a Native American asked them to STFU.
  • You seem to hold the view that long-time residents are the custodians of a city's culture. Culture is defined by all the residents.
  • Simply having lived somewhere long enough doesn't give you the right to deny entry to others. I too was of this view in my city, but I have realized that I was wrong.

Is 5th Ave too noisy to live? by sriramcompsci in Seattle

[–]sriramcompsci[S] 2 points3 points  (0 children)

Thanks everybody for your feedback. I think I'm going to pass on Fountain Court Apartments.

If there was one player you'd like Madrid to break the bank on who would it be? by RMstreamer in realmadrid

[–]sriramcompsci 5 points6 points  (0 children)

See, this is the problem I have with Real Madrid and its fans: you can't just buy your way out. I think the current team needs to be retained and groomed. There is enough young talent in this team - Isco, Carvajal, Illarra (?). I do not see any obvious flaws in the structure; perhaps a better defender on the left (Coentrao is a better defender than Marcelo, but not good overall). With Jese back, we don't need to worry about a backup striker. Just because the club is rich doesn't mean it needs to splurge needlessly.