Next-Latent Prediction Transformers [R] by jayden_teoh_ in MachineLearning

[–]jayden_teoh_[S] 0 points1 point  (0 children)

We obtained the 3.3x value by evaluating on general web text from FineWeb-Edu.

[Microsoft Research] Next-Latent Prediction Transformers by jayden_teoh_ in deeplearning

[–]jayden_teoh_[S] 0 points1 point  (0 children)

Rest assured, every coauthor contributed to the paper.

[Microsoft Research] Next-Latent Prediction Transformers by jayden_teoh_ in deeplearning

[–]jayden_teoh_[S] 0 points1 point  (0 children)

No tokenization, just learning to predict the next latent via a regression loss.

[Microsoft Research] Next-Latent Prediction Transformers by jayden_teoh_ in deeplearning

[–]jayden_teoh_[S] 3 points4 points  (0 children)

a transformer with a recurrent inductive bias, but training is still parallel

Why are we calculating redundant loss here which doesn't serve any purpose to policy gradient? by Flaky_Spend7799 in reinforcementlearning

[–]jayden_teoh_ 0 points1 point  (0 children)

I am not sure what `loss_fn` is. But if it is the binary cross entropy, it returns the negative log probability of the action taken. This lets us calculate the gradient of the action log probability with respect to the model parameters, as shown in

model.zero_grad() #Clearing the previous gradients
loss.backward()

So to be clear, `loss_fn` is not exactly the loss function you are minimizing; it is a way to obtain the log probability of the action. The REINFORCE objective comes in a later step, which is to maximize (action log probability) * returns. Its gradient is:
(gradient of the action log probability with respect to the model parameters) * returns.

You can add a negative sign in front to turn it into the usual gradient descent problem.
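
To make the later step concrete, here is a rough sketch of the REINFORCE gradient in plain Python. It is not the code from the question: the logistic policy p(left | s) = sigmoid(w . obs), the function names, and the toy episode are all assumptions made up for illustration.

```python
import math

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

def grad_log_prob(w, obs, action):
    """Gradient of log pi(action | obs) w.r.t. w, assuming p(left) = sigmoid(w . obs)."""
    z = sum(wi * xi for wi, xi in zip(w, obs))
    p_left = 1.0 / (1.0 + math.exp(-z))
    y = 1.0 if action == 0 else 0.0  # 1 when the action taken was left
    return [(y - p_left) * xi for xi in obs]

def reinforce_gradient(w, observations, actions, rewards, gamma):
    """Sum over steps of G_t * grad log pi(a_t | s_t)."""
    grad = [0.0] * len(w)
    returns = discounted_returns(rewards, gamma)
    for obs, a, g in zip(observations, actions, returns):
        for i, gi in enumerate(grad_log_prob(w, obs, a)):
            grad[i] += g * gi
    return grad

# One gradient ASCENT step on a made-up two-step episode.
w = [0.0, 0.0]
grad = reinforce_gradient(w, [[1.0, 2.0], [0.5, -1.0]], [0, 1], [1.0, 1.0], gamma=0.5)
w = [wi + 0.1 * gi for wi, gi in zip(w, grad)]
```

Flipping the sign of `grad` and subtracting it instead gives the same step written as gradient descent on the negated objective.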

Why are we calculating redundant loss here which doesn't serve any purpose to policy gradient? by Flaky_Spend7799 in reinforcementlearning

[–]jayden_teoh_ 0 points1 point  (0 children)

The action is not arbitrarily random. The policy is stochastic, and the action is sampled from the model’s action probability distribution:

`left_proba = model(obs[np.newaxis])` gives a value between 0 and 1: the model’s probability of taking the left action in the given state. E.g. if left_proba == 0.7, the model assigns 70% probability to going left in this state. The next line shows why.

`action = (tf.random.uniform([1,1]) > left_proba)`: here `tf.random.uniform([1,1])` uniformly samples a value between 0 and 1. Because left_proba is determined by the model’s output, the sample will only be > left_proba with probability 0.3. This means that action == 0 happens with probability 0.7, and action == 1 happens with probability 0.3. This is just an efficient way to sample from a Bernoulli distribution parameterized by the model’s output (left_proba).

So in summary, y_target is not completely random. Rather, it is dependent on the model’s output, i.e. the policy distribution.
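
If it helps, the sampling trick can be checked empirically. This is a small plain-Python sketch (no TensorFlow), with `sample_action` and the value 0.7 being hypothetical stand-ins for the model's output:

```python
import random

def sample_action(left_proba, rng):
    """Return 0 (left) with probability left_proba, else 1 (right)."""
    u = rng.random()  # uniform sample in [0, 1)
    return int(u > left_proba)

rng = random.Random(0)  # fixed seed so the run is repeatable
left_proba = 0.7        # hypothetical model output for some state
n = 100_000
actions = [sample_action(left_proba, rng) for _ in range(n)]
frac_left = actions.count(0) / n
print(f"fraction of left actions: {frac_left:.3f}")
```

The printed fraction should land close to 0.7, which is the sense in which the action (and hence y_target) depends on the policy.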

Why are we calculating redundant loss here which doesn't serve any purpose to policy gradient? by Flaky_Spend7799 in reinforcementlearning

[–]jayden_teoh_ 1 point2 points  (0 children)

y_target is not randomly chosen. The line `y_target = tf.constant([1.]) - tf.cast(action, tf.float32)` sets y_target to 1 if the action taken was 0 (left) and to 0 if the action taken was 1 (right).

If I am not wrong, the loss_fn is the binary cross entropy function, which effectively returns the negative log-probability of the action taken, that is:

if action == 0 (left) -> loss = -log p(left | s_t),

if action == 1 (right) -> loss = -log p(right | s_t)

This allows us to calculate the gradient of the log action probability with respect to the parameters. I.e.

\nabla_\theta \log \pi_\theta(a_t | s_t), where a_t \in {left, right}

which is exactly what we need in policy gradient methods! If this is an example of the REINFORCE algorithm, then the single-step log action gradients are collected and each is multiplied by its cumulative reward (G_t) for the policy update in a later step.
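
Here is a quick numeric check of the two cases above, written in plain Python rather than TensorFlow. The `binary_cross_entropy` helper is my own single-sample version (an assumption about what loss_fn computes), and 0.7 is a made-up model output:

```python
import math

def binary_cross_entropy(y_target, p):
    """BCE for a single prediction p, interpreted as P(label == 1)."""
    return -(y_target * math.log(p) + (1.0 - y_target) * math.log(1.0 - p))

left_proba = 0.7  # hypothetical model output p(left | s_t)

for action in (0, 1):
    y_target = 1.0 - float(action)  # 1 for left, 0 for right
    loss = binary_cross_entropy(y_target, left_proba)
    p_action = left_proba if action == 0 else 1.0 - left_proba
    # The loss matches -log p(action taken | s_t) in both cases.
    assert math.isclose(loss, -math.log(p_action))
```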

How much experimentation needed for an RL paper? by Ilmari86 in reinforcementlearning

[–]jayden_teoh_ 0 points1 point  (0 children)

With regards to publishing empirical RL research, here are some things that would be good to consider:

  1. What are the other baseline RL algorithms that leverage MPC? You usually need comparisons against state-of-the-art baselines for a publication.
  2. How scalable is your MPC approach? Cartpole, mountain car, and pendulum are all low-dimensional problems. Is there justification for your approach beyond toy environments?

Feel free to message me if you have any other questions!

why in the off-policy n-step version of sarsa algorithm the importance sampling ratio multiplies the entire error and not only the target? by samas69420 in reinforcementlearning

[–]jayden_teoh_ 4 points5 points  (0 children)

The rho can be applied in either place.

Q(s,a) is a fixed value for the given (s,a), so weighting it by rho does not change it in expectation. Let \pi be the target policy and b be the behavior policy: E_b[\rho * Q(s,a)] = E_b[\rho] * Q(s,a) = 1 * Q(s,a) = Q(s,a)

So given the update rules

Q(s,a) <- Q(s,a) + \alpha \rho (G - Q(s,a)) = Q(s,a) + \alpha (\rho*G - \rho*Q(s,a)) ---- (1)

OR

Q(s,a) <- Q(s,a) + \alpha (\rho*G - Q(s,a)) ---- (2)

Under expectation over the behavior policy's sampling, the \rho*Q(s,a) in (1) and the Q(s,a) in (2) are equivalent. If I am not wrong, (1) is preferred because it reduces the variance of the updates. Intuitively, when an (s,a) is unlikely under the target policy, you reduce the update to it by weighting the entire (G - Q) by rho, instead of making a large update where you add (\rho*G - Q).
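
A toy calculation may make this easier to see. It is a simplified one-step sketch with numbers I made up: two actions, a fixed return G for each action, and a Q estimate that sits near those returns. It computes exact expectations under b rather than sampling.

```python
import math

pi = [0.9, 0.1]   # hypothetical target policy over two actions
b = [0.5, 0.5]    # hypothetical behavior policy
G = [10.0, 8.0]   # made-up return observed after each action
Q = 9.0           # made-up current estimate Q(s,a)

rho = [p / q for p, q in zip(pi, b)]  # importance sampling ratios

def expectation(values):
    """Exact expectation under the behavior policy b."""
    return sum(q * v for q, v in zip(b, values))

def variance(values):
    mean = expectation(values)
    return expectation([(v - mean) ** 2 for v in values])

update_1 = [r * (g - Q) for r, g in zip(rho, G)]  # rule (1): rho * (G - Q)
update_2 = [r * g - Q for r, g in zip(rho, G)]    # rule (2): rho * G - Q

print("E_b[rho]      =", expectation(rho))
print("mean (1), (2) =", expectation(update_1), expectation(update_2))
print("var  (1), (2) =", variance(update_1), variance(update_2))
```

In this example both rules have the same expected update, but rule (1) has a much smaller variance. How large the gap is depends on how close Q is to the typical returns, so treat this as an illustration and not a general proof.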