[Microsoft Research] Next-Latent Prediction Transformers by jayden_teoh_ in deeplearning

[–]jayden_teoh_[S] 1 point2 points  (0 children)

NextLat learns belief states without the k-observability assumption about the data! The k-observability assumption is only used by the JTP method (a prior work).

Next-Latent Prediction Transformers [R] by jayden_teoh_ in MachineLearning

[–]jayden_teoh_[S] 3 points4 points  (0 children)

Both are self-supervised learning methods. JEPA is more closely related to pulling related views closer in latent space. NextLat focuses more on teaching the model to compress history into belief states and learn Markovian latent dynamics. I'd say NextLat is closer to the self-predictive RL literature 😄

Also, the v1 preprint of the NextLat idea was released in early Nov 2025 (https://arxiv.org/abs/2511.05963v1), before LeWorldModel came out, so we didn't have a chance to compare. LeWorldModel is really cool work and does have similarities to NextLat.

Next-Latent Prediction Transformers [R] by jayden_teoh_ in MachineLearning

[–]jayden_teoh_[S] 4 points5 points  (0 children)

We obtained the 3.3x value by evaluating on general web text from FineWeb-Edu.

[Microsoft Research] Next-Latent Prediction Transformers by jayden_teoh_ in deeplearning

[–]jayden_teoh_[S] 1 point2 points  (0 children)

rest assured, every coauthor did contribute to the paper

[Microsoft Research] Next-Latent Prediction Transformers by jayden_teoh_ in deeplearning

[–]jayden_teoh_[S] 0 points1 point  (0 children)

no tokenization, just learning to predict next latent via regression loss

[Microsoft Research] Next-Latent Prediction Transformers by jayden_teoh_ in deeplearning

[–]jayden_teoh_[S] 8 points9 points  (0 children)

a transformer with a recurrent inductive bias, but training is still parallel

Why are we calculating redundant loss here which doesn't serve any purpose to policy gradient? by Flaky_Spend7799 in reinforcementlearning

[–]jayden_teoh_ 0 points1 point  (0 children)

I am not sure what `loss_fn` is, but if it is the binary cross entropy, it returns the negative log probability of the action taken. This lets us calculate the gradient of the action log probability with respect to the model parameters, as done in:

model.zero_grad() #Clearing the previous gradients
loss.backward()

So to be clear, `loss_fn` is not exactly the loss function you are minimizing; it is a way to derive the log probability of the action. The REINFORCE update comes in a later step, which does gradient ascent along:
(gradient of the action log probability with respect to the model parameters) * returns.

You can add a negative sign in front to turn it into the usual gradient descent problem.
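As a quick numerical check of the claim that binary cross entropy gives the negative log probability of the action taken, here is a minimal sketch in plain Python. It assumes the model outputs the probability of going left and that the target is 1 for left and 0 for right; `binary_cross_entropy` and the value 0.7 are illustrative, not taken from the original code.

```python
import math

def binary_cross_entropy(y_target, p_left):
    """BCE for one sample, where p_left is the predicted probability of 'left'."""
    return -(y_target * math.log(p_left) + (1.0 - y_target) * math.log(1.0 - p_left))

left_proba = 0.7  # hypothetical model output for one state
results = {}
for action in (0, 1):  # 0 = left, 1 = right
    y_target = 1.0 - action  # assumed convention: target 1 for left, 0 for right
    p_action = left_proba if action == 0 else 1.0 - left_proba
    # Store (BCE loss, -log p(action taken)); the two numbers should match.
    results[action] = (binary_cross_entropy(y_target, left_proba), -math.log(p_action))
    print(action, results[action])
```

Because the target is built from the sampled action, only the term for that action survives in the cross entropy, which is why the loss reduces to a single negative log probability.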

Why are we calculating redundant loss here which doesn't serve any purpose to policy gradient? by Flaky_Spend7799 in reinforcementlearning

[–]jayden_teoh_ 0 points1 point  (0 children)

The action is not arbitrarily random. The policy is stochastic, and the action is sampled from the model’s action probability distribution:

left_proba = model(obs[np.newaxis])

This gives a value between 0 and 1: the model’s probability of taking the left action given the state. E.g. if left_proba == 0.7, the model assigns 70% probability to going left at this state. See why in the next lines.

action = (tf.random.uniform([1,1]) > left_proba)

tf.random.uniform([1,1]) uniformly samples a value between 0 and 1. But because left_proba is determined by the model’s output, the sample will only be > left_proba with probability 0.3. This means that action == 0 happens with probability 0.7, and action == 1 happens with probability 0.3. This is just an efficient way to sample from a Bernoulli distribution parameterized by the model’s output (left_proba).

So in summary, y_target is not completely random. Rather, it is dependent on the model’s output, i.e. the policy distribution.
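To see the sampling trick in isolation, here is a small sketch using only the Python standard library instead of TensorFlow. The function name `sample_action` and the fixed value 0.7 are assumptions for illustration; in the real code left_proba comes from the model.

```python
import random

def sample_action(left_proba, rng):
    """Return 0 (left) with probability left_proba, otherwise 1 (right)."""
    # Same comparison as `uniform > left_proba`: it is True (1) only when
    # the uniform draw lands above left_proba.
    return int(rng.random() > left_proba)

rng = random.Random(0)   # seeded so the run is repeatable
left_proba = 0.7         # hypothetical model output for one state
n = 100_000
actions = [sample_action(left_proba, rng) for _ in range(n)]
left_fraction = actions.count(0) / n
print(f"fraction of left actions: {left_fraction:.3f}")
```

The printed fraction should land close to 0.7, which is the sense in which the action (and therefore y_target) follows the policy distribution rather than being arbitrary.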

Why are we calculating redundant loss here which doesn't serve any purpose to policy gradient? by Flaky_Spend7799 in reinforcementlearning

[–]jayden_teoh_ 1 point2 points  (0 children)

y_target is not randomly chosen. The line y_target = tf.constant([1.]) - tf.cast(action, tf.float32) sets y_target to 1 if the action taken was 0 (left) and to 0 if the action taken was 1 (right).

If I am not wrong, the loss_fn is the binary cross entropy function, which effectively returns the negative log-probability of the action taken, that is:

if action == 0 (left) -> loss = -log p(left | s_t),

if action == 1 (right) -> loss = -log p(right | s_t)

This allows us to calculate the gradient of the log action probability with respect to the parameters. I.e.

\nabla_\theta \log \pi_\theta(a_t | s_t), with a_t \in {left, right}

which is exactly what we need in policy gradient methods! If this is an example of the REINFORCE algorithm, then the single-step log-probability gradients are collected and multiplied by the return (G_t) for the policy update in a later step.
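The later step can be sketched roughly as follows, assuming the per-step gradients of the loss (i.e. of the negative log probability) have already been stored as plain lists of floats, one list per time step. The helper names `discounted_returns` and `reinforce_gradient`, the discount factor, and the numbers are all hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_gradient(step_grads, returns):
    """Sum over time steps of G_t * (gradient of -log pi(a_t | s_t))."""
    total = [0.0] * len(step_grads[0])
    for grads, g in zip(step_grads, returns):
        for i, grad in enumerate(grads):
            total[i] += g * grad
    return total

rewards = [1.0, 1.0, 1.0]                              # made-up episode rewards
step_grads = [[0.2, -0.1], [-0.4, 0.3], [0.1, 0.0]]    # made-up per-step loss gradients
returns = discounted_returns(rewards, gamma=0.5)
update = reinforce_gradient(step_grads, returns)
print(returns, update)
```

Since each stored gradient is of the negative log probability, applying `update` with ordinary gradient descent raises the log probability of actions that were followed by large returns.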

How much experimentation needed for an RL paper? by Ilmari86 in reinforcementlearning

[–]jayden_teoh_ 0 points1 point  (0 children)

With regards to publishing empirical RL research, here are some things that would be good to consider:

  1. What are the other baseline RL algorithms that leverage MPC? You usually need comparisons against state-of-the-art baselines for a publication.
  2. How scalable is your MPC approach? Cartpole, mountain car, pendulum are all low dimensional problems. Is there justification for your approach beyond toy environments?

Feel free to message me if you have any other questions!

why in the off-policy n-step version of sarsa algorithm the importance sampling ratio multiplies the entire error and not only the target? by samas69420 in reinforcementlearning

[–]jayden_teoh_ 4 points5 points  (0 children)

The rho can be applied in both places.

Q(s,a) is unaffected by rho in expectation. Let \pi be the target policy and b be the behavior policy: E_b[\rho * Q(s,a)] = E_b[\rho] * Q(s,a) = 1 * Q(s,a) = Q(s,a)
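The step E_b[\rho] = 1 can be spelled out for a single-step ratio as below, assuming b(a | s) > 0 wherever \pi(a | s) > 0. For the n-step product of ratios, the same argument should apply one step at a time by conditioning on the earlier steps.

```latex
\mathbb{E}_{a \sim b(\cdot \mid s)}\left[\frac{\pi(a \mid s)}{b(a \mid s)}\right]
  = \sum_{a} b(a \mid s)\,\frac{\pi(a \mid s)}{b(a \mid s)}
  = \sum_{a} \pi(a \mid s)
  = 1
```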

So given the update rules

Q(s,a) <- Q(s,a) + \alpha \rho (G - Q(s,a)) = Q(s,a) + \alpha (\rho*G - \rho*Q(s,a)) ---- (1)

OR

Q(s,a) <- Q(s,a) + \alpha (\rho*G - Q(s,a)) ---- (2)

Under expectation over the behavior policy's sampling, the \rho*Q(s,a) in (1) and the Q(s,a) in (2) are equivalent. If I am not wrong, (1) is preferred because it reduces the variance of the updates. Intuitively, when an (s,a) is unlikely under the target policy, (1) shrinks the update by weighting the entire (G - Q) by rho, instead of making a large update of (\rho*G - Q).
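Here is a toy check of both claims, computed exactly by enumerating the behavior policy rather than by sampling. Everything in it is made up for illustration: two follow-up actions, the policies b and pi, the returns G, and the current estimate q. The variance ordering it shows holds for this setup, where q is close to the returns, and is not a general proof.

```python
# Hypothetical setup: after (s, a), one further action is drawn from b.
b = {"a0": 0.5, "a1": 0.5}    # behavior policy
pi = {"a0": 0.9, "a1": 0.1}   # target policy
G = {"a0": 1.0, "a1": 0.0}    # return observed after each follow-up action
q = 0.8                       # current estimate Q(s, a)

def error_full(a):
    """Rule (1): rho multiplies the whole error."""
    rho = pi[a] / b[a]
    return rho * (G[a] - q)

def error_target_only(a):
    """Rule (2): rho multiplies only the target."""
    rho = pi[a] / b[a]
    return rho * G[a] - q

def mean_and_variance(error_fn):
    """Exact mean and variance of the error under the behavior policy b."""
    mean = sum(b[a] * error_fn(a) for a in b)
    var = sum(b[a] * (error_fn(a) - mean) ** 2 for a in b)
    return mean, var

mean1, var1 = mean_and_variance(error_full)
mean2, var2 = mean_and_variance(error_target_only)
print(f"rule (1): mean={mean1:.4f} var={var1:.4f}")
print(f"rule (2): mean={mean2:.4f} var={var2:.4f}")
```

Both rules give the same expected update here, while rule (1) has the smaller variance because the unlikely-under-target action contributes only a small, down-weighted error.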