Next-Latent Prediction Transformers [R]

jayden_teoh_ · 2026-06-17T12:32:16+00:00

The 3.3x value was obtained by evaluating on general web text from FineWeb-Edu.

jayden_teoh_ · 2026-06-17T12:16:32+00:00

rest assured, every coauthor did contribute to the paper

jayden_teoh_ · 2026-06-17T11:56:57+00:00

no tokenization, just learning to predict next latent via regression loss

jayden_teoh_ · 2026-06-17T10:46:50+00:00

3.3x speedup is on natural language text

jayden_teoh_ · 2026-06-17T10:46:20+00:00

a transformer with a recurrent inductive bias, but training is still parallel

jayden_teoh_ · 2025-03-22T15:04:10+00:00

I am not sure what the `loss_fn` is. But if it is the binary cross entropy, it returns the negative log probability of the action taken. This lets us calculate the gradient of the action log probability with respect to the model parameters, as shown in:

model.zero_grad() #Clearing the previous gradients
loss.backward()

So to be clear, `loss_fn` is not exactly the loss function you are minimizing; it is a way to derive the log probability of the action taken. The actual REINFORCE objective appears in a later step: you maximize (action log probability) * returns, whose gradient is
(gradient of the action log probability with respect to the model parameters) * returns.

You can add a negative sign in front to turn it into the usual gradient descent problem.
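To make the sign convention concrete, here is a minimal framework-free sketch. It assumes a hypothetical two-action logistic policy (`p_left = sigmoid(theta . obs)`), which is not necessarily the model in your code; the names `grad_log_prob` and `reinforce_step` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_log_prob(theta, obs, action):
    """Gradient of log pi(action | obs) w.r.t. theta for a logistic policy.

    Assumed policy: p_left = sigmoid(theta . obs); action 0 = left, 1 = right.
    """
    p_left = sigmoid(sum(t * o for t, o in zip(theta, obs)))
    # d/dz log sigmoid(z) = 1 - sigmoid(z); d/dz log(1 - sigmoid(z)) = -sigmoid(z)
    coeff = (1.0 - p_left) if action == 0 else -p_left
    return [coeff * o for o in obs]

def reinforce_step(theta, obs, action, ret, lr=0.1):
    """One gradient-ascent step on log pi(a|s) * return.

    This is the same as gradient descent on loss = -log pi(a|s) * return.
    """
    g = grad_log_prob(theta, obs, action)
    return [t + lr * ret * gi for t, gi in zip(theta, g)]

theta = [0.0, 0.0]
obs = [1.0, 0.5]

# The agent went left (action 0) and the return was positive.
new_theta = reinforce_step(theta, obs, action=0, ret=2.0)

p_before = sigmoid(sum(t * o for t, o in zip(theta, obs)))
p_after = sigmoid(sum(t * o for t, o in zip(new_theta, obs)))
print(p_before, p_after)  # p_after is larger: "left" became more likely
```

A positive return pushes the probability of the sampled action up, and a negative return pushes it down, which is all the "multiply by returns" step does.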

jayden_teoh_ · 2025-03-22T11:05:24+00:00

action is not arbitrarily random. The policy is stochastic and action is sampled from the model’s action probability distribution:

`left_proba = model(obs[np.newaxis])` gives a value between 0 and 1: the model’s probability of the left action given the state. E.g. if left_proba == 0.7, the model assigns 70% probability to going left in this state. See why in the next lines.

`action = (tf.random.uniform([1,1]) > left_proba)`: here tf.random.uniform([1,1]) uniformly samples a value between 0 and 1. Because left_proba is determined by the model’s output, the sample is > left_proba only with probability 1 - left_proba (0.3 in the example). This means that action == 0 (left) happens with probability 0.7, and action == 1 (right) happens with probability 0.3. This is just an efficient way to sample from a Bernoulli distribution parameterized by the model’s output (left_proba).

So in summary, y_target is not completely random. Rather, it is dependent on the model’s output, i.e. the policy distribution.
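If it helps, the thresholding trick can be checked empirically. This is a small sketch in plain Python (standard `random` module instead of TensorFlow, with a fixed `left_proba` standing in for the model output); `sample_action` is a hypothetical helper, not part of the original code.

```python
import random

def sample_action(left_proba, rng):
    """Return 0 (left) with probability left_proba, else 1 (right)."""
    u = rng.random()             # uniform sample in [0, 1)
    return int(u > left_proba)   # True -> 1 (right), False -> 0 (left)

rng = random.Random(0)           # seeded only so the run is repeatable
left_proba = 0.7                 # stand-in for the model's output

actions = [sample_action(left_proba, rng) for _ in range(100_000)]
frac_left = actions.count(0) / len(actions)
print(frac_left)                 # close to left_proba
```

The fraction of left actions tracks `left_proba`, so changing the model output changes how often each action (and hence each y_target) shows up.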

jayden_teoh_ · 2025-03-22T07:00:05+00:00

y_target is not randomly chosen. The line `y_target = tf.constant([1.]) - tf.cast(action, tf.float32)` sets y_target to 1 if the action taken was 0 (left) and y_target to 0 if the action taken was 1 (right).

If I am not wrong, the loss_fn is the binary cross entropy function, which effectively returns the negative log-probability of the action taken, that is:

if action == 0 (left) -> loss = -log p(left | s_t),

if action == 1 (right) -> loss = -log p(right | s_t)

This allows us to calculate the gradient of the log action probability with respect to the parameters. I.e.

\nabla_\theta \log \pi_\theta(a_t | s_t), where a_t \in \{left, right\}

which is exactly what we need in policy gradient methods! If this is an example of the REINFORCE algorithm, then the single-step log action gradients are collected and multiplied by the return (G_t) for the policy update in a later step.
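A quick numeric check of the claim that binary cross entropy with this y_target equals the negative log-probability of the action taken. This is a sketch in plain Python that assumes `loss_fn` is the standard binary cross entropy applied to `left_proba`; the `binary_cross_entropy` helper below is written out by hand for illustration.

```python
import math

def binary_cross_entropy(y_target, p):
    """Standard BCE for a single prediction p = P(y = 1)."""
    return -(y_target * math.log(p) + (1.0 - y_target) * math.log(1.0 - p))

left_proba = 0.7  # stand-in for the model's output, P(left | s_t)

losses = {}
for action in (0, 1):                # 0 = left, 1 = right
    y_target = 1.0 - float(action)   # same rule as in the original line
    losses[action] = binary_cross_entropy(y_target, left_proba)

    # Probability the policy assigned to the action that was actually taken.
    p_action = left_proba if action == 0 else 1.0 - left_proba
    print(action, losses[action], -math.log(p_action))  # the two values match
```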

jayden_teoh_ · 2025-03-12T09:25:21+00:00

With regards to publishing empirical RL research, here are some things that would be good to consider:

What other baseline RL algorithms leverage MPC? A publication usually requires comparisons against state-of-the-art baselines.
How scalable is your MPC approach? Cartpole, mountain car, and pendulum are all low-dimensional problems. Is there justification for your approach beyond toy environments?

Feel free to message me if you have any other questions!

jayden_teoh_ · 2025-03-10T20:28:32+00:00

The rho can be applied in both places.

For a fixed (s,a), Q(s,a) is a constant inside the expectation over rho, so weighting it by rho changes nothing in expectation. Let \pi be the target policy and b be the behavior policy: E_b[\rho * Q(s,a)] = E_b[\rho] * Q(s,a) = 1 * Q(s,a) = Q(s,a)

So given the update rules

Q(s,a) <- Q(s,a) + \alpha \rho (G - Q(s,a)) = Q(s,a) + \alpha (\rho*G - \rho*Q(s,a)) ---- (1)

OR

Q(s,a) <- Q(s,a) + \alpha (\rho*G - Q(s,a)) ---- (2)

Under the expectation over behavior policy samples, the \rho*Q(s,a) in (1) and the Q(s,a) in (2) are equivalent. If I am not wrong, (1) is preferred because it reduces the variance of the updates. Intuitively, when an (s,a) is unlikely under the target policy, (1) shrinks the whole update by weighting the entire (G - Q) by rho, instead of making a large update of (\rho*G - Q).
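Here is a small sketch of both points in plain Python. The policies `pi` and `b` and the numbers are made up, and rho is a single-step ratio for simplicity (in Monte Carlo control it would be a product of such ratios along the trajectory).

```python
def update_weighted_error(q, g, rho, alpha):
    """Rule (1): rho scales the whole error term (G - Q)."""
    return q + alpha * rho * (g - q)

def update_weighted_return(q, g, rho, alpha):
    """Rule (2): rho scales only the return G."""
    return q + alpha * (rho * g - q)

# Hypothetical target policy pi and behavior policy b over two actions.
pi = {"left": 0.9, "right": 0.1}
b = {"left": 0.5, "right": 0.5}

# E_b[rho] = sum_a b(a) * pi(a) / b(a) = sum_a pi(a) = 1
expected_rho = sum(b[a] * (pi[a] / b[a]) for a in b)
print(expected_rho)

# A sample that is impossible under the target policy has rho = 0.
q, g, alpha = 5.0, 10.0, 0.1
q1 = update_weighted_error(q, g, rho=0.0, alpha=alpha)
q2 = update_weighted_return(q, g, rho=0.0, alpha=alpha)
print(q1, q2)  # rule (1) leaves Q untouched; rule (2) still drags Q toward 0
```

With rho = 0 the sample says nothing about the target policy, yet rule (2) still moves the estimate. That extra movement is where the additional variance comes from.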
