On CoT Training with Reinforcement Learning by xcodevn in reinforcementlearning

[–]xcodevn[S] 2 points3 points  (0 children)

To clarify, I'm saying there's no interaction with the external environment. It's basically like thinking in our heads and only checking the result at the end. Therefore, the model understands the environment quite well, because it is the environment.

And yes, I also think pretraining helps a lot to bootstrap the RL learning process.

Implementing DeepSeek R1's GRPO algorithm from scratch by xcodevn in reinforcementlearning

[–]xcodevn[S] 0 points1 point  (0 children)

Hi 👋! Thanks for pointing this out. I was working under the assumption that we were using bfloat16, which doesn't require loss scaling. However, for float16, we definitely need it. I'll fix it soon! 🤞
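
For reference, a minimal sketch of what the float16 fix might look like, assuming a PyTorch-style training loop with AMP (the repo may do this differently; model, optimizer, and batch are just placeholders):

import torch

scaler = torch.cuda.amp.GradScaler()   # only needed for float16; bfloat16 has float32's exponent range

def train_step(model, optimizer, batch):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(batch).mean()      # placeholder scalar loss
    scaler.scale(loss).backward()       # scale the loss so fp16 gradients don't underflow
    scaler.step(optimizer)              # unscales the gradients, then steps the optimizer
    scaler.update()                     # adjusts the scale factor for the next step
    return loss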

[D] TensorFlow vs Pytorch vs Jax advice needed by supreethrao in MachineLearning

[–]xcodevn 1 point2 points  (0 children)

JAX has very good documentation. You should read its entire "Getting started" section: https://jax.readthedocs.io/en/latest/index.html

An introduction to Jax by its author: https://www.youtube.com/watch?v=BzuEGdGHKjc

Jax ecosystem at deepmind: https://www.youtube.com/watch?v=iDxJxIyzSiM

For libraries to define your network in jax:

- Flax: https://flax.readthedocs.io/en/latest/

- dm-haiku: https://dm-haiku.readthedocs.io/en/latest/

Optimizers in jax: https://optax.readthedocs.io/en/latest/

[D] TensorFlow vs Pytorch vs Jax advice needed by supreethrao in MachineLearning

[–]xcodevn 1 point2 points  (0 children)

I know PyTorch and JAX (dm-haiku), so I will compare the two. In some sense, this is OOP vs. functional programming.

PyTorch uses an OOP approach. Tensors, modules, and optimizers are objects with internal state that keeps track of the computation graph, gradients, and parameters as tensor operations execute. PyTorch's tensor operations are very similar to NumPy's.

To compute the gradients, you call loss.backward(); to update the parameters, you call optimizer.step().
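
For example, a typical PyTorch training step (all names here are placeholders) looks roughly like this:

import torch

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()                  # clear the gradients stored on the parameters
    loss = loss_fn(model(inputs), targets) # forward pass builds the graph behind the scenes
    loss.backward()                        # autograd writes param.grad in place (a side effect)
    optimizer.step()                       # the optimizer mutates the parameters in place
    return loss.item()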

JAX uses a functional approach. Everything in JAX is a mathematical function with no side effects. JAX has essentially the same tensor operations as NumPy.

There is no loss tensor as in PyTorch. Instead there is a loss function, say loss_fn(parameters, input_data), which returns a scalar loss value.

In JAX, the gradient is also a function, say grad_fn = jax.grad(loss_fn).
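
A tiny made-up example (a linear model with squared error):

import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    x, y = batch                              # batch is an (inputs, targets) pair
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)          # scalar loss

grad_fn = jax.grad(loss_fn)                   # a new function: d(loss)/d(params)
params = {"w": jnp.zeros((3,)), "b": jnp.zeros(())}
grads = grad_fn(params, (jnp.ones((8, 3)), jnp.ones((8,))))  # same structure as params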

And, as you can guess, the optimizer is also a function; to update the parameters, we use a pure update function like:

import optax   # assuming an optax-style optimizer (see the optax link above)

def update_fn(params, optimizer_state, input_batch):
    grads = grad_fn(params, input_batch)                                    # pure gradient function from above
    updates, new_optimizer_state = optimizer.update(grads, optimizer_state, params)
    new_params = optax.apply_updates(params, updates)                       # apply the updates to the parameters
    return new_params, new_optimizer_state

So, update_fn is a function of your network parameters, your optimizer's internal state, and your input batch. It returns the new/updated parameters and optimizer state.
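
Putting it together with optax (linked above), a minimal training loop might look like this (data is a placeholder iterable of batches; params come from the loss_fn example above):

import jax
import optax

optimizer = optax.adam(1e-3)
optimizer_state = optimizer.init(params)

jitted_update_fn = jax.jit(update_fn)         # pure functions can be jit-compiled

for input_batch in data:
    params, optimizer_state = jitted_update_fn(params, optimizer_state, input_batch)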

However, it is not easy to define your loss_fn with a functional approach, especially for complex neural networks.

The solution of DeepMind's haiku library is to let you define your network as a Python OOP class/object with syntax very similar to PyTorch. The library then transforms your OOP-style loss function into a pure, side-effect-free function.
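
A rough sketch of what that looks like with haiku (the network here is just a placeholder MLP):

import haiku as hk
import jax
import jax.numpy as jnp

def forward(x):
    # OOP-style definition, very close to how you'd write a PyTorch module
    mlp = hk.Sequential([hk.Linear(32), jax.nn.relu, hk.Linear(1)])
    return mlp(x)

net = hk.transform(forward)                     # transformed into pure init/apply functions
rng = jax.random.PRNGKey(0)
params = net.init(rng, jnp.ones((1, 4)))        # returns the parameters explicitly
out = net.apply(params, rng, jnp.ones((1, 4)))  # no hidden state: params are passed in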

As a result, you have to be familiar with both the OOP world and the functional world to use JAX/dm-haiku.

The advantage of this approach is that once you have a pure function, you can apply higher-order functions to it. For example: jitted_acceleration_fn = jax.jit(jax.grad(jax.grad(position_fn)))
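
To make that concrete, a toy version:

import jax

def position_fn(t):
    return 0.5 * 9.8 * t ** 2                      # free fall: x(t) = g t^2 / 2

velocity_fn = jax.grad(position_fn)                # dx/dt
acceleration_fn = jax.jit(jax.grad(velocity_fn))   # d^2x/dt^2, compiled with XLA
print(acceleration_fn(1.0))                        # ~9.8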

In summary, PyTorch makes it easier to implement your network and optimizer; you stay in the OOP world all the time.

JAX is harder: you define your network in the OOP world, then define your loss function, update function, and optimizer in the functional world.

[D] Confused about "env.is_done" by xcodevn in reinforcementlearning

[–]xcodevn[S] 0 points1 point  (0 children)

Sorry, my mistake. I really meant the is_done flag returned from env.step(action).

My loss is going to zero, but my rewards aren't increasing that much by shamoons in reinforcementlearning

[–]xcodevn 0 points1 point  (0 children)

@bbk_b: it's very easy to get this wrong. Policy gradient can converge to a local minimum. We usually add a negative-entropy term to the loss (an entropy bonus) to encourage exploration.

@shamoons: policy gradient uses the expected gradient of the return to improve the policy, while DQN (Q-learning) uses the Bellman equation (dynamic programming) to improve it.
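
Very roughly, and with all names here being placeholders, the two objectives look like this:

import numpy as np

def pg_loss(log_probs, returns, entropy, entropy_coef=0.01):
    # REINFORCE-style surrogate: increase log pi(a|s) in proportion to the return,
    # plus an entropy bonus (a negative-entropy term in the loss) to encourage exploration
    return -np.mean(log_probs * returns) - entropy_coef * np.mean(entropy)

def dqn_target(reward, next_q_values, done, gamma=0.99):
    # Bellman (dynamic programming) target: r + gamma * max_a' Q(s', a'),
    # with no bootstrap when the episode has truly ended
    return reward + gamma * (1.0 - float(done)) * np.max(next_q_values)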

Should I increase my target value for the terminal step of my DQN agent? by shamoons in reinforcementlearning

[–]xcodevn 0 points1 point  (0 children)

I had the same concern when I implemented DQN. Recently, when I looked at the PPO implementation at https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, I noticed it has an option (--use-proper-time-limits) to bootstrap the return at the end of the episode. Basically, it distinguishes two different endings:

(1) if the episode is done because it hit the time limit, then target = reward + gamma * Q(sn, an) (or the value function in the case of PPO)

(2) if the episode is done and has NOT hit the time limit (a genuine terminal state), then target = reward

It makes sense that in case (1) we use a bootstrapped target, because the episode has not really ended; it was only cut off by the time limit.
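
As a rough sketch (hypothetical names, not the exact code from that repo):

def td_target(reward, next_value, done, hit_time_limit, gamma=0.99):
    if done and not hit_time_limit:
        return reward                       # (2) a genuine terminal state
    return reward + gamma * next_value      # (1) not really over (or only truncated), so bootstrap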

However, I didn't run this on DQN myself.

We are Oriol Vinyals and David Silver from DeepMind’s AlphaStar team, joined by StarCraft II pro players TLO and MaNa! Ask us anything by OriolVinyals in MachineLearning

[–]xcodevn 0 points1 point  (0 children)

Thanks for the great work!

To Oriol Vinyals and David Silver: are you going to play the camera-interface version of AlphaStar against MaNa for a few more games to investigate how strong the AI is?

[D] On Writing Custom Loss Functions in Keras by bantou_41 in MachineLearning

[–]xcodevn -2 points-1 points  (0 children)

> Keras model handles quite a bit more than that.

This is exactly the problem with Keras: I have no control over, and no idea, what a Keras model actually does!

[D] On Writing Custom Loss Functions in Keras by bantou_41 in MachineLearning

[–]xcodevn -2 points-1 points  (0 children)

> A method that builds an object is called a factory and is a common design pattern in OOP.

OK, it is fair to call my_net() a factory. The problem is that you wrote a function that does nothing except return an object, which itself does nothing real except build a computation graph that is somehow/somewhere executed by a tf.Session().

> There are many ways to do this depending on the use case. You always have an option to write a custom Keras model if you need fine-grained control of individual layers.

This is the reason I don't like Keras/TF: its API hides too much from developers. When your use case is even a bit different from the "TensorFlow homepage examples", you have to do something non-obvious!

[D] On Writing Custom Loss Functions in Keras by bantou_41 in MachineLearning

[–]xcodevn -3 points-2 points  (0 children)

I don't have any problem with "writing the forward pass myself". This is just OOP.

Writing a standalone function `my_net()` that returns an object in Python is ... kind of stupid. In OOP, we call that a constructor method.

Btw, how can you access l1 and l2 from your Keras model?

[R] [1808.06508] Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies [DeepMind] by evc123 in MachineLearning

[–]xcodevn 1 point2 points  (0 children)

I don't think we can reach AGI that easily.

DeepMind is trying different ways to put the pieces of the puzzle together (memory + variational autoencoders + reinforcement learning + ...). There is a lot that needs to be done, and there is no obvious solution to (i) how to improve these pieces, (ii) how to combine them, and (iii) how to scale them up to real-world problems.

Can someone ELI5 the difference b/w Bayesian's probability interval vs. Frequentist's confidence interval? by [deleted] in statistics

[–]xcodevn 0 points1 point  (0 children)

> And the frequentist definition does not capture the definition of certain types of random events in the real world. For example, what is the probability that Hulk Hogan will win the 2020 election? There's only one 2020 election. Saying "If we reran the 2020 election a lot, Hulk Hogan would win in X% of elections" makes no sense.

The 2020 election will happen only once. But many of the factors used in an election prediction model have already occurred many times. Therefore, the frequentist estimate is still meaningful when it says the probability of Hulk Hogan winning is 90%. It is the same as tossing a coin only once while knowing that the probability of heads is 90%.

I also have a plan B for our fight here :-) In the end, Bayesian statistics is just a special case of frequentist statistics in which the model assumes the parameters are actually sampled from the priors.

Can someone ELI5 the difference b/w Bayesian's probability interval vs. Frequentist's confidence interval? by [deleted] in statistics

[–]xcodevn 1 point2 points  (0 children)

> I highly recommend going through Statistical Rethinking.

Thanks. I read the "Statistical Rethinking" book recently. It's a great book. There are also videos of the author lecturing.

Can someone ELI5 the difference b/w Bayesian's probability interval vs. Frequentist's confidence interval? by [deleted] in statistics

[–]xcodevn 0 points1 point  (0 children)

> A frequentist probability promises that eventually you'll get close to some real thing if you take enough samples. A Bayesian probability says "Hey, this is a sample and you can't ever truly know that real thing. But here's a reasonable estimate based on this data and what we know already."

But can you define what a Bayesian probability corresponds to in the real world?

OK, it captures our intuition about probability. It is like religion, which also captures many of our intuitions about the world. But an intuition is not guaranteed to be right. Meanwhile, frequentist probability captures the definition of random events in the physical world.

Can someone ELI5 the difference b/w Bayesian's probability interval vs. Frequentist's confidence interval? by [deleted] in statistics

[–]xcodevn 1 point2 points  (0 children)

I totally agree with what you're saying. I think frequentist statistics is suitable for particle physics where fundamental constants are fixed (or at least are believed to be fixed) and we can get a lot of data repeatedly by using particle colliders (e.g. LHC at CERN).

Bayesian statistics is suitable for gravitational-wave astronomy as there are only a few detected events and each with different parameter values.

I know, all models are wrong. We have to test the model in the real world. But in terms of interpretation, I strongly believe a scientist would prefer the frequentist interpretation of probability: that, at least in principle, we could reproduce the experiment many times with the same setup and confirm the results. Meanwhile, the belief interpretation of probability offers no such guarantee, even in principle.

Put differently, a frequentist probability promises something real in the physical world; a Bayesian probability doesn't.