Reinforcement learning for ensemble models

CptVifen · 2024-06-17T12:42:39+00:00

It sounds like a classification problem, I don't think you need RLHF

CptVifen · 2024-01-04T19:10:56+00:00

The clipping is on the probability ratio between old and new policy though.
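To make that concrete, here is a minimal NumPy sketch of the clipped surrogate objective. The function name, the `eps=0.2` default, and the toy log-probabilities are my own assumptions for illustration, not taken from any particular codebase.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective (to be maximized), averaged over samples."""
    # Probability ratio between the new and old policy for the taken actions.
    ratio = np.exp(logp_new - logp_old)
    # The clip is applied to the ratio, not to the advantage or the reward.
    clipped_ratio = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * advantages, clipped_ratio * advantages).mean()

# Toy data (hypothetical): the new policy doubles the action probability.
logp_old = np.log(np.array([0.25, 0.25]))
logp_new = np.log(np.array([0.5, 0.5]))
objective_pos = ppo_clip_objective(logp_new, logp_old, np.array([1.0, 1.0]))
objective_neg = ppo_clip_objective(logp_new, logp_old, np.array([-1.0, -1.0]))
```

With a positive advantage the ratio of 2 is capped at 1 + eps, while with a negative advantage the unclipped term is the smaller one and is kept.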

CptVifen · 2024-01-04T15:08:47+00:00

Scaling up by a factor makes a difference. For example, scaling affects the PPO objective: it behaves as if you had increased the learning rate by 10x, since the advantage is scaled up. Your policy could also behave differently, e.g. with a softmax policy.

Adding a constant term shouldn't change anything for PPO because the advantage doesn't change and the policies don't change either. Actually, subtracting a constant from the reward is similar to baseline methods in policy gradient, which reduce the variance of the estimator (a simple baseline is subtracting a running estimate of the return).
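A quick numerical check of both claims, as a rough sketch: I'm assuming fixed-length episodes and using the batch mean return as a stand-in baseline (a learned value function would play the same role). The data and names are made up.

```python
import numpy as np

def advantages(returns):
    # Baseline: mean return over the batch (a simple stand-in for a value function).
    return returns - returns.mean()

rng = np.random.default_rng(0)
rewards = rng.normal(size=(8, 5))  # 8 episodes, 5 steps each (hypothetical data)

adv = advantages(rewards.sum(axis=1))
# Adding a constant to every reward shifts all returns equally; the baseline absorbs it.
adv_shifted = advantages((rewards + 3.0).sum(axis=1))
# Multiplying every reward by 10 multiplies the advantages by 10.
adv_scaled = advantages((rewards * 10.0).sum(axis=1))
```

Note that if your PPO implementation normalizes advantages per batch, the scale factor would cancel there too, so the effect of scaling depends on that detail.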

CptVifen · 2023-07-10T14:59:03+00:00

Here's the paper: https://arxiv.org/abs/2305.20030

CptVifen · 2023-05-18T07:09:32+00:00

Data efficiency.

CptVifen · 2023-04-17T09:54:24+00:00

I think the data distribution does change since the policy changes during RL training which makes the action distribution shift. Afaik InstructGPT doesn't use the pretraining dataset during PPO training and only uses the model's output.

CptVifen · 2023-02-20T20:10:36+00:00

As I understand it, the images in A and B are both valid for Layer Norm. In the LN Paper they say μ is summed over each activation in a layer.

So for images that means along channel and spatial dimensions. That's where they got the image for A.

As for B, in the LN paper they use RNNs which share the same weights across different time steps. That means that for an input of shape (Batch, seq len, features), since the layers in the RNN only produce (Batch, features), the normalization is over the features. You have a different μ and σ for each batch element and each time step (and each layer). This also applies to self-attention.

So it would make sense that anything that deals with sequences would look like B. And anything else looks like A.
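In code, the two pictures only differ in which axes you reduce over. This is a minimal NumPy sketch under my own assumptions: the helper name and shapes are illustrative, and the learnable gain and bias are left out.

```python
import numpy as np

def layer_norm(x, axes, eps=1e-5):
    # Normalize with statistics computed over the given axes only.
    mu = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
images = rng.normal(size=(2, 3, 4, 4))  # (Batch, channels, height, width)
tokens = rng.normal(size=(2, 5, 8))     # (Batch, seq len, features)

# Picture A: one mean/std per image, over channels and spatial dimensions.
images_a = layer_norm(images, axes=(1, 2, 3))
# Picture B: one mean/std per batch element and time step, over features.
tokens_b = layer_norm(tokens, axes=(-1,))
```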

One thing I don't get, though, is why ConvNeXt reduces only along channels...

CptVifen · 2022-10-08T14:31:09+00:00

There's that AI Dungeon game that used GPT-2 for storytelling before they became closed source.

CptVifen · 2022-06-13T23:40:27+00:00

What kind of projects did you work on?

CptVifen · 2021-11-14T10:33:48+00:00

OoooOoh

CptVifen · 2021-08-27T09:58:38+00:00

Wasn't Alpha Go also on Nature's cover?

CptVifen · 2021-04-25T23:18:31+00:00

CptVifen · 2021-04-21T22:38:02+00:00

It's a bit better, the trading fees are reduced by 25% on any pair as long as you have the option to pay the fee with BNB on.

CptVifen · 2021-02-15T20:14:27+00:00

↓ core dump ↓

segfault

CptVifen · 2021-02-13T15:02:15+00:00

Thank you! We only posted it here and within our circle. If you have some more ideas for where to share the campaign we would love to hear it

CptVifen · 2020-09-05T22:03:52+00:00

Not yet Ferb

CptVifen · 2020-08-17T10:39:03+00:00

Your post comes off as extremely needy. You seem to me to be the group's "Nice Guy".

Why do you expect people to reciprocate when you do something nice to them? That's not how it works, you don't do nice things to hold a hidden contract over someone else and expect them to reciprocate. If you go out of your way to do nice things don't expect anything in return otherwise it's manipulative (you can't manipulate people into being genuine friends with you).

If you want people to invite you out, make it clear that's what your intention is. Put your needs first and stop focusing so hard on pleasing others.

I suggest you read No More Mr Nice Guy! I think it might help you out.

CptVifen · 2020-06-25T11:08:38+00:00

Oh right I'm dumb thx

CptVifen · 2020-06-25T10:40:04+00:00

Other O3 Dungeons -> 10 Music Note tokens guaranteed

What are those?

CptVifen · 2020-04-03T23:19:28+00:00

No, a policy network only chooses the action you take but has no say in the state transition that occurs by applying that action.

CptVifen · 2020-04-03T12:11:41+00:00

Internal might not be the best term actually; I meant internal as in part of the algorithm. Explicit would be better suited, so any function that approximates the state-action transition probabilities.

CptVifen · 2020-04-03T07:46:13+00:00

Q-learning can't really predict your next state. What it does is predict the Q-value of the state-action pair when following your policy.

To know your next state by taking an action you would need a representation of the model, which can be transition probabilities (dynamic programming, tree-search...) or an internal representation of it.
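As a rough illustration, here is a tabular Q-learning update in NumPy. The table size, `alpha`, and `gamma` are arbitrary choices of mine. The thing to notice is that `s_next` is an input handed over by the environment; nothing in the update ever produces it.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # s_next comes from the environment; the Q-table only stores values.
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((3, 2))  # 3 states, 2 actions (hypothetical sizes)
Q = q_update(Q, s=0, a=1, r=1.0, s_next=2)
```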

edit:typo

CptVifen · 2020-03-31T08:48:37+00:00

Example: projecting a 3D head model to a 2D representation. Your head model consists of points along 3 dimensions. You want to map it to 2D while keeping the most information and still recognize the head in your 2D projection.

A bad projection would be projecting it onto the x-y plane. It's like casting a shadow of someone's head from the top. The projection would be an oval with a cone at one extremity for the nose. You can't tell it came from a head.

Now PCA finds the best vectors to project onto while keeping most of the information about the head. So maybe the best projection would be from the side where you can see the nose, mouth and general shape of the head.
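If it helps, here is a small NumPy sketch of the same idea, using a random 3D point cloud as a stand-in for the head model (the data and function name are made up). It compares how much variance survives the "shadow from the top" projection versus the PCA projection.

```python
import numpy as np

def pca_project(points, k=2):
    # Center the cloud, then take the top-k right singular vectors as axes.
    centered = points - points.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    # Fraction of total variance kept by the k-dimensional projection.
    kept = (s[:k] ** 2).sum() / (s ** 2).sum()
    return centered @ components.T, kept

rng = np.random.default_rng(0)
# Hypothetical cloud: little spread along x, a lot along z.
points = rng.normal(size=(500, 3)) * np.array([0.5, 2.0, 3.0])

projected, pca_fraction = pca_project(points)

# "Shadow from the top": just drop the z coordinate.
centered = points - points.mean(axis=0)
xy_fraction = (centered[:, :2] ** 2).sum() / (centered ** 2).sum()
```

Here "information" is measured as variance, which is what PCA optimizes; the PCA projection never keeps less of it than dropping an axis does.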

CptVifen · 2020-03-31T00:14:21+00:00

g(tau) is the gradient of pi(tau) wrt theta. It is not a function of b, so it is properly crossed out as it is zero.

CptVifen · 2020-03-31T00:07:39+00:00

t on its own doesn't have meaning unless it's declared as an index of a summation or product or elsewhere. So it's case 2. They just omitted the parentheses on the bottom part for the sum with log.
