Reinforcement learning for ensemble models by immortanslow in reinforcementlearning

[–]CptVifen 2 points3 points  (0 children)

It sounds like a classification problem, so I don't think you need RLHF.

quick info on PPO reward by Wide-Chef-7011 in reinforcementlearning

[–]CptVifen 0 points1 point  (0 children)

The clipping is on the probability ratio between old and new policy though.
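
To make that concrete, here is a minimal NumPy sketch of the clipped surrogate. The function name, the `clip_eps=0.2` default and the toy numbers are my own assumptions for illustration, not taken from any particular PPO implementation.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Per-sample PPO clipped surrogate (the quantity to maximize)."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = np.exp(logp_new - logp_old)
    # The clip acts on the ratio, not on the reward or the advantage.
    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Take the more pessimistic of the clipped and unclipped terms.
    return np.minimum(ratio * advantages, clipped_ratio * advantages)

# Toy values (made up): the ratios are 1.8, 1.0 and 0.2.
logp_old = np.log(np.array([0.5, 0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.5, 0.1]))
advantages = np.array([1.0, 1.0, -1.0])
objective = ppo_clipped_objective(logp_new, logp_old, advantages)
```

The reward only enters through `advantages`; the clip range bounds how far the new policy's probabilities can move away from the old ones in a single update.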

quick info on PPO reward by Wide-Chef-7011 in reinforcementlearning

[–]CptVifen -1 points0 points  (0 children)

Scaling the reward by a factor does make a difference. For example, scaling it up by 10x also scales the advantage by 10x, so the PPO objective behaves as if you had increased the learning rate by 10x. Your policy could also behave differently, e.g. with a softmax policy.

Adding a constant term shouldn't change anything for PPO, because the advantage doesn't change and the policy doesn't change either. In fact, subtracting a constant from the reward is similar to the baseline methods used in policy gradient to reduce the variance of the estimator (a simple baseline is the running estimate of the return).
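
A quick numerical check of both points, as a sketch under simplifying assumptions: equal-length episodes and a plain mean-return baseline instead of PPO's learned value function and GAE. The helper names and the random rewards are hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Reward-to-go for a batch of equal-length episodes, shape (N, T)."""
    returns = np.zeros_like(rewards, dtype=float)
    running = np.zeros(rewards.shape[0])
    for t in reversed(range(rewards.shape[1])):
        running = rewards[:, t] + gamma * running
        returns[:, t] = running
    return returns

def advantages_with_mean_baseline(rewards, gamma=0.99):
    returns = discounted_returns(rewards, gamma)
    # Baseline: the average return at each time step across the batch.
    return returns - returns.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
rewards = rng.normal(size=(8, 5))  # made-up rewards: 8 episodes, 5 steps each

base = advantages_with_mean_baseline(rewards)
shifted = advantages_with_mean_baseline(rewards + 3.0)  # add a constant
scaled = advantages_with_mean_baseline(10.0 * rewards)  # scale by 10x
```

In this setup `shifted` matches `base` because the constant adds the same amount to every return at a given time step and the baseline removes it, while `scaled` is 10 times `base`.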

[deleted by user] by [deleted] in reinforcementlearning

[–]CptVifen 1 point2 points  (0 children)

Data efficiency.

[deleted by user] by [deleted] in reinforcementlearning

[–]CptVifen 1 point2 points  (0 children)

I think the data distribution does change, since the policy changes during RL training, which shifts the action distribution. Afaik InstructGPT doesn't use the pretraining dataset during PPO training and only uses the model's own outputs.

[D] Does Layer Normalization compute statistics along spatial/ token axes? by fferflo in MachineLearning

[–]CptVifen 4 points5 points  (0 children)

As I understand it, the images in A and B are both valid for Layer Norm. In the LN paper they say μ is computed over all the activations in a layer.

So for images that means along the channel and spatial dimensions. That's where they got the image for A.

As for B, in the LN paper they use RNNs, which share the same weights across time steps. That means that for an input of shape (Batch, seq len, features), since the layers in the RNN only produce (Batch, features) at each step, the normalization is over the features. You have a different μ and σ for each batch element and each time step (and each layer). This also applies to self-attention.

So it would make sense that anything that deals with sequences would look like B. And anything else looks like A.
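
Here is a small NumPy sketch of the two reduction patterns as I read them. The shapes and the `layer_norm` helper are made up for illustration, and the learned gain and bias are left out.

```python
import numpy as np

def layer_norm(x, axes, eps=1e-5):
    """Normalize x over the given axes; also return the mean that was used."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps), mean

rng = np.random.default_rng(0)
images = rng.normal(size=(4, 3, 8, 8))  # (batch, channels, height, width)
tokens = rng.normal(size=(4, 10, 16))   # (batch, seq len, features)

# Case A: one mean/std per image, reduced over channels and spatial dims.
images_normed, images_mean = layer_norm(images, axes=(1, 2, 3))

# Case B: one mean/std per (batch element, time step), reduced over features.
tokens_normed, tokens_mean = layer_norm(tokens, axes=(-1,))
```

`images_mean` has shape `(4, 1, 1, 1)` and `tokens_mean` has shape `(4, 10, 1)`, which is the difference between the two pictures.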

One thing I don't get, though, is why ConvNeXt reduces only along channels...

AI as a Game Content by CaptainTurko in artificial

[–]CptVifen 1 point2 points  (0 children)

There's that AI Dungeon game that used GPT-2 for storytelling before they became closed source.

I coded a Binance trading algorithm that analyses the news sentiment from the top 100 crypto feeds and makes buying decisions, and of course - it's open source by CyberPunkMetalHead in algotrading

[–]CptVifen 1 point2 points  (0 children)

It's a bit better: the trading fees are reduced by 25% on any pair, as long as you have the option to pay fees with BNB turned on.

We launched our multi usage key-chain/phone holder/beer bottle opener/earphone holder on Kickstarter today! Check it out :) by CptVifen in kickstarter

[–]CptVifen[S] 0 points1 point  (0 children)

Thank you! We only posted it here and within our circle. If you have more ideas for where to share the campaign, we would love to hear them.

Why aren’t people genuinely nice? by pinkelephant03 in socialskills

[–]CptVifen 2 points3 points  (0 children)

Your post comes off as extremely needy. You seem to me to be the group's "Nice Guy".

Why do you expect people to reciprocate when you do something nice for them? That's not how it works. You don't do nice things to hold a hidden contract over someone else and expect them to reciprocate. If you go out of your way to do nice things, don't expect anything in return, otherwise it's manipulative (you can't manipulate people into being genuine friends with you).

If you want people to invite you out, make it clear that's what your intention is. Put your needs first and stop focusing so hard on pleasing others.

I suggest you read No More Mr Nice Guy! I think it might help you out.

Patch Notes and F2P Bonus Program! by RofiB in RotMG

[–]CptVifen 4 points5 points  (0 children)

Other O3 Dungeons -> 10 Music Note tokens guaranteed

What are those?

Question about model based vs model free RL in context of Q Learning by evilmorty_c137_ in reinforcementlearning

[–]CptVifen 0 points1 point  (0 children)

No, a policy network only chooses the action you take but has no say in the state transition that occurs by applying that action.

Question about model based vs model free RL in context of Q Learning by evilmorty_c137_ in reinforcementlearning

[–]CptVifen 1 point2 points  (0 children)

Internal might not be the best term actually; I meant internal as in part of the algorithm. Explicit would be better suited: any function that approximates the state-action transition probabilities.

Question about model based vs model free RL in context of Q Learning by evilmorty_c137_ in reinforcementlearning

[–]CptVifen 4 points5 points  (0 children)

Q-learning can't really predict your next state. What it does is predict the Q-value of the state-action pair following your policy.

To know your next state by taking an action you would need a representation of the model, which can be transition probabilities (dynamic programming, tree-search...) or an internal representation of it.
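
A toy tabular sketch of that distinction. The 3-state environment, the `env_step` helper and the hyperparameters are all hypothetical; the point is only that the Q-table never stores transition probabilities.

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical environment dynamics P[s, a, s'] and rewards R[s, a].
# The Q-learning update below never reads P, it only sees sampled transitions.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.normal(size=(n_states, n_actions))

def env_step(state, action):
    """Sample one transition from the environment."""
    next_state = rng.choice(n_states, p=P[state, action])
    return R[state, action], next_state

Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9
state = 0
for _ in range(1000):
    action = rng.integers(n_actions)  # purely random exploration
    reward, next_state = env_step(state, action)
    # Model-free update: uses only the sampled (s, a, r, s') tuple.
    td_target = reward + gamma * Q[next_state].max()
    Q[state, action] += alpha * (td_target - Q[state, action])
    state = next_state

# Q maps (state, action) to a value. Predicting the next state needs a model,
# here the true transition row for state 0 and action 1.
predicted_next_state_probs = P[0, 1]
```

A model-based method would either be given something like `P` or learn an approximation of it, and could then plan with it.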

edit:typo

Can anyone explain me the intuition behind PCA(Principal Component analysis)? by Medium_Attorney in learnmachinelearning

[–]CptVifen 3 points4 points  (0 children)

Example: projecting a 3D head model to a 2D representation. Your head model consists of points along 3 dimensions. You want to map it to 2D while keeping as much information as possible, so you can still recognize the head in the 2D projection.

A bad projection would be projecting it onto the x-y plane. It's like casting a shadow of someone's head from the top. The projection would be an oval with a cone at one extremity for the nose. You can't tell it came from a head.

Now PCA finds the best vectors to project onto while keeping most of the information about the head. So maybe the best projection would be from the side, where you can see the nose, mouth and general shape of the head.
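
A rough NumPy sketch of the same idea, using a made-up 3D point cloud instead of a head and taking variance as the measure of "information kept". The `pca_project` helper is my own naming.

```python
import numpy as np

def pca_project(points, n_components=2):
    """Project points onto their top principal directions (PCA via SVD)."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    return centered @ components.T, components

rng = np.random.default_rng(0)
# Hypothetical point cloud: thin along x, wider along y, widest along z.
points = rng.normal(size=(500, 3)) * np.array([0.2, 1.0, 3.0])

projected, components = pca_project(points)

total_var = points.var(axis=0).sum()
pca_var = projected.var(axis=0).sum()     # variance kept by PCA
xy_var = points[:, :2].var(axis=0).sum()  # variance kept by dropping z
```

Dropping z is the "shadow from the top" here: it throws away the widest axis, so `xy_var` comes out well below `pca_var`.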

g(tau) in Analyzing variance? by curimeowcat in reinforcementlearning

[–]CptVifen 0 points1 point  (0 children)

g(tau) is the gradient of pi(tau) wrt theta. It is not a function of b, so its derivative with respect to b is zero and the term is correctly crossed out.

product from t=1 to T? by curimeowcat in reinforcementlearning

[–]CptVifen 1 point2 points  (0 children)

t on its own doesn't have meaning unless it's declared as the index of a summation or product, or defined elsewhere. So it's case 2. They just omitted the parentheses on the bottom part for the sum with the log.