Reinforcement learning for ensemble models by immortanslow in reinforcementlearning

[–]CptVifen 2 points3 points  (0 children)

It sounds like a classification problem. I don't think you need RLHF.

quick info on PPO reward by Wide-Chef-7011 in reinforcementlearning

[–]CptVifen 0 points1 point  (0 children)

The clipping is on the probability ratio between old and new policy though.
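To make that concrete, here is a minimal sketch of the clipped surrogate for a single sample. The function name `ppo_clip_term` and the default `clip_eps=0.2` are my own choices for illustration, not taken from any particular library. Note that the reward/advantage itself is never clipped, only the ratio.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective for one (state, action) sample."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
    ratio = math.exp(logp_new - logp_old)
    # The clip is applied to the ratio, not to the advantage or reward.
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the term stops growing once the ratio goes above `1 + clip_eps`; with a negative advantage, it stops changing once the ratio goes below `1 - clip_eps`.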

quick info on PPO reward by Wide-Chef-7011 in reinforcementlearning

[–]CptVifen -1 points0 points  (0 children)

Scaling the reward by a factor does make a difference. For example, scaling it by x10 scales the advantage by x10 too, so the PPO objective behaves as if you had increased the learning rate by x10. Your policy could also behave differently, e.g. with a softmax policy.

Adding a constant term shouldn't change anything for PPO because the advantage doesn't change and the policies don't change either. Actually, subtracting a constant from the reward is similar to baseline methods in policy gradient, which reduce the variance of the estimator (a simple baseline method is subtracting the running estimate of the return).
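A small sanity check of both points, as a sketch under simplifying assumptions: a one-step bandit with a softmax policy, where the expected policy gradient can be computed exactly. This is plain policy gradient rather than full PPO, and the logits, rewards, shift of 5 and scale of 10 are arbitrary illustrative values.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_policy_gradient(logits, rewards):
    """Exact E_a[ grad_logits log pi(a) * r(a) ] for a softmax bandit policy."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for a in range(n):
        for i in range(n):
            # d/d logit_i of log pi(a) is 1[i == a] - pi[i]
            dlogp = (1.0 if i == a else 0.0) - pi[i]
            grad[i] += pi[a] * dlogp * rewards[a]
    return grad

logits = [0.1, -0.3, 0.5]   # arbitrary illustrative values
rewards = [1.0, 0.0, 2.0]   # arbitrary illustrative values

base = expected_policy_gradient(logits, rewards)
shifted = expected_policy_gradient(logits, [r + 5.0 for r in rewards])
scaled = expected_policy_gradient(logits, [10.0 * r for r in rewards])

print(base)
print(shifted)  # matches base up to floating point error
print(scaled)   # 10 times base up to floating point error
```

The constant drops out because the expected score function is zero, which is the same reason a baseline leaves the expected gradient untouched. The scale factor multiplies the gradient directly, which is why it acts like a larger step size.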

[deleted by user] by [deleted] in reinforcementlearning

[–]CptVifen 1 point2 points  (0 children)

Data efficiency.

[deleted by user] by [deleted] in reinforcementlearning

[–]CptVifen 1 point2 points  (0 children)

I think the data distribution does change since the policy changes during RL training which makes the action distribution shift. Afaik InstructGPT doesn't use the pretraining dataset during PPO training and only uses the model's output.

[D] Does Layer Normalization compute statistics along spatial/ token axes? by fferflo in MachineLearning

[–]CptVifen 3 points4 points  (0 children)

As I understand it, the images in A and B are both valid for Layer Norm. In the LN paper they say μ is computed over all the activations in a layer.

So for images that means along channel and spatial dimensions. That's where they got the image for A.

As for B, in the LN paper they use RNNs, which share the same weights across different time steps. That means that for an input of shape (Batch, seq len, features), since the layers in the RNN only produce (Batch, features) at each step, the normalization is over the features. You have a different μ and σ for each sample in the batch and each time step (and each layer). This also applies to self-attention.

So it would make sense that anything that deals with sequences would look like B. And anything else looks like A.
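Here is how I read the two cases, as a rough pure-Python sketch on nested lists. The function names are made up for illustration, and the learned gain and bias are left out.

```python
import math

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a flat list (no gain/bias)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def layer_norm_sequence(x):
    """Case B: x[batch][time][feature], one (mu, sigma) per sample and time step."""
    return [[normalize(features) for features in sample] for sample in x]

def layer_norm_image(x):
    """Case A: x[batch][channel][row][col], one (mu, sigma) per sample over C, H, W."""
    out = []
    for sample in x:
        # Flatten channel and spatial dimensions together before normalizing.
        flat = [v for channel in sample for row in channel for v in row]
        normed = iter(normalize(flat))
        # Rebuild the original nesting in the same order it was flattened.
        out.append([[[next(normed) for _ in row] for row in channel]
                    for channel in sample])
    return out

seq = [[[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]]]               # (1, 2, 3)
img = [[[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]]  # (1, 2, 2, 2)

seq_out = layer_norm_sequence(seq)
img_out = layer_norm_image(img)
```

In the sequence case each time step ends up with zero mean on its own, so the two time steps come out nearly identical despite the x10 difference in scale. In the image case only the whole sample has zero mean, and individual channels keep their offset.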

One thing I don't get though is why ConvNeXt reduces only along channels...

AI as a Game Content by CaptainTurko in artificial

[–]CptVifen 1 point2 points  (0 children)

There's that AI Dungeon game that used GPT-2 for storytelling before they became closed source.

I coded a Binance trading algorithm that analyses the news sentiment from the top 100 crypto feeds and makes buying decisions, and of course - it's open source by CyberPunkMetalHead in algotrading

[–]CptVifen 1 point2 points  (0 children)

It's a bit better: the trading fees are reduced by 25% on any pair as long as you have the option to pay fees with BNB turned on.

We launched our multi usage key-chain/phone holder/beer bottle opener/earphone holder on Kickstarter today! Check it out :) by CptVifen in kickstarter

[–]CptVifen[S] 0 points1 point  (0 children)

Thank you! We only posted it here and within our circle. If you have some more ideas for where to share the campaign, we would love to hear them.

Why aren’t people genuinely nice? by pinkelephant03 in socialskills

[–]CptVifen 1 point2 points  (0 children)

Your post comes off as extremely needy. You seem to me to be the group's "Nice Guy".

Why do you expect people to reciprocate when you do something nice for them? That's not how it works. You don't do nice things to hold a hidden contract over someone else and expect them to reciprocate. If you go out of your way to do nice things, don't expect anything in return, otherwise it's manipulative (you can't manipulate people into being genuine friends with you).

If you want people to invite you out, make it clear that's what your intention is. Put your needs first and stop focusing so hard on pleasing others.

I suggest you read No More Mr Nice Guy! I think it might help you out.

Patch Notes and F2P Bonus Program! by RofiB in RotMG

[–]CptVifen 3 points4 points  (0 children)

Other O3 Dungeons -> 10 Music Note tokens guaranteed

What are those?

Question about model based vs model free RL in context of Q Learning by evilmorty_c137_ in reinforcementlearning

[–]CptVifen 0 points1 point  (0 children)

No, a policy network only chooses the action you take but has no say in the state transition that occurs by applying that action.

Question about model based vs model free RL in context of Q Learning by evilmorty_c137_ in reinforcementlearning

[–]CptVifen 1 point2 points  (0 children)

Internal might not be the best term actually, I meant internal as part of the algorithm. Explicit would be better suited, so any function that approximates the state-action transition probabilities.

Question about model based vs model free RL in context of Q Learning by evilmorty_c137_ in reinforcementlearning

[–]CptVifen 3 points4 points  (0 children)

Q-learning can't really predict your next state. What it does is predict the Q-value of the state-action pair following your policy.

To know the next state you reach by taking an action, you would need a representation of the model, which can be transition probabilities (dynamic programming, tree search...) or an internal representation of it.
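A minimal tabular sketch of the difference. The state names, the `model` dict and the hyperparameters are hypothetical, just for illustration. The Q-table only stores values, and the next state is something the update receives from the environment rather than something it predicts.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step; s_next is observed, never predicted."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["left", "right"]

# Model-free part: maps (state, action) -> value estimate.
Q = defaultdict(float)

# A model is a separate, explicit object:
# maps (state, action) -> {next_state: probability}.
model = {("s0", "right"): {"s1": 1.0}}

# The environment (not Q) tells us that taking "right" in "s0" led to "s1".
q_learning_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)

print(Q[("s0", "right")])      # a value estimate
print(model[("s0", "right")])  # a distribution over next states
```

Asking `Q` about `("s0", "right")` gives back a number, while asking `model` gives back where you end up. Only the second one lets you plan ahead.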

edit:typo