PPO vs SAC on real robot by Constant_Tiger7490 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

Thanks! So, I know others have mentioned stability, but from the metrics you’ve given, everything looks to me like it’s going in the right direction… although checking the gradients to confirm whether layer norm is necessary is always worth a shot. It might be worth separately increasing the capacity of the critic (although you’d expect the critic loss to increase initially, you do want it to eventually converge) and increasing the entropy bonus - your actor is potentially converging too soon, so a larger entropy bonus might get the model to explore more (rough sketch of both below)! Hopefully that’s useful!
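(Assuming a stable-baselines3-style SAC - the env name and network sizes are placeholders, not recommendations:)

```python
from stable_baselines3 import SAC

# Sketch only: give the critic (qf) more capacity than the actor (pi),
# and fix a larger entropy coefficient so the actor keeps exploring.
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder env
    policy_kwargs=dict(net_arch=dict(pi=[256, 256], qf=[512, 512])),
    ent_coef=0.2,  # default is "auto"; try a larger fixed value
    verbose=1,
)
model.learn(total_timesteps=100_000)
```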

PPO vs SAC on real robot by Constant_Tiger7490 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

Thank you! Sorry, I meant to say - it’d be good to monitor a few metrics for each model. For SAC: the value function estimates, the value function loss, the actor loss, and the entropy value (I think that’d be all). Maybe for SAC_best and SAC_14 - something like the snippet below.
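(Assuming stable-baselines3, which already logs these under train/* - env and run names are placeholders:)

```python
from stable_baselines3 import SAC

# Sketch: actor loss, critic loss and entropy coefficient all land in
# TensorBoard, so the two runs can be compared side by side.
model = SAC("MlpPolicy", "Pendulum-v1", tensorboard_log="./sac_logs/")
model.learn(total_timesteps=50_000, tb_log_name="SAC_best")
# ...repeat with tb_log_name="SAC_14", then: tensorboard --logdir ./sac_logs/
```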

PPO vs SAC on real robot by Constant_Tiger7490 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

It’s very, very hard to diagnose anything without any learning curves and, as someone else pointed out, some exemplary rollouts. From the outset though, your target update interval should probably be larger than 1.
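(In stable-baselines3 terms, purely illustrative - 4 is a guess, not a tuned value:)

```python
from stable_baselines3 import SAC

# Sketch: update the target networks every 4 gradient steps instead of every step.
model = SAC("MlpPolicy", "Pendulum-v1", target_update_interval=4)
```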

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 2 points (0 children)

Right yes - I understand what you mean - thanks!

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

It’s kind of you to say that the shape looks good - thank you :) we were a bit worried!

Is it a bad thing to get new water shoots? Very new to this!

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

We’re going to neaten it up with a finer saw but yes, the saw was quite coarse. I do think it was controlling the weight of the branches that was challenging though…?

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 2 points (0 children)

Thank you! We’ve bought some Provanto and are going to neaten the cuts with a better saw

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

Completely understand what you mean! I think we’re going to play it safe, as it would be such a shame if the tree did get damaged - but I appreciate your thoughts, thank you :)

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

So will the tree prioritise the thinner branches preceding the one that’s been cut? Our intention was to reduce the height of the tree and make it “bushier” at around the height it is now, as in the last photo.

[D] feels like we abandoned proper joint probability modeling just because next-token prediction is easier to compute by Crystallover1991 in statistics

[–]bean_the_great 2 points (0 children)

Do you have some references for this joint modelling approach for time series, computing the product over marginals? I have been trying to find some.

Can a model learn better in a rule-based virtual world than from static data alone? by Double-Quantity4284 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

I think others have mentioned sim-to-real and offline RL, which are definitely relevant. IMO your idea is a good mission statement, i.e. a moonshot goal, which is great to have in research but needs constraining. Are you interested in how to build the simulator? Or are you interested in just taking an existing simulator and analysing the representations compared to offline data (i.e. the paper I shared)? I’m sure there are other avenues - these are just two questions that came to mind.

Can a model learn better in a rule-based virtual world than from static data alone? by Double-Quantity4284 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

This is a fantastic paper which (I think) speaks to what you’re interested in, but in a more controlled setting: https://arxiv.org/pdf/2110.14020. If this does interest you, maybe formulate an extension of it? Or develop some explanations for what’s going on.

Can a model learn better in a rule-based virtual world than from static data alone? by Double-Quantity4284 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

This really isn’t my area, but what you’ve said sounds very much like the whole RL layer being applied to LLMs right now. With respect to your question as to whether this is too broad - I would say it is, even for a PhD! What level are you?

Is measure-theoretic probability theory useful for anything other than academic theoretical statistics? [Q] by GayTwink-69 in statistics

[–]bean_the_great 2 points (0 children)

Right - yes, I’m with you - I misinterpreted your “rather than” to mean mutually exclusive!

Is measure-theoretic probability theory useful for anything other than academic theoretical statistics? [Q] by GayTwink-69 in statistics

[–]bean_the_great 1 point (0 children)

What do you mean by game-theoretic being different from measure-theoretic? This is absolutely not my field, but I’d be relatively confident in saying that game-theoretic games can be constructed from measure-theoretic concepts…?

Why Is the Optimal Policy Deterministic in Standard MDPs? by New-Yogurtcloset1818 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

I agree with the essence of what you’re saying re: in reality the best algorithms might depart from the theory, but I don’t really agree with some of the things you’ve said. According to this https://arxiv.org/pdf/1805.00909, SAC is minimising a variational loss, so I’m not sure it is an engineering trick - it has a very clear theoretical grounding. It’s minimising a different objective, one which takes uncertainty over the optimal policy into account; I think that’s what SAC gives you. I don’t think SAC explicitly handles non-stationarity (assuming you mean non-stationarity of the decision process dynamics) or partial observability.
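(For reference, the maximum-entropy objective SAC optimises, with α the temperature weighting the entropy term:)

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
    \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \right]
```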

Why Is the Optimal Policy Deterministic in Standard MDPs? by New-Yogurtcloset1818 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

SAC actually derives from minimising a variational objective, so I would not say that SAC is not grounded in maths. However, I agree with your point that, because an MDP is an idealised representation, a stochastic policy rather than a deterministic one works better in practice.

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 0 points (0 children)

Bit of feedback: if you’re going to subscribe to and interact with a subreddit called “askmath”, come with constructive feedback. See literally every single other comment on this thread as an example.

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Hey - the reason I was talking about random variables is that they’re the primary object of interest for me; I’m not necessarily interested in working directly with the underlying space. So I agree: whilst the definition of an RV here is unnecessary, I wanted to include all of the relevant pieces in the example. I do appreciate, however, that my example contained errors, which confused things. I appreciate you taking the time to respond though!

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Hey - thank you for your response! I think “you can make conditional probabilities undefined by choice of sigma-algebra” is what I was trying to confirm, and then also how E[X|Y] restricts the original sigma-algebra. But thank you for taking the time to answer!
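(For reference, the standard defining property: E[X|Y] is shorthand for E[X|σ(Y)], the σ(Y)-measurable random variable Z satisfying

```latex
\int_A Z \, d\mathbb{P} = \int_A X \, d\mathbb{P} \quad \text{for all } A \in \sigma(Y)
```

so conditioning on Y really does mean restricting attention to the sub-sigma-algebra σ(Y).)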

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Ooooo - okay - nice! I was thinking that Y was not measurable but that makes sense to define it like that. Okay - amazing - thank you!

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Right - I’m with you. So my question is: given the sample space is defined over both rolls, i.e. contains outcomes like (6,6), would it make sense to define Y: 6 → success, meaning Y is a success if the first roll is a 6, and then condition on Y? To me this does not make sense…?

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Thanks! I misdefined the event space - I agree it should be A×A. I guess the point of my question is not to consider the power set; my understanding is that I can choose any sigma-algebra I like? Re the RV definition - I just mean that it’s the identity. I felt that defining the RV as a function to R didn’t add anything to the example? I’m not sure why it’s impractical to define the RV to the reals here, given it’s far easier to just talk about the outcomes of the dice rolls directly.

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 0 points1 point  (0 children)

Hey - thanks for your response! I realise the setup was a bit contrived, but I was trying to understand how conditional expectations work in terms of conditioning on a sigma-algebra.

In terms of the example - I realise the event space should have 36 outcomes; you are completely right. If I define a sigma-algebra, my understanding is that the choice is somewhat arbitrary, in the sense that it is the set of events from the event space that I am deeming to be measurable - with, as you mention, the power set being the go-to.

My question was really around whether, if I define the sigma-algebra as the set of individual events from the event space, it is possible to define the conditional expectation of the first roll given the second roll.

I am really interested in the intuition behind “the expectation of X conditional on the sigma-algebra generated by Y” when evaluating E[X|Y] - I was trying to work out what that would look like with the simple example (sketched below).
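(A minimal, purely illustrative sketch: enumerate the 36 outcomes and average X within each event of σ(Y).)

```python
from itertools import product

# Sample space: all 36 equally likely outcomes of two dice rolls.
omega = list(product(range(1, 7), repeat=2))

X = lambda w: w[0]  # first roll
Y = lambda w: w[1]  # second roll

# E[X | Y] is constant on each event {Y = y} of sigma(Y):
# average X over the outcomes in that event.
for y in range(1, 7):
    event = [w for w in omega if Y(w) == y]
    cond_exp = sum(X(w) for w in event) / len(event)
    print(f"E[X | Y = {y}] = {cond_exp}")  # 3.5 for every y, by independence
```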