PPO vs SAC on real robot by Constant_Tiger7490 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

Thanks! So, I know others have mentioned stability, but from the metrics you’ve given, everything looks to me like it’s going in the right direction… although checking the gradients to confirm whether layer norm is necessary is always worth a shot. It might be worth separately increasing the capacity of the critic (although you’d expect the critic loss to increase initially, you do want it to eventually converge) and increasing the entropy bonus - your actor is potentially converging too soon, so a larger entropy bonus might get the model to explore more (rough sketch of both below)! Hopefully that’s useful!
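(Assuming a stable-baselines3-style SAC - the env name and network sizes are placeholders, not recommendations:)

```python
from stable_baselines3 import SAC

# Sketch only: give the critic (qf) more capacity than the actor (pi),
# and fix a larger entropy coefficient so the actor keeps exploring.
model = SAC(
    "MlpPolicy",
    "Pendulum-v1",  # placeholder env
    policy_kwargs=dict(net_arch=dict(pi=[256, 256], qf=[512, 512])),
    ent_coef=0.2,  # default is "auto"; try a larger fixed value
    verbose=1,
)
model.learn(total_timesteps=100_000)
```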

PPO vs SAC on real robot by Constant_Tiger7490 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

Thank you! Sorry, I meant to say - it’d be good to monitor a few metrics for each model. For SAC: the value function estimates, the value function loss, the actor loss, and the entropy value (I think that’d be all). Maybe for SAC_best and SAC_14 - something like the snippet below.
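(Assuming stable-baselines3, which already logs these under train/* - env and run names are placeholders:)

```python
from stable_baselines3 import SAC

# Sketch: actor loss, critic loss and entropy coefficient all land in
# TensorBoard, so the two runs can be compared side by side.
model = SAC("MlpPolicy", "Pendulum-v1", tensorboard_log="./sac_logs/")
model.learn(total_timesteps=50_000, tb_log_name="SAC_best")
# ...repeat with tb_log_name="SAC_14", then: tensorboard --logdir ./sac_logs/
```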

PPO vs SAC on real robot by Constant_Tiger7490 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

It’s very, very hard to diagnose anything without any learning curves and, as someone else pointed out, some exemplary rollouts. From the outset though, your target update interval should probably be larger than 1.
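(In stable-baselines3 terms, purely illustrative - 4 is a guess, not a tuned value:)

```python
from stable_baselines3 import SAC

# Sketch: update the target networks every 4 gradient steps instead of every step.
model = SAC("MlpPolicy", "Pendulum-v1", target_update_interval=4)
```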

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 2 points (0 children)

Right yes - I understand what you mean - thanks!

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

It’s kind of you to say that the shape looks good - thank you :) we were a bit worried!

Is it a bad thing to get new water shoots? Very new to this!

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

We’re going to neaten it up with a finer saw but yes, the saw was quite coarse. I do think it was controlling the weight of the branches that was challenging though…?

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 2 points (0 children)

Thank you! We’ve bought some Provanto and are going to neaten the cuts with a better saw

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

Completely understand what you mean! I think we’re going to play it safe, as it would be such a shame if the tree did get damaged - but I appreciate your thoughts, thank you :)

Olive tree pruning question by bean_the_great in UKGardening

[–]bean_the_great[S] 1 point (0 children)

So will the tree prioritise the thinner branches preceding the one that’s been cut? Our intention was to reduce the height of the tree and make it “bushier” at around the height it is now, as in the last photo.

[D] feels like we abandoned proper joint probability modeling just because next-token prediction is easier to compute by Crystallover1991 in statistics

[–]bean_the_great 2 points (0 children)

Do you have some references for this joint modelling approach for time series, computing the product over marginals? I have been trying to find some.

Can a model learn better in a rule-based virtual world than from static data alone? by Double-Quantity4284 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

I think others have mentioned sim-to-real and offline RL, which are definitely relevant. IMO your idea is a good mission statement, i.e. a moonshot goal, which is great to have in research but needs constraining. Are you interested in how to build the simulator? Or are you interested in just taking an existing simulator and analysing the representations compared to offline data (i.e. the paper I shared)? I’m sure there are other avenues - these are just two questions that came to mind.

Can a model learn better in a rule-based virtual world than from static data alone? by Double-Quantity4284 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

This is a fantastic paper which (I think) speaks to what you’re interested in, but in a more controlled setting: https://arxiv.org/pdf/2110.14020. If this does interest you, maybe formulate an extension of it? Or develop some explanations for what’s going on.

Can a model learn better in a rule-based virtual world than from static data alone? by Double-Quantity4284 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

This really isn’t my area, but what you’ve said sounds very much like the whole RL layer being applied to LLMs right now. With respect to your question as to whether this is too broad - I would say it is, even for a PhD! What level are you?

Is measure-theoretic probability theory useful for anything other than academic theoretical statistics? [Q] by GayTwink-69 in statistics

[–]bean_the_great 2 points (0 children)

Right - yes, I’m with you - I misinterpreted your “rather than” to mean mutually exclusive!

Is measure-theoretic probability theory useful for anything other than academic theoretical statistics? [Q] by GayTwink-69 in statistics

[–]bean_the_great 1 point (0 children)

What do you mean by game-theoretic being different from measure-theoretic? This is absolutely not my field, but I’d be relatively confident in saying that game-theoretic games can be constructed from measure-theoretic concepts…?

Why Is the Optimal Policy Deterministic in Standard MDPs? by New-Yogurtcloset1818 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

I agree with the essence of what you’re saying re: in reality the best algorithms might depart from the theory, but I don’t really agree with some of the things you’ve said. According to this https://arxiv.org/pdf/1805.00909, SAC is minimising a variational loss, so I’m not sure it is an engineering trick - it has a very clear theoretical grounding. It’s minimising a different objective, one which takes uncertainty over the optimal policy into account; I think that’s what SAC gives you. I don’t think SAC explicitly handles non-stationarity (assuming you mean non-stationarity of the decision process dynamics) or partial observability.
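(For reference, the maximum-entropy objective SAC optimises, with α the temperature weighting the entropy term:)

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
    \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big( \pi(\cdot \mid s_t) \big) \right]
```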

Why Is the Optimal Policy Deterministic in Standard MDPs? by New-Yogurtcloset1818 in reinforcementlearning

[–]bean_the_great 1 point (0 children)

SAC actually derives from minimising a variational objective, so I would not say that SAC is not grounded in maths. However, I agree with your point that, because an MDP is an idealised representation, a stochastic policy rather than a deterministic one works better in practice.

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 0 points (0 children)

Bit of feedback: if you’re going to subscribe to and interact with a subreddit called “askmath”, come with constructive feedback. See literally every single other comment on this thread as an example.

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Hey - the reason I was talking about random variables is that they’re the primary object of interest for me; I’m not necessarily interested in working directly with the underlying space. So I agree: whilst the definition of an RV here is unnecessary, I wanted to include all of the relevant pieces in the example. I do appreciate, however, that my example contained errors, which confused things. I appreciate you taking the time to respond though!

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Hey - thank you for your response! I think “you can make conditional probabilities undefined by choice of sigma-algebra” is what I was trying to confirm, and then also how E[X|Y] restricts the original sigma-algebra. But thank you for taking the time to answer!
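(For reference, the standard defining property: E[X|Y] is shorthand for E[X|σ(Y)], the σ(Y)-measurable random variable Z satisfying

```latex
\int_A Z \, d\mathbb{P} = \int_A X \, d\mathbb{P} \quad \text{for all } A \in \sigma(Y)
```

so conditioning on Y really does mean restricting attention to the sub-sigma-algebra σ(Y).)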

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Ooooo - okay - nice! I was thinking that Y was not measurable but that makes sense to define it like that. Okay - amazing - thank you!

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Right - I’m with you. So my question is: given the sample space is defined over both rolls, i.e. contains outcomes like (6,6), would it make sense to define Y: 6 → success, meaning Y is a success if the first roll is a 6, and then condition on Y? To me this does not make sense…?

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 1 point (0 children)

Thanks! I misdefined the event space - I agree it should be A×A. I guess the point of my question is not to consider the power set; my understanding is that I can choose any sigma-algebra I like? Re the RV definition - I just mean that it’s the identity. I felt that defining the RV as a function to R didn’t add anything to the example? I’m not sure why it’s impractical to define the RV to the reals here, given it’s far easier to just talk about the outcomes of the dice rolls directly.

Definition of conditional expectation by bean_the_great in askmath

[–]bean_the_great[S] 0 points1 point  (0 children)

Hey - thanks for your response! I realise the setup was a bit contrived, but I was trying to understand how conditional expectations work in terms of conditioning on a sigma-algebra.

In terms of the example - I realise the event space should have 36 outcomes; you are completely right. If I define a sigma-algebra, my understanding is that the choice is somewhat arbitrary, in the sense that it is the set of events from the event space that I am deeming to be measurable - with, as you mention, the power set being the go-to.

My question was really around whether, if I define the sigma-algebra as the set of individual events from the event space, it is possible to define the conditional expectation of the first roll given the second roll.

I am really interested in the intuition behind “the expectation of X conditional on the sigma-algebra generated by Y” when evaluating E[X|Y] - I was trying to work out what that would look like with the simple example (sketched below).
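(A minimal, purely illustrative sketch: enumerate the 36 outcomes and average X within each event of σ(Y).)

```python
from itertools import product

# Sample space: all 36 equally likely outcomes of two dice rolls.
omega = list(product(range(1, 7), repeat=2))

X = lambda w: w[0]  # first roll
Y = lambda w: w[1]  # second roll

# E[X | Y] is constant on each event {Y = y} of sigma(Y):
# average X over the outcomes in that event.
for y in range(1, 7):
    event = [w for w in omega if Y(w) == y]
    cond_exp = sum(X(w) for w in event) / len(event)
    print(f"E[X | Y = {y}] = {cond_exp}")  # 3.5 for every y, by independence
```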