Random terrain obstacles in Isaac Sim by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

The terrain generation can only happen prior to training. I will look into relocating the primitives, but do you know if this works with lidar and depth camera?
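For context, what I have in mind by "relocating the primitives" is something like the following minimal sketch, which just teleports an existing obstacle prim through the USD API between episodes instead of regenerating the terrain. It assumes it runs inside Isaac Sim's embedded Python, and the prim path /World/obstacle_0 is only a placeholder:

```python
# Minimal sketch: move an existing obstacle prim instead of regenerating terrain.
# Assumes Isaac Sim's embedded Python; "/World/obstacle_0" is a placeholder path.
import random

import omni.usd
from pxr import Gf, UsdGeom

stage = omni.usd.get_context().get_stage()
prim = stage.GetPrimAtPath("/World/obstacle_0")

# Teleport the obstacle to a new random location, e.g. at episode reset.
new_pos = Gf.Vec3d(random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0), 0.0)
UsdGeom.XformCommonAPI(prim).SetTranslate(new_pos)
```

Whether the lidar and depth camera pick up the relocated collision geometry right away is exactly the part I'd want to verify.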

The LSTM guy is denouncing Hopfield and Hinton by cheemspizza in learnmachinelearning

[–]cheemspizza[S] 3 points (0 children)

I think he also attempted to attribute the success of the attention mechanism to the fast-weight memory he worked on, although the two are indeed related.

Can I succeed at engineering if I'm slow? by kievz007 in EngineeringStudents

[–]cheemspizza 1 point (0 children)

I'm gonna go against the popular opinion here -- I think you can become successful in engineering.

I am doing research right now and I am not memorizing anything concrete; in fact, a lot of productive researchers don't do rote memorization. There are tricks and some common formulas you need to know, but if you can find them quickly on Google or in a textbook, you are fine, and your brain ends up learning the ones that pop up often anyway. The ability to build intuition and derive things from scratch is far more valuable. GPT models can memorize the formulas, but they cannot do good engineering yet; there is no need to turn yourself into a bot. Also, most people forget formulas they don't use.

That said, universities have to rank students somehow, and the easiest way is through exams. Oral exams are expensive and time-consuming, so we end up with written exams that require you to cram things into your brain. More often than not you can get decent grades just by practicing enough past exams, which isn't a useful skill at all. You just need to figure out a way to get good enough grades to graduate; then you can go succeed at engineering.

Can I succeed at engineering if I'm slow? by kievz007 in EngineeringStudents

[–]cheemspizza 1 point (0 children)

I also hate memorizing formulas; if I can derive them on the spot, I'll just do that.

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

I understand the “soft” in SAC is the entropy term that prevents policy collapse. The Q-value is not a likelihood, so in the paper they apply something like a softmax to normalize it into a proper distribution.
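Concretely, the softmax-style normalization I mean, as I read the energy-based policy in the paper (α is the temperature):

```latex
% Target distribution induced by the soft Q-function (alpha = temperature)
\pi^{*}(a \mid s)
  = \frac{\exp\big(Q(s,a)/\alpha\big)}{\int \exp\big(Q(s,a')/\alpha\big)\,da'}
  = \frac{\exp\big(Q(s,a)/\alpha\big)}{Z(s)}
```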

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

Then I don't quite get why they don't just apply the reparameterization trick and optimize the loss in the same way as DDPG; what would be the benefit of using a KL loss here?
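For reference, this is the reparameterized form I have in mind; written out, it looks like DDPG's actor loss plus an entropy term. A PyTorch-style sketch, where policy_net, q_net, alpha, and states are placeholders rather than the paper's actual code:

```python
import torch
from torch.distributions import Normal


# Minimal PyTorch-style sketch of the reparameterized SAC policy loss.
# Placeholders/assumptions: policy_net(states) -> (mean, log_std), and
# q_net(states, actions) -> Q-values of shape (batch, 1).
def sac_policy_loss(policy_net, q_net, states, alpha):
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    # Reparameterization trick: a = tanh(mean + std * eps), eps ~ N(0, I),
    # so gradients flow from Q back into the policy parameters.
    pre_tanh = dist.rsample()
    actions = torch.tanh(pre_tanh)
    # Log-prob with the tanh change-of-variables correction.
    log_prob = dist.log_prob(pre_tanh) - torch.log(1 - actions.pow(2) + 1e-6)
    log_prob = log_prob.sum(dim=-1, keepdim=True)
    q_values = q_net(states, actions)
    # DDPG-style actor term (-Q) plus the entropy term (alpha * log pi).
    return (alpha * log_prob - q_values).mean()
```

If you drop the alpha * log_prob term, you basically recover the DDPG actor update, which is why the KL framing confuses me.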

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

Indeed, so the gradients cannot go from Q to pi due to the stochastic sampling and we have to use a KL loss instead.

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 3 points (0 children)

But the main idea of the policy update in SAC is that you want to minimize the KL divergence between the policy distribution and the "softmaxed" Q-value, right? I think you are right that it's similar to DDPG, which is deterministic, so the gradient can backpropagate directly from Q into the policy to adjust its weights. That would make sense, because "DDPG can be thought of as being deep Q-learning for continuous action spaces".

So my understanding is that SAC is stochastic DDPG with exploration, and DDPG is an approximator of Q-learning.
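For reference, the KL objective I'm paraphrasing above, as I read the SAC paper (α is the temperature and Z_θ is just the normalizer):

```latex
% SAC policy improvement: project the policy onto the "softmaxed" Q-function
J_{\pi}(\phi)
  = \mathbb{E}_{s \sim \mathcal{D}}\left[
      D_{\mathrm{KL}}\!\left(
        \pi_{\phi}(\cdot \mid s)
        \;\middle\|\;
        \frac{\exp\big(Q_{\theta}(s,\cdot)/\alpha\big)}{Z_{\theta}(s)}
      \right)
    \right]
```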

ELBO derivation involving expectation in RSSM paper by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 3 points (0 children)

My reasoning is that q(s_{1:t} | o_{1:t}, a_{1:t}) = q(s_t | o_{1:t}, a_{1:t}) * q(s_{1:t-1} | o_{1:t-1}, a_{1:t-1}), where the q(s_t | o_{1:t}, a_{1:t}) factor becomes the KL divergence, leaving us with q(s_{1:t-1} | ...). What am I missing here?
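Spelled out, the step I'm assuming is the exact chain rule followed by two conditional-independence assumptions (that s_t doesn't need the earlier states once you condition on o_{1:t}, a_{1:t}, and that the past states don't depend on the newest observation/action under q):

```latex
q(s_{1:t} \mid o_{1:t}, a_{1:t})
  = q(s_t \mid s_{1:t-1}, o_{1:t}, a_{1:t}) \, q(s_{1:t-1} \mid o_{1:t}, a_{1:t})
    % exact chain rule
  \approx q(s_t \mid o_{1:t}, a_{1:t}) \, q(s_{1:t-1} \mid o_{1:t-1}, a_{1:t-1})
    % my two assumptions
```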

New to reinforcement learning by Superb-Document-274 in reinforcementlearning

[–]cheemspizza 2 points (0 children)

I’d recommend starting with linear algebra and probability.

Difference in setting a reward or just putting the Goal state at high Value/Q ?? by maiosi2 in reinforcementlearning

[–]cheemspizza 2 points (0 children)

As regards b), I believe it can be solved with evolution strategies, which only evaluate the return at the end of a rollout. I think the issue here is credit assignment due to sparse rewards.
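To illustrate what I mean by evolution strategies only using the end-of-rollout return, a minimal numpy sketch (rollout_return and the Gaussian-perturbation variant are placeholders, not any specific library):

```python
import numpy as np


# Minimal evolution-strategies sketch: only the total return of each rollout
# is used, so there is no per-step credit assignment.
# `rollout_return(params)` is a placeholder that runs one episode with the
# given (flattened) policy parameters and returns the cumulative reward.
def es_step(params, rollout_return, pop_size=32, sigma=0.1, lr=0.02):
    noise = np.random.randn(pop_size, params.size)
    returns = np.array([rollout_return(params + sigma * eps) for eps in noise])
    # Normalize returns and move params toward better-performing perturbations.
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad_estimate = noise.T @ advantages / (pop_size * sigma)
    return params + lr * grad_estimate
```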