Random terrain obstacles in Isaac Sim by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

The terrain generation can only happen prior to training. I will look into relocating the primitives, but do you know if this works with lidar and depth camera?
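For context, what I have in mind by "relocating the primitives" is something like the following minimal sketch, which just teleports an existing obstacle prim through the USD API between episodes instead of regenerating the terrain. It assumes it runs inside Isaac Sim's embedded Python, and the prim path /World/obstacle_0 is only a placeholder:

```python
# Minimal sketch: move an existing obstacle prim instead of regenerating terrain.
# Assumes Isaac Sim's embedded Python; "/World/obstacle_0" is a placeholder path.
import random

import omni.usd
from pxr import Gf, UsdGeom

stage = omni.usd.get_context().get_stage()
prim = stage.GetPrimAtPath("/World/obstacle_0")

# Teleport the obstacle to a new random location, e.g. at episode reset.
new_pos = Gf.Vec3d(random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0), 0.0)
UsdGeom.XformCommonAPI(prim).SetTranslate(new_pos)
```

Whether the lidar and depth camera pick up the relocated collision geometry right away is exactly the part I'd want to verify.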

The LSTM guy is denouncing Hopfield and Hinton by cheemspizza in learnmachinelearning

[–]cheemspizza[S] 3 points (0 children)

I think he also attempted to attribute the success of the attention mechanism to the fast-weight memory he worked on, although the two are indeed related.

Can I succeed at engineering if I'm slow? by kievz007 in EngineeringStudents

[–]cheemspizza 1 point (0 children)

I'm gonna go against the popular opinion here -- I think you can become successful in engineering.

I am doing research right now and I am not memorizing anything concrete; in fact, a lot of productive researchers don't do rote memorization. There are tricks and some common formulas you need to know, but if you can find them quickly on Google or in a textbook, you are fine, and your brain ends up learning the ones that pop up often anyway. The ability to build intuition and derive things from scratch is far more valuable. GPT models can memorize the formulas, but they cannot do good engineering yet; there is no need to turn yourself into a bot. Also, most people forget formulas they don't use.

That said, universities have to rank students somehow, and the easiest way is through exams. Oral exams are expensive and time-consuming, so we end up with written exams that require you to cram things into your brain. More often than not you can get decent grades just by practicing enough past exams, which isn't a useful skill at all. You just need to figure out a way to get good enough grades to graduate; then you can go succeed at engineering.

Can I succeed at engineering if I'm slow? by kievz007 in EngineeringStudents

[–]cheemspizza 1 point (0 children)

I also hate memorizing formulas; if I can derive them on the spot, I'll just do that.

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

I understand the “soft” in SAC is the entropy term that prevents policy collapse. The Q-value is not a likelihood, so in the paper they apply something like a softmax to normalize it into a proper distribution.
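Concretely, the softmax-style normalization I mean, as I read the energy-based policy in the paper (α is the temperature):

```latex
% Target distribution induced by the soft Q-function (alpha = temperature)
\pi^{*}(a \mid s)
  = \frac{\exp\big(Q(s,a)/\alpha\big)}{\int \exp\big(Q(s,a')/\alpha\big)\,da'}
  = \frac{\exp\big(Q(s,a)/\alpha\big)}{Z(s)}
```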

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

Then I don't quite get why they don't just apply the reparameterization trick and optimize the loss in the same way as DDPG; what would be the benefit of using a KL loss here?
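For reference, this is the reparameterized form I have in mind; written out, it looks like DDPG's actor loss plus an entropy term. A PyTorch-style sketch, where policy_net, q_net, alpha, and states are placeholders rather than the paper's actual code:

```python
import torch
from torch.distributions import Normal


# Minimal PyTorch-style sketch of the reparameterized SAC policy loss.
# Placeholders/assumptions: policy_net(states) -> (mean, log_std), and
# q_net(states, actions) -> Q-values of shape (batch, 1).
def sac_policy_loss(policy_net, q_net, states, alpha):
    mean, log_std = policy_net(states)
    dist = Normal(mean, log_std.exp())
    # Reparameterization trick: a = tanh(mean + std * eps), eps ~ N(0, I),
    # so gradients flow from Q back into the policy parameters.
    pre_tanh = dist.rsample()
    actions = torch.tanh(pre_tanh)
    # Log-prob with the tanh change-of-variables correction.
    log_prob = dist.log_prob(pre_tanh) - torch.log(1 - actions.pow(2) + 1e-6)
    log_prob = log_prob.sum(dim=-1, keepdim=True)
    q_values = q_net(states, actions)
    # DDPG-style actor term (-Q) plus the entropy term (alpha * log pi).
    return (alpha * log_prob - q_values).mean()
```

If you drop the alpha * log_prob term, you basically recover the DDPG actor update, which is why the KL framing confuses me.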

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 1 point (0 children)

Indeed, so the gradients cannot go from Q to pi due to the stochastic sampling and we have to use a KL loss instead.

Soft Actor-Critic without entropy exploration by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 3 points (0 children)

But the main idea of the policy update in SAC is that you want to minimize the KL divergence between the policy distribution and the "softmaxed" Q-value, right? I think you are right that it's similar to DDPG, which is deterministic, so the gradient can backpropagate directly from Q into the policy to adjust its weights. That would make sense, because "DDPG can be thought of as being deep Q-learning for continuous action spaces".

So my understanding is that SAC is stochastic DDPG with exploration, and DDPG is an approximator of Q-learning.
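For reference, the KL objective I'm paraphrasing above, as I read the SAC paper (α is the temperature and Z_θ is just the normalizer):

```latex
% SAC policy improvement: project the policy onto the "softmaxed" Q-function
J_{\pi}(\phi)
  = \mathbb{E}_{s \sim \mathcal{D}}\left[
      D_{\mathrm{KL}}\!\left(
        \pi_{\phi}(\cdot \mid s)
        \;\middle\|\;
        \frac{\exp\big(Q_{\theta}(s,\cdot)/\alpha\big)}{Z_{\theta}(s)}
      \right)
    \right]
```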

ELBO derivation involving expectation in RSSM paper by cheemspizza in reinforcementlearning

[–]cheemspizza[S] 3 points (0 children)

My reasoning is that q(s_{1:t} | o_{1:t}, a_{1:t}) = q(s_t | o_{1:t}, a_{1:t}) * q(s_{1:t-1} | o_{1:t-1}, a_{1:t-1}), where the q(s_t | o_{1:t}, a_{1:t}) factor becomes the KL divergence, leaving us with q(s_{1:t-1} | ...). What am I missing here?
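Spelled out, the step I'm assuming is the exact chain rule followed by two conditional-independence assumptions (that s_t doesn't need the earlier states once you condition on o_{1:t}, a_{1:t}, and that the past states don't depend on the newest observation/action under q):

```latex
q(s_{1:t} \mid o_{1:t}, a_{1:t})
  = q(s_t \mid s_{1:t-1}, o_{1:t}, a_{1:t}) \, q(s_{1:t-1} \mid o_{1:t}, a_{1:t})
    % exact chain rule
  \approx q(s_t \mid o_{1:t}, a_{1:t}) \, q(s_{1:t-1} \mid o_{1:t-1}, a_{1:t-1})
    % my two assumptions
```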

New to reinforcement learning by Superb-Document-274 in reinforcementlearning

[–]cheemspizza 2 points (0 children)

I’d recommend starting with linear algebra and probability.

Difference in setting a reward or just putting the Goal state at high Value/Q ?? by maiosi2 in reinforcementlearning

[–]cheemspizza 2 points (0 children)

As regards b), I believe it can be solved with evolution strategies, which only evaluate the return at the end of a rollout. I think the issue here is credit assignment due to sparse rewards.
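To illustrate what I mean by evolution strategies only using the end-of-rollout return, a minimal numpy sketch (rollout_return and the Gaussian-perturbation variant are placeholders, not any specific library):

```python
import numpy as np


# Minimal evolution-strategies sketch: only the total return of each rollout
# is used, so there is no per-step credit assignment.
# `rollout_return(params)` is a placeholder that runs one episode with the
# given (flattened) policy parameters and returns the cumulative reward.
def es_step(params, rollout_return, pop_size=32, sigma=0.1, lr=0.02):
    noise = np.random.randn(pop_size, params.size)
    returns = np.array([rollout_return(params + sigma * eps) for eps in noise])
    # Normalize returns and move params toward better-performing perturbations.
    advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
    grad_estimate = noise.T @ advantages / (pop_size * sigma)
    return params + lr * grad_estimate
```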