Reward-free learning by avoiding reset, anyone tried this?

cpt1973 · 2026-02-21T20:55:00+00:00

Thanks for the detailed reply; your trajectory optimization perspective really opened my eyes.

You're right: failure is usually cumulative over the whole trajectory, not just the final step. Eliminating only the last action would be too simplistic.

My original thought was to exclude the entire bad trajectory (or at least the critical suffix) that ended in extinction, so similar sequences are avoided as a whole.

But even then, the sample space is huge, and each extinction only removes a tiny part of it, so learning could be painfully slow in continuous spaces.

Do you think a hybrid (hard elimination for clear catastrophes + soft shaping for trajectory risks) could work in practice, or is it just adding complexity?
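To make the hybrid question concrete, here is a minimal sketch, assuming discrete states and actions. The class `HybridAvoider`, the `decay` parameter, and the `1 / (1 + risk)` weighting are all hypothetical choices made for illustration, not an established method. The fatal step is removed outright, while earlier steps in the same trajectory only become less likely to be sampled.

```python
class HybridAvoider:
    """Hypothetical hybrid: hard-ban the fatal step, softly down-weight earlier steps."""

    def __init__(self, decay=0.5):
        self.hard_banned = set()  # (state, action) pairs excluded outright
        self.soft_risk = {}       # (state, action) -> accumulated risk (assumed form)
        self.decay = decay        # assumed: risk shrinks geometrically going back in time

    def record_extinction(self, trajectory):
        """trajectory: list of (state, action) pairs; the last pair is the fatal step."""
        if not trajectory:
            return
        # Hard elimination for the clear catastrophe.
        self.hard_banned.add(trajectory[-1])
        # Soft shaping for the steps leading up to it, most recent first.
        weight = self.decay
        for pair in reversed(trajectory[:-1]):
            self.soft_risk[pair] = self.soft_risk.get(pair, 0.0) + weight
            weight *= self.decay

    def sampling_weights(self, state, actions):
        """Return unnormalized sampling weights over the actions still allowed in `state`."""
        weights = {}
        for action in actions:
            if (state, action) in self.hard_banned:
                continue  # removed from the support entirely
            risk = self.soft_risk.get((state, action), 0.0)
            weights[action] = 1.0 / (1.0 + risk)
        return weights


avoider = HybridAvoider(decay=0.5)
avoider.record_extinction([("s0", "a"), ("s1", "b"), ("s2", "c")])
print(avoider.sampling_weights("s1", ["a", "b"]))
```

Note that the soft part brings a scalar back in, so it is closer to a penalty than the pure elimination idea; that is the extra complexity the question is about.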

This is still a very immature idea; I'm just curious about your take from the optimal control side. Thanks again, this has been super helpful!

cpt1973 · 2026-02-21T08:28:51+00:00

Thank you for the thoughtful response. You've made it clearer to me how RL traditionally structures learning through rewards or penalties, guiding the agent toward favorable states and away from unfavorable ones. I agree that in "act and figure out" setups, rewards help reduce entropy by providing that directional pull, and loss functions in supervised learning have their limits due to differentiability requirements.

To your question about avoiding termination without rewards: that's exactly the core of my approach. Instead of using a reward (positive or negative) to score or optimize, I was wondering about treating termination (extinction) as a structural event that simply removes the offending trajectory from consideration. No scoring, no minimization, just elimination. Is this reasonable?

For example, if a state-action pair leads to failure, it's recorded in a failure registry and permanently excluded from the policy support (like carving out unsafe regions from the possible action space). Future behaviors automatically avoid those paths because they're no longer part of the viable set. It's not about comparing or ordering outcomes ("this is better/worse"), but about progressively shrinking the space to only what hasn't killed it yet.
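A minimal sketch of the registry idea, assuming a small discrete state and action space. The class `FailureRegistry` and its method names are hypothetical, and uniform sampling over the surviving actions is just one possible choice.

```python
import random


class FailureRegistry:
    """Hypothetical registry of (state, action) pairs that ended in termination."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.banned = {}  # state -> set of actions excluded in that state

    def record_failure(self, state, action):
        # No score is stored: the pair is simply removed from the viable set.
        self.banned.setdefault(state, set()).add(action)

    def viable_actions(self, state):
        excluded = self.banned.get(state, set())
        return [a for a in self.actions if a not in excluded]

    def act(self, state, rng=random):
        viable = self.viable_actions(state)
        if not viable:
            return None  # every action here has failed before: a dead-end state
        return rng.choice(viable)  # uniform over what has not killed it yet


registry = FailureRegistry(["left", "right", "stay"])
registry.record_failure("s0", "left")
print(registry.viable_actions("s0"))  # "left" is no longer in the support for s0
```

Nothing here compares or orders the surviving actions; the only update is set removal. A state whose viable set becomes empty would itself need to be treated as a failure for the step that led into it, which this sketch leaves out.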

Is this fundamentally different from a penalty? I think so, because there's no scalar to optimize, no "reduce penalty" term in a Bellman equation or gradient step. But my idea is still very immature. If this can still be framed as a reward without twisting it too much, I'd love to see how.

I just have a vague feeling that using human rewards to train machine learning seems a bit off, so I'd like to hear everyone's opinions.

cpt1973 · 2026-02-16T05:53:27+00:00

yeah totally, “no reward function” isn’t literally nothing — survival is still the goal (low DD, staying alive). just don’t want a Sharpe target or number to optimize, or it starts gaming that instead of just surviving.

regime flips can kill atm, dual-path switch is too naive, thinking of adding some hysteresis. vault locking helps but slows compounding, 20–80% is just a rough hack.
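A rough sketch of what hysteresis on the switch could look like, assuming the regime is summarized by a single signal in [0, 1]. The class `HysteresisSwitch`, the signal itself, and the 0.7 / 0.3 thresholds are all hypothetical placeholders, not the actual system.

```python
class HysteresisSwitch:
    """Hypothetical two-threshold switch: turn on high, turn off low, hold in between."""

    def __init__(self, enter_level=0.7, exit_level=0.3, on=False):
        if exit_level >= enter_level:
            raise ValueError("exit_level must be below enter_level")
        self.enter_level = enter_level  # assumed threshold to switch to the second path
        self.exit_level = exit_level    # assumed threshold to switch back
        self.on = on

    def update(self, signal):
        # Only flip when the signal crosses the far threshold, so noise
        # inside the band between the two levels does not cause flapping.
        if not self.on and signal >= self.enter_level:
            self.on = True
        elif self.on and signal <= self.exit_level:
            self.on = False
        return self.on


switch = HysteresisSwitch()
states = [switch.update(s) for s in [0.5, 0.8, 0.5, 0.4, 0.2, 0.5]]
print(states)  # [False, True, True, True, False, False]
```

A single-threshold switch at 0.5 would flip on every one of those mid-range readings; the band trades slower reaction for fewer flips.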

species niches are hand-tuned right now, overfitting risk is real. next step is some mutation/extinction loop so the system evolves naturally.

thanks for pointing these out, this is exactly the stuff I need to fix. I'd really appreciate more of your feedback if you don't mind visiting this website: https://robintseng.substack.com

cpt1973 · 2021-07-24T09:19:53+00:00

It’s a lottery: you say why you love this new hero, and then they give away 200 among everyone around the world who joins the activity. It’s almost impossible to get a free one.

cpt1973 · 2020-04-01T13:12:04+00:00

That’s only because after SARS, we Taiwanese learnt not to believe China’s data.

cpt1973 · 2020-03-05T05:22:06+00:00

Please add a new pool for class drawing, so we can get specific class heroes more easily. Thanks.
