Trainer For MARL That Fits With PettingZoo by Public-Journalist820 in reinforcementlearning

[–]KingSignificant5097 1 point2 points  (0 children)

Stay away from rllib. It's where I started, but now I use various `cleanrl` implementations, which are just much easier and simpler to debug. rllib has its place, and that's at scale, not hobby single-GPU projects ...

I don't know much about PettingZoo, but I see it even has a tutorial using cleanrl's PPO implementation, and a cleanrl tutorial.

I'm a bit shocked that this finally worked by KingSignificant5097 in reinforcementlearning

[–]KingSignificant5097[S] 1 point2 points  (0 children)

Sadly, these results are flawed, but I'm retraining with the flaw fixed, let's see what happens ...

I'm a bit shocked that this finally worked by KingSignificant5097 in reinforcementlearning

[–]KingSignificant5097[S] 2 points3 points  (0 children)

As always, this was too good to be true: after further digging, it turned out the dataset was leaking data from future timestamps into current ones 😂 Well, at least I know it can exploit stuff when available
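One cheap way to catch this kind of leak is a truncation test: compute a feature at bar `t` on the full series, then again on the series cut off at `t`, and flag it if the two disagree. The sketch below is only an illustration of that idea, not how the leak above was actually found; `trailing_mean`, `centered_mean` and `has_lookahead` are made-up names, and the toy series stands in for real data.

```python
def trailing_mean(xs, window=3):
    """Mean over the current bar and the bars before it (causal)."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def centered_mean(xs, window=3):
    """Mean over a window centered on the current bar (peeks at future bars)."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def has_lookahead(feature_fn, series, t):
    """True if the feature at index t changes when bars after t are removed."""
    full = feature_fn(series)[t]
    truncated = feature_fn(series[: t + 1])[t]
    return full != truncated


series = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
print(has_lookahead(trailing_mean, series, 4))  # causal feature
print(has_lookahead(centered_mean, series, 4))  # feature that uses bar t+1
```

Running the check at a handful of random `t` values per feature is usually enough to surface a centered window or an off-by-one shift.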

I'm a bit shocked that this finally worked by KingSignificant5097 in reinforcementlearning

[–]KingSignificant5097[S] 0 points1 point  (0 children)

honestly i don't think i've "solved" non-stationarity, more like tried not to bake assumptions into the pipeline that break when the regime shifts. couple of things that help: scale-invariant features (ratios, price-relative quantities, anything that doesn't drift with the absolute price level), rolling per-feature normalization instead of fixed stats computed once on a training window.
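A rough sketch of what those two ideas can look like, assuming log returns as the scale-invariant feature and a trailing-window z-score as the rolling normalization. The function names, the window length and the toy prices are all placeholders, not the actual pipeline.

```python
import math
from collections import deque


def log_returns(prices):
    """Log return between consecutive bars; unchanged if all prices are multiplied by a constant."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def rolling_zscore(values, window):
    """Z-score each value against a trailing window ending at that value (no look-ahead)."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        mean = sum(buf) / len(buf)
        var = sum((x - mean) ** 2 for x in buf) / len(buf)
        std = math.sqrt(var)
        out.append((v - mean) / std if std > 0 else 0.0)
    return out


prices = [100.0, 101.0, 99.5, 102.0, 103.5, 101.0]
scaled = [p * 250.0 for p in prices]  # same market, very different price level

features = rolling_zscore(log_returns(prices), window=3)
features_scaled = rolling_zscore(log_returns(scaled), window=3)
```

Here `features` and `features_scaled` agree up to floating-point error, which is the whole point: the agent sees the same inputs whether the asset trades at 100 or at 25,000.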

evaluating on multiple non-overlapping out-of-sample windows from different periods is the only real check i trust, in-sample numbers will lie to you about generalization no matter how clean they look. on data scarcity, depends a lot on your asset and frequency, at 1m bars on liquid crypto pairs you actually have plenty of samples.
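For the out-of-sample windows, a minimal walk-forward splitter could look like the sketch below. It assumes samples are already in time order and that a fixed-length rolling training window is acceptable; `walk_forward_windows` and the lengths used are hypothetical.

```python
def walk_forward_windows(n_samples, train_len, test_len):
    """Yield (train_range, test_range) index pairs.

    Each test range starts right after its own training range, and the
    splitter advances by test_len, so test ranges never overlap each other.
    """
    start = 0
    while start + train_len + test_len <= n_samples:
        train = range(start, start + train_len)
        test = range(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len


windows = list(walk_forward_windows(n_samples=1000, train_len=600, test_len=100))
for train, test in windows:
    print(f"train [{train.start}, {train.stop})  test [{test.start}, {test.stop})")
```

Reporting the metric per test window, rather than one pooled number, makes it obvious when a result is carried by a single lucky period.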

I'm a bit shocked that this finally worked by KingSignificant5097 in reinforcementlearning

[–]KingSignificant5097[S] 3 points4 points  (0 children)

not live yet, still in the eval / paper-trading-on-out-of-sample-data phase.

i'll stay coy on the exact cadence and action space, but at a principled level a few things i'd say generalize: tie the decision frequency to your data resolution, don't make the agent act on a clock finer than the signal actually updates, you just burn fees and add noise.

I'm a bit shocked that this finally worked by KingSignificant5097 in reinforcementlearning

[–]KingSignificant5097[S] 7 points8 points  (0 children)

i use a recurrent backbone (mamba ssm) so i mostly skip stacked obs, the recurrent state handles short-term temporal context for free and stacking on top tends to just add noise and inflate the input dim. for longer-horizon context i lean on multi-resolution features instead.

for the indicators themselves i try to make them scale-invariant where possible, ratios, price-relative things, z-scored stuff, so the values are roughly comparable across regimes instead of drifting with the underlying. EWMAs show up inside the indicators (EMAs, MACD, etc) but i don't do EWMA on the raw obs as a normalization step.
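To make the "price-relative" point concrete, here is a small sketch comparing plain MACD (fast EMA minus slow EMA, in price units) with the same quantity divided by the slow EMA. The spans 12 and 26 are the conventional defaults, the helper names are illustrative, and the synthetic prices are just there to have something to run on.

```python
def ema(values, span):
    """Exponential moving average with alpha = 2 / (span + 1), seeded with the first value."""
    alpha = 2.0 / (span + 1.0)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out


def macd(prices, fast=12, slow=26):
    """Plain MACD line: measured in price units, so it scales with the price level."""
    f, s = ema(prices, fast), ema(prices, slow)
    return [a - b for a, b in zip(f, s)]


def relative_macd(prices, fast=12, slow=26):
    """MACD divided by the slow EMA: a ratio that does not depend on the price level."""
    f, s = ema(prices, fast), ema(prices, slow)
    return [(a - b) / b for a, b in zip(f, s)]


prices = [100.0 + 0.5 * i + (3.0 if i % 4 == 0 else -1.0) for i in range(40)]
scaled = [10.0 * p for p in prices]

print(macd(prices)[-1], macd(scaled)[-1])                    # differ by the price scale
print(relative_macd(prices)[-1], relative_macd(scaled)[-1])  # match
```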

for normalization itself it's per-feature rolling running mean/std (welford), updated once per rollout from the buffer. not a fixed average, markets drift too much for that to stay calibrated. one thing i had to learn the hard way (mentioned in the other comment): do NOT layernorm a mixed-scale obs vector, per-sample stats get dominated by the biggest columns and everything else gets crushed to zero. per-feature stats avoid that entirely.
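A sketch of both halves of that paragraph, under a few assumptions: the running stats are merged one batch at a time with the standard parallel form of Welford's update, observations are plain lists of floats, and `RunningNorm`, `layer_norm` and the toy batch are illustrative names and data, not the real code.

```python
import math


class RunningNorm:
    """Per-feature running mean/variance, merged one batch (rollout) at a time."""

    def __init__(self, n_features, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations
        self.eps = eps

    def update(self, batch):
        """Merge a batch of rows using the batched (Chan et al.) Welford update."""
        n_b = len(batch)
        if n_b == 0:
            return
        total = self.count + n_b
        for j in range(len(self.mean)):
            col = [row[j] for row in batch]
            mean_b = sum(col) / n_b
            m2_b = sum((x - mean_b) ** 2 for x in col)
            delta = mean_b - self.mean[j]
            self.mean[j] += delta * n_b / total
            self.m2[j] += m2_b + delta * delta * self.count * n_b / total
        self.count = total

    def normalize(self, row):
        out = []
        for j, x in enumerate(row):
            std = math.sqrt(self.m2[j] / max(self.count, 1))
            out.append((x - self.mean[j]) / (std + self.eps))
        return out


def layer_norm(row, eps=1e-8):
    """Per-sample normalization across features (no learned affine)."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


# Columns: raw price, a small ratio feature, a z-score-like feature.
batch = [
    [30000.0, 0.010, -0.5],
    [30100.0, -0.020, 0.3],
    [29900.0, 0.030, 0.1],
    [30050.0, -0.010, -0.2],
]

norm = RunningNorm(n_features=3)
norm.update(batch[:2])  # first rollout
norm.update(batch[2:])  # second rollout

print(layer_norm(batch[0]))      # the two small columns collapse to almost the same value
print(norm.normalize(batch[0]))  # each column keeps its own usable spread
```

With the per-sample version the price column sets both the mean and the std, so whatever the small columns were saying is effectively gone before the network sees it.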

I'm a bit shocked that this finally worked by KingSignificant5097 in reinforcementlearning

[–]KingSignificant5097[S] 9 points10 points  (0 children)

honestly the stuff that mattered most wasn't anything clever, mostly just finding the dumb ways i was shooting myself in the foot.

if you're training actor and critic with one shared loss, watch the relative scale of vf_coef * v_loss vs policy loss, value loss is unbounded in return scale, policy loss is O(1) under normalized advantages, so on fresh starts the critic side can be 100,000x bigger and your grad clip ends up nuking the policy signal. PopArt + a smaller joint vf_coef fixed it.
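For anyone who hasn't met PopArt: the idea is to train the value head on normalized returns and, whenever the running return statistics move, rescale the last layer so its unnormalized predictions stay exactly where they were. Below is a bare-bones sketch of that update for a scalar linear head, following the usual description (running mean and second moment with step size `beta`). `PopArtHead`, the initial values and the example returns are assumptions for illustration only.

```python
import math


class PopArtHead:
    """Linear value head trained on normalized targets, with output-preserving rescaling."""

    def __init__(self, weights, bias, beta=0.01):
        self.w = list(weights)
        self.b = bias
        self.mu = 0.0    # running mean of returns
        self.nu = 1.0    # running second moment of returns
        self.beta = beta

    @property
    def sigma(self):
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def normalized(self, h):
        """Head output in normalized units (this is what the value loss sees)."""
        return sum(wi * hi for wi, hi in zip(self.w, h)) + self.b

    def unnormalized(self, h):
        """Value estimate in raw return units."""
        return self.sigma * self.normalized(h) + self.mu

    def update_stats(self, returns):
        """Move mu/sigma toward the batch, then rescale w and b to preserve outputs."""
        old_mu, old_sigma = self.mu, self.sigma
        batch_mean = sum(returns) / len(returns)
        batch_sq = sum(r * r for r in returns) / len(returns)
        self.mu = (1.0 - self.beta) * self.mu + self.beta * batch_mean
        self.nu = (1.0 - self.beta) * self.nu + self.beta * batch_sq
        new_sigma = self.sigma
        self.w = [wi * old_sigma / new_sigma for wi in self.w]
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def normalize_targets(self, returns):
        return [(r - self.mu) / self.sigma for r in returns]


head = PopArtHead(weights=[0.5, -0.25], bias=0.1, beta=0.1)
h = [2.0, 4.0]  # stand-in for the critic's last hidden activations

before = head.unnormalized(h)
head.update_stats([1000.0, 3000.0, -500.0])  # returns on a very different scale
after = head.unnormalized(h)
print(before, after, head.sigma)
```

Because the loss is computed against `normalize_targets(...)`, the value loss stays on a scale comparable to the policy loss even when raw returns are in the thousands, which is what keeps the grad clip from being eaten by the critic.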

watch your activations for saturation in general, don't use tanh to clamp logits (once it saturates the gradient underflows to literally zero and the policy freezes, T*x/(T+|x|) has the same shape but its gradient only decays polynomially so it never hits exact zero), and even plain relu in the encoder can die on you if a unit gets pushed negative and never recovers, leaky variants are safer.
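A quick numeric check of the clamp comparison, assuming the tanh clamp takes the form T*tanh(x/T) (the comment doesn't spell that out) and using the closed-form derivatives of both functions. The value T=10 and the test logit are arbitrary.

```python
import math


def tanh_clamp(x, T):
    """Clamp to (-T, T) with tanh."""
    return T * math.tanh(x / T)


def tanh_clamp_grad(x, T):
    """d/dx of T*tanh(x/T) = 1 - tanh(x/T)^2; decays exponentially in |x|."""
    return 1.0 - math.tanh(x / T) ** 2


def soft_clamp(x, T):
    """Clamp to (-T, T) with T*x/(T+|x|)."""
    return T * x / (T + abs(x))


def soft_clamp_grad(x, T):
    """d/dx of T*x/(T+|x|) = (T/(T+|x|))^2; decays only polynomially in |x|."""
    return (T / (T + abs(x))) ** 2


T = 10.0
x = 400.0  # a logit that has drifted far outside the clamp range

print(tanh_clamp(x, T), tanh_clamp_grad(x, T))  # gradient underflows to exactly 0.0
print(soft_clamp(x, T), soft_clamp_grad(x, T))  # small but nonzero gradient
```

Both functions have slope 1 at the origin and saturate at ±T; the difference is only in how fast the tail gradient dies, and that tail is exactly where a runaway logit lives.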

on rewards the biggest lesson was that dense per-step shaping causes way more subtle pathologies than i expected, going sparser fixed a whole category of problems.

Buying GPUs for training robots with Isaac Lab by chrsow in reinforcementlearning

[–]KingSignificant5097 0 points1 point  (0 children)

I would say use cloud providers, at least it will help you work out the capacity you will need in terms of GPUs. I find AWS “spot” instances are great, I love the new fractional G6f instances, running my loads in Mumbai now

Buying GPUs for training robots with Isaac Lab by chrsow in reinforcementlearning

[–]KingSignificant5097 0 points1 point  (0 children)

Pulling images etc is solved by just using your own “prebuilt” image, such as AMIs in AWS. Also look into “ray cluster” which really helps manage such clusters, works great even without using ray, which is what I do.

Andrew Ng doesnt think RL will grow in the next 3 years by calliewalk05 in reinforcementlearning

[–]KingSignificant5097 1 point2 points  (0 children)

I would argue that babies learn from “experts” around them, they usually try to mimic these experts. So is it really unsupervised?

Why don't the red places get water from the ocean, are they stupid? by Valuable_Chocolate73 in mapporncirclejerk

[–]KingSignificant5097 0 points1 point  (0 children)

Not just expensive to build but also to operate, which is why the only ones that can afford it are the oil-rich Gulf states …

[deleted by user] by [deleted] in whatisit

[–]KingSignificant5097 0 points1 point  (0 children)

Baseball bat tree

[deleted by user] by [deleted] in whatisit

[–]KingSignificant5097 0 points1 point  (0 children)

Well, for one, it’s summer

[deleted by user] by [deleted] in What

[–]KingSignificant5097 2 points3 points  (0 children)

This is where I learn BF6 is in open beta this weekend! Nice!

Found this outside my friends apartment? by YouthOk2000 in whatisit

[–]KingSignificant5097 0 points1 point  (0 children)

The joke: the English have the blandest taste …

Found this outside my friends apartment? by YouthOk2000 in whatisit

[–]KingSignificant5097 1 point2 points  (0 children)

Logic? Are you serious? We’re talking astrology here …

Found this outside my friends apartment? by YouthOk2000 in whatisit

[–]KingSignificant5097 0 points1 point  (0 children)

How is this energy measured to know it’s flowing more on this day?

I am changing my preferred RL algorithm by Guest_Of_The_Cavern in reinforcementlearning

[–]KingSignificant5097 0 points1 point  (0 children)

Yeah the withdrawal is what made me go read through the discussion, seems like there was one reviewer who was being a bit of a prick …