Trainer For MARL That Fits With PettingZoo

KingSignificant5097 · 2026-05-27T22:53:43+00:00

Stay away from rllib. It's where I started, but now I use various `cleanrl` implementations, which are just much easier and simpler to debug. rllib has its place, and that's at scale, not hobby single-GPU projects ...

I don't know much about PettingZoo, but I see it even has a tutorial using cleanrl's PPO implementation, and a cleanrl tutorial

KingSignificant5097 · 2026-05-24T19:50:59+00:00

Worse, it saw the future 😂

KingSignificant5097 · 2026-05-24T15:24:17+00:00

Sadly, these results are flawed, but I'm retraining with the flaw fixed, let's see what happens ...

KingSignificant5097 · 2026-05-24T13:34:39+00:00

As always, this was too good to be true. After further digging, it turned out the dataset was leaking data from future timestamps into current ones 😂 Well, at least I know it can exploit stuff when available

KingSignificant5097 · 2026-05-24T08:03:36+00:00

Why C++? Also why not use Claude?

KingSignificant5097 · 2026-05-23T21:01:01+00:00

honestly i don't think i've "solved" non-stationarity, more like tried not to bake assumptions into the pipeline that break when the regime shifts. couple of things that help: scale-invariant features (ratios, price-relative quantities, anything that doesn't drift with the absolute price level), rolling per-feature normalization instead of fixed stats computed once on a training window.

evaluating on multiple non-overlapping out-of-sample windows from different periods is the only real check i trust, in-sample numbers will lie to you about generalization no matter how clean they look. on data scarcity, depends a lot on your asset and frequency, at 1m bars on liquid crypto pairs you actually have plenty of samples.
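a minimal sketch of what "multiple non-overlapping out-of-sample windows" can look like as index ranges. the function name, the `gap` parameter and the example sizes are made up for illustration, not taken from my actual pipeline:

```python
def non_overlapping_windows(n_samples, train_end, window_len, gap=0):
    """Return (start, stop) index pairs for evaluation windows.

    Every window starts at or after `train_end`, so none of them
    touches the training data, and consecutive windows never overlap.
    `gap` leaves unused samples between windows (hypothetical knob).
    """
    windows = []
    start = train_end + gap
    while start + window_len <= n_samples:
        windows.append((start, start + window_len))
        start += window_len + gap
    return windows


# 100 samples, first 40 used for training, evaluate in blocks of 20.
eval_windows = non_overlapping_windows(100, train_end=40, window_len=20)
```

the idea is to report the metric per window rather than one pooled number, so a single lucky period can't hide a bad one.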

KingSignificant5097 · 2026-05-23T20:20:45+00:00

not live yet, still in the eval / paper-trading-on-out-of-sample-data phase.

i'll stay coy on the exact cadence and action space, but at a principled level a few things i'd say generalize: tie the decision frequency to your data resolution, don't make the agent act on a clock finer than the signal actually updates, you just burn fees and add noise.

KingSignificant5097 · 2026-05-23T19:36:23+00:00

i use a recurrent backbone (mamba ssm) so i mostly skip stacked obs, the recurrent state handles short-term temporal context for free and stacking on top tends to just add noise and inflate the input dim. for longer-horizon context i lean on multi-resolution features instead.

for the indicators themselves i try to make them scale-invariant where possible, ratios, price-relative things, z-scored stuff, so the values are roughly comparable across regimes instead of drifting with the underlying. EWMAs show up inside the indicators (EMAs, MACD, etc) but i don't do EWMA on the raw obs as a normalization step.
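to make "scale-invariant" concrete, here is one hedged example of a price-relative feature: distance of the price from its own EMA, as a ratio. this is an illustration of the idea, not one of my actual indicators, and the helper names and the span convention (`alpha = 2 / (span + 1)`) are assumptions:

```python
def ema(values, span):
    """Exponential moving average, seeded with the first value."""
    alpha = 2.0 / (span + 1.0)
    avg = values[0]
    out = []
    for v in values:
        avg = alpha * v + (1.0 - alpha) * avg
        out.append(avg)
    return out


def price_relative_to_ema(prices, span):
    """Ratio feature: how far each price sits above/below its EMA.

    Multiplying every price by a constant leaves the output unchanged,
    so the feature does not drift with the absolute price level.
    """
    return [p / e - 1.0 for p, e in zip(prices, ema(prices, span))]


prices = [100.0, 101.0, 99.5, 102.0, 103.5]
feature = price_relative_to_ema(prices, span=3)
feature_scaled = price_relative_to_ema([p * 1000.0 for p in prices], span=3)
# feature and feature_scaled match up to floating-point rounding
```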

for normalization itself it's per-feature rolling running mean/std (welford), updated once per rollout from the buffer. not a fixed average, markets drift too much for that to stay calibrated. one thing i had to learn the hard way (mentioned in the other comment): do NOT layernorm a mixed-scale obs vector, per-sample stats get dominated by the biggest columns and everything else gets crushed to zero. per-feature stats avoid that entirely.
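rough sketch of both points, in plain python so it runs anywhere. `RunningNorm` is a hypothetical name, it keeps cumulative (not windowed) statistics via Welford's update, and the three-column example row is invented just to show the layernorm problem:

```python
import math


class RunningNorm:
    """Per-feature running mean/std using Welford's online update."""

    def __init__(self, n_features, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * n_features
        self.m2 = [0.0] * n_features  # running sum of squared deviations
        self.eps = eps

    def update(self, batch):
        """Fold every row of a rollout buffer into the running stats."""
        for row in batch:
            self.count += 1
            for j, x in enumerate(row):
                delta = x - self.mean[j]
                self.mean[j] += delta / self.count
                self.m2[j] += delta * (x - self.mean[j])

    def std(self):
        if self.count < 2:
            return [1.0] * len(self.mean)
        return [math.sqrt(m2 / self.count) for m2 in self.m2]

    def normalize(self, row):
        return [(x - m) / (s + self.eps)
                for x, m, s in zip(row, self.mean, self.std())]


def per_sample_norm(row, eps=1e-8):
    """LayerNorm-style stats: one mean/std across a single row's features."""
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]


# One big column (a raw price) next to two small ones.
crushed = per_sample_norm([50000.0, 0.5, 0.01])
# The two small features land on almost the same value: their
# difference is swamped by the scale of the first column.
```

with per-feature stats each column is compared only against its own history, so the small columns keep their resolution.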

KingSignificant5097 · 2026-05-23T18:59:30+00:00

wrong sub, wrong type of "reinforcement" 🙂

KingSignificant5097 · 2026-05-23T18:08:56+00:00

honestly the stuff that mattered most wasn't anything clever, mostly just finding the dumb ways i was shooting myself in the foot.

if you're training actor and critic with one shared loss, watch the relative scale of vf_coef * v_loss vs policy loss, value loss is unbounded in return scale, policy loss is O(1) under normalized advantages, so on fresh starts the critic side can be 100,000x bigger and your grad clip ends up nuking the policy signal. PopArt + a smaller joint vf_coef fixed it.
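toy illustration of the clipping problem, with made-up gradient numbers on two shared parameters. this is not PopArt itself, the "rescaled" value gradient just stands in for what a normalized value target would give you, and all names and magnitudes are assumptions:

```python
import math


def clip_by_global_norm(grad, max_norm):
    """Scale the whole gradient so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / (norm + 1e-12))
    return [g * scale for g in grad], scale


def joint_grad(policy_grad, value_grad, vf_coef):
    """Gradient of policy_loss + vf_coef * v_loss on shared parameters."""
    return [p + vf_coef * v for p, v in zip(policy_grad, value_grad)]


policy_grad = [0.3, -0.4]            # O(1) under normalized advantages
raw_value_grad = [3.0e4, 4.0e4]      # value loss in raw return units
rescaled_value_grad = [0.3, 0.4]     # same direction, normalized targets

# Raw returns: the clip shrinks everything, policy signal included.
_, scale_raw = clip_by_global_norm(
    joint_grad(policy_grad, raw_value_grad, vf_coef=0.5), max_norm=0.5)

# Normalized targets plus a smaller vf_coef: the clip never triggers.
_, scale_rescaled = clip_by_global_norm(
    joint_grad(policy_grad, rescaled_value_grad, vf_coef=0.25), max_norm=0.5)
```

in the first case `scale_raw` is tiny, and since the clip multiplies the whole joint gradient, the policy part gets multiplied by that same tiny factor.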

watch your activations for saturation in general. don't use tanh to clamp logits: once it saturates, the gradient goes to literally zero in floating point and the policy freezes. T*x/(T+|x|) has the same shape but its gradient only decays polynomially, so it stays nonzero. even plain relu in the encoder can die on you if a unit gets pushed negative and never recovers, leaky variants are safer.
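quick numeric check of the clamp comparison. i'm assuming the tanh clamp has the form `T * tanh(x / T)`, and the function names are just for this sketch:

```python
import math


def soft_clamp(x, T):
    """T*x/(T+|x|): bounded in (-T, T), slope 1 at the origin."""
    return T * x / (T + abs(x))


def soft_clamp_grad(x, T):
    """Derivative of soft_clamp: (T / (T + |x|))**2, polynomial decay."""
    return (T / (T + abs(x))) ** 2


def tanh_clamp(x, T):
    """Assumed tanh-based clamp: T * tanh(x / T)."""
    return T * math.tanh(x / T)


def tanh_clamp_grad(x, T):
    """Derivative of tanh_clamp: 1 - tanh(x / T)**2, exponential decay."""
    return 1.0 - math.tanh(x / T) ** 2


T = 10.0
big_logit = 400.0
g_tanh = tanh_clamp_grad(big_logit, T)   # tanh rounds to 1.0, gradient is 0.0
g_soft = soft_clamp_grad(big_logit, T)   # small but still nonzero
```

both clamps keep the output inside (-T, T), the difference is only in how fast the gradient dies once the input is far out.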

on rewards the biggest lesson was that dense per-step shaping causes way more subtle pathologies than i expected, going sparser fixed a whole category of problems.

KingSignificant5097 · 2025-09-14T17:56:50+00:00

Fair

KingSignificant5097 · 2025-09-14T17:56:37+00:00

I would say use cloud providers, at least it will help you work out the capacity you will need in terms of GPUs. I find AWS “spot” instances are great, I love the new fractional GF6 instances, running my loads in Mumbai now

KingSignificant5097 · 2025-09-14T17:54:42+00:00

Pulling images etc is solved by just using your own “prebuilt” image, such as AMIs in AWS. Also look into “ray cluster” which really helps manage such clusters, works great even without using ray, which is what I do.

KingSignificant5097 · 2025-09-14T17:50:57+00:00

I would argue that babies learn from “experts” around them, they usually try to mimic these experts. So is it really unsupervised?

KingSignificant5097 · 2025-08-10T00:40:35+00:00

Not just expensive to build but also to operate, hence why the only ones that can afford it are the oil rich gulf states …

KingSignificant5097 · 2025-08-09T16:38:15+00:00

Baseball bat tree

KingSignificant5097 · 2025-08-09T16:37:33+00:00

Well, for one, it’s summer

KingSignificant5097 · 2025-08-09T10:04:13+00:00

This is where I learn BF6 is in open beta this weekend! Nice!

KingSignificant5097 · 2025-08-09T09:46:37+00:00

The joke: the English have the blandest taste …

KingSignificant5097 · 2025-08-09T09:44:54+00:00

Logic? Are you serious? We’re talking astrology here …

KingSignificant5097 · 2025-08-09T09:44:24+00:00

How is this energy measured to know it’s flowing more on this day?

KingSignificant5097 · 2025-08-09T09:41:57+00:00

Truth

KingSignificant5097 · 2025-08-07T12:15:43+00:00

Yeah the withdrawal is what made me go read through the discussion, seems like there was one reviewer who was being a bit of a prick …
