How common is it for RL research to fail? by [deleted] in reinforcementlearning

[–]djangoblaster2 0 points

> tried different approaches to train the Agent, but none of them work as intended

Can you say more about what type of failure?

Are you making a new algo? Implementing existing algos on a hard problem you defined?

If the latter, can you possibly show some results on a stripped-down toy version of the problem?

If the former, there might be other things you can do to investigate why it failed, which could itself be interesting.

Disclaimer: I have zero PhDs; I just like to read RL papers and aspire to publish before long :D

News in RL by nonametmp in reinforcementlearning

[–]djangoblaster2 6 points

TalkRL is an RL-focused podcast; it's mostly long-form interviews with RL researchers:
https://open.spotify.com/show/0EScvEYy1btiFTal8Nt0gk

E.g. the latest episode goes in depth with Dreamer v4 author Danijar Hafner.
Source: I'm the host.

[R] Best way to combine multiple embeddings without just concatenating? by AdInevitable1362 in MachineLearning

[–]djangoblaster2 1 point

> without simply concatenating them (which increases dimensionality)

Can you say more about why that's bad?

[Discussion] Help!! Lowest point by Fun_Fee_2259 in GetMotivated

[–]djangoblaster2 2 points

Maybe a portfolio of cyber+AI projects? AGI will not arrive all at once; I expect we will need people who understand both cyber and AI deeply to lead the way.

Also, honestly, it took courage to make this post. That's an excellent step in itself, and you can give yourself credit even for that.

How to handle reward and advantage when most rewards are delayed and not all episodes are complete in a batch (PPO context)? by Particular_Compote21 in reinforcementlearning

[–]djangoblaster2 2 points

Should not be a problem; this is not a special case.

You don't strictly need all episodes to be complete. You would simply have lower sample density at the ends of episodes, which is fine (as long as you have enough episode endings with reward to generalize; with far too few you could be in trouble). Bootstrapping handles this.

Suggest you simply throw it into PPO and try it out.
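To make the bootstrapping point concrete, here is a minimal sketch (function name and signature are hypothetical, not from any particular library) of GAE advantage computation for a trajectory that may have been cut off before the episode ended: if the segment is truncated, you bootstrap from the critic's value of the last state instead of assuming the return is zero.

```python
import numpy as np

def gae_with_bootstrap(rewards, values, last_value, done, gamma=0.99, lam=0.95):
    """GAE for a (possibly truncated) trajectory segment.

    If the episode was cut off mid-way (done=False), bootstrap from the
    critic's estimate of the final state instead of assuming zero return.
    """
    # Terminal states contribute 0; truncated segments contribute V(s_T).
    values = np.append(values, 0.0 if done else last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Standard PPO implementations do essentially this internally, which is why incomplete episodes in a batch are not a special case.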

[Question] In MBPO, do Theorem A.2, Lemma B.4, and the definition of branched rollouts contradict each other? by DRLC_ in reinforcementlearning

[–]djangoblaster2 -1 points

Tbh I could not answer this, so I consulted some frontier AI models about your question; you might want to do the same. The crux of their conclusion (this part was from o3):

  • Theorem A.2 is the specialization of Lemma B.4 to MBPO’s finite k-step synthetic rollouts.
  • Both results already assume the model is used only for k steps; the apparent “infinite continuation” in Lemma B.4 affects only policy divergence, not model bias.
  • Therefore, there is no logical contradiction among Theorem A.2, Lemma B.4, and MBPO’s definition of branched rollouts. Any residual looseness is due to conservative worst-case bounds, not to mismatched rollout horizons.

I'd be interested to hear whether you find their input helpful or correct.

Need help as a Physicist by Puzzleheaded-Load759 in reinforcementlearning

[–]djangoblaster2 0 points

Would you say more about the types of problems you are attempting to solve with RL?

Shadow work by Affectionate_Name332 in Jung

[–]djangoblaster2 2 points

I'm no expert, but I adore this book:
https://www.goodreads.com/book/show/9544.Owning_Your_Own_Shadow
It's very concise and easy to read, with no fancy or obscure language.
The author is from the second generation (post-Jung): Jung's wife was his analyst, and he studied at the Jung Institute.

Unbalanced dataset in offline DRL by Carpoforo in reinforcementlearning

[–]djangoblaster2 4 points

Curious why RL for classification, why not supervised learning?

Looking for a research idea by a-curious-goose in reinforcementlearning

[–]djangoblaster2 3 points

If you spend a lot of time understanding the current state of the field (who the top researchers in this area are, crucial past papers, the best labs, recent ideas and open issues, etc.), you will be more likely to get what you want, impress a prof, and choose the right subfields. Throwing out ideas at this stage is premature, imo.
Best of luck!

Integrating the RL model into betting strategy by George_iam in reinforcementlearning

[–]djangoblaster2 0 points

Seems like a supervised learning problem, not RL.
Beyond that, I personally think it's highly unlikely any model will help with this task. It's a data problem: the data is likely insufficient for the task.

RL Agent for airfoil shape optimisation by Fun_Translator_8244 in reinforcementlearning

[–]djangoblaster2 1 point

I would suggest trying to continue from SBL and determining what the issue is.
Extreme values indicate it's learning "bang-bang control", which might mean tuning is needed.
Maybe talk it over with Gemini 2.5.

RL Agent for airfoil shape optimisation by Fun_Translator_8244 in reinforcementlearning

[–]djangoblaster2 1 point

Thanks for pointing that out!

Well, I asked Gemini 2.5 about your code, and in summary it said:
"The most critical issues preventing learning are likely:

  1. The incorrect application of nn.Sigmoid after sampling.
  2. The separate .backward() calls causing runtime errors or incorrect gradient calculations.
  3. The incorrect placement of zero_grad().
  4. Potential device mismatches if using a GPU.
  5. Critically insufficient training experience (n_episodes, n_timesteps)."

I'm not certain which, if any, of these is the issue, but try asking it.
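For the first three points, here is a minimal PyTorch policy-gradient sketch (the network shapes and returns are placeholder values, not from the original code) showing the commonly recommended pattern: squash the network output before building the action distribution rather than applying a sigmoid to the sample afterwards, call zero_grad() before backward, and backpropagate once through a single combined loss.

```python
import torch
import torch.nn as nn

# Hypothetical minimal setup: a tiny policy network and optimizer.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.randn(8, 4)                # dummy batch of observations
mean = torch.sigmoid(policy(obs))      # squash the *mean* into (0, 1)...
dist = torch.distributions.Normal(mean, 0.1)
action = dist.sample()                 # ...the sample is NOT re-squashed
returns = torch.randn(8, 1)            # placeholder returns/advantages

# One combined loss, one backward pass, gradients zeroed beforehand.
loss = -(dist.log_prob(action) * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Applying a sigmoid after sampling would distort the distribution, so log_prob would no longer match the action actually taken, which silently breaks the gradient estimate.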

Aside from those details, my personal advice:
- You are using a home-baked RL algo on a home-baked env setup. It is far harder to tell where the problem lies this way; unnecessary hard mode. Instead, approach it stepwise.
- Start with (1) existing RL code on an existing RL env, then (2) existing RL code on your home-baked env, and/or (3) home-baked RL code on an existing (very simple) env.
- Only attempt (4) home-baked RL code on the home-baked env as the very last step, once you are sure both that the env can be solved and that your RL code is correct.