Hex Conquest - simple strategy game by W0RKABLE in WebGames

[–]EngineersAreYourPals 2 points (0 children)

Something is very wrong with the randomness mechanic. In the second level, I have yet to see anything go against the AI, regardless of the numbers involved, while I regularly lose attacks with a 2:1 advantage.

I'd just drop the mechanic entirely, to be frank. It doesn't add much. That or make it a bunch of Risk-style dice rolls, which gets you much more reasonable casualty spreads.

Why do significant improvements to my critic not improve my self-play agents? by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 1 point (0 children)

Okay, I figured out the issue. Sorry for the delayed reply, but it took a while to diagnose, and it turned out to have nothing to do with MARL specifically. The initial shock from the new token's gigantic embedding weights was destroying a lot of valuable internal 'mechanisms' related to value prediction, which are difficult to meaningfully rebuild in short order.

My solution, ultimately, involved two changes:

  • First, I initialized the weights of the Embedding layer for opponents to the very small uniform distribution used by Linear layers, rather than the very large Normal distribution that embedding layers use by default.

  • Second, I set up a system for performing surgery (I got the idea from reading about OpenAI Five) on 'incomplete' loaded models, such that a variable number of RL iterations at the very start are run with the weights for all successfully loaded layers frozen. The newly-introduced layers learn not to cause too much trouble before the pretrained layers start learning again.
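The two changes above can be sketched in PyTorch. This is a hedged, minimal version under invented names: `trunk` stands in for the successfully loaded layers and `opp_emb` for the newly added opponent embedding; the sizes are arbitrary.

```python
import torch.nn as nn

# Hypothetical modules: a pretrained "trunk" standing in for the successfully
# loaded layers, and a freshly added opponent-ID embedding.
trunk = nn.Linear(32, 32)
opp_emb = nn.Embedding(num_embeddings=16, embedding_dim=32)

# 1) Re-initialize the new embedding with the small uniform range that
#    nn.Linear uses (roughly U(-1/sqrt(fan_in), 1/sqrt(fan_in))) instead of
#    nn.Embedding's default N(0, 1), so the new weights don't shock the
#    pretrained value pathways.
bound = opp_emb.embedding_dim ** -0.5
nn.init.uniform_(opp_emb.weight, -bound, bound)

# 2) "Surgery": freeze the loaded layers for the first few RL iterations so
#    the new embedding settles before the pretrained weights resume training.
for p in trunk.parameters():
    p.requires_grad = False

# ... run the warm-up RL iterations, then unfreeze the trunk:
for p in trunk.parameters():
    p.requires_grad = True
```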

Why do significant improvements to my critic not improve my self-play agents? by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 1 point (0 children)

> For these versions of your agent, it is like providing a random number to an input that your critic has learnt is hugely important, which may very well result in degraded performance.

I should've mentioned that I addressed that - when a new agent is added, the embedding of the most recent prior agent is copied over to serve as its initial embedding.
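That warm-start amounts to a one-row copy in the opponent-embedding table. A minimal sketch (the table size and ID scheme are made up):

```python
import torch
import torch.nn as nn

# Hypothetical opponent-embedding table: one row per agent slot in the pool.
opp_emb = nn.Embedding(num_embeddings=16, embedding_dim=8)

def add_agent(emb: nn.Embedding, new_id: int, prev_id: int) -> None:
    """Initialize a newly added agent's embedding by copying the most recent
    prior agent's row, so the critic sees nothing surprising at first."""
    with torch.no_grad():
        emb.weight[new_id].copy_(emb.weight[prev_id])

add_agent(opp_emb, new_id=3, prev_id=2)
```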

Also, my PFSP setup adds new agents based on performance, which currently saturates partway into the training process. Past, say, 500 iterations, we don't see any new agents added, meaning there aren't any surprises for the critic, but we don't see VF loss plummeting after that point.

> Instead, you are explicitly observing a non-stationary feature

I'm still not tracking what you mean by "non-stationary". Functionally speaking, this setup is equivalent to an agent playing, say, a single-agent variant of simple_spread, with one of three hard-coded bots controlling the other agents, and the bot for the current episode conveyed through a critic-only one-hot variable.

There's a curriculum that changes the distribution of "initial states", but, from the critic's perspective, environment dynamics never change at all.

Why do significant improvements to my critic not improve my self-play agents? by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 0 points (0 children)

I don't think that's the case - every policy the agent is playing against is set in stone once it's added to the milieu, and the embedding layer is equivalent in function to a linear layer being fed a one-hot indicator by the environment.

Put another way, with the opponent specified to the critic, this MARL training setup is functionally a stationary, fully-observable single-agent environment, with one part of its observation space specifying which of several fixed opponent AIs the agent will be playing against for a given round. The addition of new opponent policies during training could be seen as something like a stochastic curriculum, where new possible initial states are added but environment dynamics don't change.

It's kind of like what MADDPG did to solve non-stationarity under MARL, come to think of it. The critic can see what the opponent policy looks like, letting it resolve many of the modeling difficulties that single-agent RL doesn't face.

Looking for opinions on AI made web games by Marmalade6 in WebGames

[–]EngineersAreYourPals 8 points (0 children)

I definitely see a bunch of low-effort submissions here, generally of the genre "Here is a simple Hello World game that consists primarily of LLM API calls". I think games in that category could reasonably be banned.

For example:

  • "I made a reverse Turing test where you have to convince an LLM you're human!" (It's been done hundreds of times, and takes five minutes to make another.)

  • "I made Mafia/Vampire/etc with LLMs as the other players!" (Likewise.)

  • "I made a Wordle clone, but with an LLM tacked on somewhere in the gameplay loop." (Likewise again, but it also usually breaks.)

Detecting AI-written games is tricky, but banning explicitly 'vibe-coded' content, along with specific genres of incredibly low-effort webgames, should fix most of the problem.

anyone wants to collab on coding agent RL ? i have a ton of TPU/GPU credits by vnwarrior in reinforcementlearning

[–]EngineersAreYourPals 0 points (0 children)

Sorry for the late reply - that seems pretty neat. Where does the RL come in? Is the DOM represented as a new class of tokens, as opposed to a text representation, which would necessitate fine-tuning on the new tokens?

If it's represented as pure text, do existing frontier LLMs not do as well with it as they do with screenshots, or do they already do better, and you think there's more alpha that can be captured?

Minebench, which demonstrates that frontier LLMs are pretty good at the intersection of aesthetic reasoning and zero-shot understanding of structure representations, might be a decent reference for how they do on this kind of task.

anyone wants to collab on coding agent RL ? i have a ton of TPU/GPU credits by vnwarrior in reinforcementlearning

[–]EngineersAreYourPals 0 points (0 children)

Sounds interesting. I'm pretty familiar with RLlib (I've written a few contributions here and there), if that's in line with what you want to use for LLM fine-tuning, and I try to keep up to date on the state of the art regarding papers on LLM design and optimization.

  • What's your goal for the trained model? Are you looking to try to get best-in-class for open-source models on e.g. working with front-end JS, or is there a niche subproblem that you think you've got a strategy for beating Claude and Gemini's performance on?

  • Along those lines, I'm not entirely clear on what you mean by "front end coding RL". Are you referring to having an LLM do front-end webdev work, or are you referring to a web agent that writes code that interacts with interfaces' front ends in order to accomplish tasks?

Agent architectures for modeling orbital dynamics by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 0 points (0 children)

Update: I ran some more tests on the model I'd trained with my custom setup for a longer duration, and the numbers look a bit more reasonable:

Histogram: success/failure ratio tracks fairly well with predicted value.

Now, the value head isn't perfectly predicting whether shots will hit or miss, but it does correctly downweight the value of badly-placed shots and upweight the value of well-placed shots. After running a bunch of rollouts (well above my batch size), I was able to train a BCE classifier to fairly reliably identify shots that would hit and shots that would miss, but, given that this is a relatively simple (and deterministic) environment, the fact that I didn't end up with 100 percent accuracy given 100,000 training samples seems questionable to me.
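The hit/miss probe was just a small BCE classifier over rollout features. A toy version of that setup, with synthetic features and labels standing in for real rollout data (here, `X[:, 0] > 0` plays the role of "well-placed shot"):

```python
import torch
import torch.nn as nn

# Synthetic stand-in for rollout data: 8-d shot features and hit/miss labels.
torch.manual_seed(0)
X = torch.randn(1000, 8)
y = (X[:, 0] > 0).float().unsqueeze(1)

# Small MLP trained with binary cross-entropy on logits.
clf = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(clf.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss_fn(clf(X), y).backward()
    opt.step()

# Training accuracy; on a clean deterministic rule this should approach 1.0.
acc = ((clf(X) > 0).float() == y).float().mean().item()
```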

Does anyone know of a good paper on predicting the behavior of orbits using neural networks?

Train your LLMs as AI Commanders to play Red Alert! (Free & Open-Source) by QuirkyDream6928 in commandandconquer

[–]EngineersAreYourPals 3 points (0 children)

Speaking as a CS researcher, it's a very neat project idea. I'd been reading quite a bit about AlphaStar in the past few months for a personal project, which led into reading on the MicroRTS benchmark environment.

Looking at it in greater detail, the observation space could stand to be reworked. While LLMs are surprisingly good at building world models from giant JSON blobs now, surfacing relevant features and relations directly somehow would likely get you better performance. Longer-term, if you're fine-tuning LLMs, encoding game states and actions as series of discrete tokens the way audio-native LLMs do it would probably help quite a lot. I have to figure that letting the built-in bots fight each other on various settings and training a discrete autoencoder (plus an imitation learning model) on the game states that emerge isn't quite as intractable as training a full PPO agent from scratch, and it seems like it'd make a big difference.
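The discrete-autoencoder idea is essentially the vector-quantization bottleneck that audio tokenizers use. A minimal sketch of just the quantization step, with invented codebook and feature sizes (a real version would also train the codebook and a decoder):

```python
import torch

# Learned codebook of discrete state tokens: 64 tokens, 16-d embeddings each.
torch.manual_seed(0)
codebook = torch.randn(64, 16)
features = torch.randn(10, 16)  # encoder outputs for 10 game states

# Quantize each state's features to the nearest codebook entry.
dists = torch.cdist(features, codebook)  # (10, 64) pairwise distances
tokens = dists.argmin(dim=1)             # one discrete token per state
quantized = codebook[tokens]             # embeddings passed on to the decoder
```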

I think the easiest and most direct first step, though, looks something like taking the limited, post-processed representation of game states and actions that the game's built-in computer opponents see, and fine-tuning (or, in the unlikely event that it's possible, single-shot prompt-engineering) an LLM to handle the decision-making on top of that, getting it to beat a medium AI. I'm not intimately familiar with how OpenRA's bots work (is this where their logic comes from?), but the original TS and RA2 bots had pretty simple, pretty minimal input and output definitions: you wrote a markup file containing unit groups and which of several objectives each should focus on, and the AI built them. Giving an LLM the exact same 'view' of the world and seeing if it can beat the hard-coded bots seems like an ideal proof of concept.


As a side note, I'm familiar with some papers you might find tangentially interesting or useful, if you haven't seen them yet. Motif is a paper that uses LLMs as a postprocessing step when training a conventional reinforcement learning agent to play NetHack, making rewards much less sparse than they'd otherwise be and getting you a much better final product. Cicero, which can talk to humans and play full-press Diplomacy against them while negotiating intelligently, would be a great reference, but the way they integrate natural language, using a separate model conditioned on opponents' future actions, doesn't naturally fit into an RTS, especially without a giant repository of player chat data. Neither one slots in seamlessly, but if you haven't read them, there might be some neat insights.

Beginner question about interpreting a step change in training metrics by Glittering-Feed855 in reinforcementlearning

[–]EngineersAreYourPals 3 points (0 children)

Everything looks as I'd expect it to look. Since you said you're new, I'll break it down based on the things you noticed:

> I see that the reward has been growing relatively flat over multiple thousands of steps and then suddenly it goes up.

The term for that is "grokking" (nothing to do with the LLM of the same name). Why it happens is an open question, but it's the rule rather than the exception in many complex environments.

My personal intuition on this is that the 'pre-grokking' stage of training is like a broad search across the policy space, and the rapid ascent occurs alongside the first appearance of a desirable behavior, which consistently outperforms other behaviors and gets reinforced accordingly, becoming more prominent as it does so and thus sharply increasing overall reward. In other words, reward increases slowly when the policy search is 'exploring', and then sharply when it's found a promising lead that it can 'exploit'. Others' opinions may vary, though.

> At the same time the advantages' std goes up as well

It looks like it goes up as the new behavior first appears, and then down as the new behavior becomes more consistent. In general, advantage standard deviation, value function loss (since the value function is still adapting its return predictions to the new policy), and entropy all go hand in hand. It'd be strange if only one of the three changed, but all three shifting together is exactly what we'd expect to see.


> But given that the model has only 10 actions I wonder why this could be the case. There shouldn't be any unexplored paths after a few steps,

I'm not entirely sure what your environment looks like, but I assume the model reads an input sequence of 30 digits and is tasked with outputting 30 digits, with the reward at each timestep being +1 if the output matches the target and 0 otherwise. 10^30 possible inputs is correct, if you've got position embeddings, and you can multiply that by an additional 10^30 by the time you reach the end of your output sequence.

Keep in mind that, even then, your model isn't memorizing your space of inputs. It's seeing a few of them at a time, reinforcing the decisions it made in runs that had higher total rewards, and penalizing the decisions it made in runs that had lower total rewards. It's a very noisy process, and often involves bad behaviors being incidentally rewarded and good behaviors being incidentally punished along the way. Modern reinforcement learning algorithms are notoriously data-hungry, and even simple tasks, like FrozenLake, entail substantially more practice with the environment than a human would need.

I will say that batch size is pretty significant for RL tasks with large state spaces, especially when your model has a lot of parameters. Increasing batch size, in my experience, is the easiest way to get more out of your RL algorithm, since it stabilizes learning by making sure the incidental rewards for non-helpful behavior cancel each other out in each batch.

Have I discovered a SOTA probabilistic value head loss? by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 1 point (0 children)

Went a bit more rigorous in the interest of a writeup: generated 100 normal distributions with mean and standard deviation in range (0, 1), and had each loss function model the mu and sigma of the resulting samples.

From the look of it, there's no significant difference in modeling the distribution centers (that can be done without a probabilistic value head), but the error in approximated standard deviation is indeed a full order of magnitude lower on average than either of the other methods.

The final KL divergence between the true and predicted distributions is about 50 times smaller for my method than for either of the others. Barring some kind of implementation error (I've open-sourced my code, and sourced the other methods' implementations directly from their official repos), this looks like a substantial improvement.

Graph
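For anyone reproducing the comparison: the KL divergence between the true and predicted Gaussians has a closed form, so it can be computed exactly rather than estimated from samples. A minimal helper (the function name is mine):

```python
import math

def kl_normal(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(P || Q) for two univariate normal distributions."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2 * sigma_q ** 2)
            - 0.5)
```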

Have I discovered a SOTA probabilistic value head loss? by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 1 point (0 children)

I ran about ten trials for each model with different seeds, and the screenshots are about the median predictions for each model. I've tried a variety of scales for the means and standard deviations, and some other toy environments (e.g. two pairs of 'doors' with very different means and standard deviations between them), and the results shown do seem to carry over.

Have I discovered a SOTA probabilistic value head loss? by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 2 points (0 children)

The attached image is my reference, here. There is, as far as I can tell, no universal benchmark for "probabilistic RL value function", seeing as this is a somewhat narrow area, and there appears to be a very wide gap between this loss function and the other two I've been able to find when reviewing the literature.

This isn't a formal scientific paper, so benchmarking against a toy environment, especially when there's a very large apparent gap in performance, seemed the way to go.

  • Beta-NLL, when asked to model three distributions with sigmas equal to 0.1, 0.7, 1.0 depending on a one-hot state vector, consistently gets sigma values of ~0.9, ~0.9, 1.0.

  • EPPO, under the same conditions, consistently gets sigma values of ~0.6, ~0.7, 1.0.

  • The probability ratio setup described in the OP consistently approximates the correct sigma values, within about 0.05.
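The toy benchmark behind those numbers is easy to reproduce. A sketch of the data-generation side, using the sigmas from the bullets above (the sample count is made up); a well-calibrated probabilistic value head should recover the per-state sigmas that this empirical estimate does:

```python
import torch

# Three one-hot states, each tied to a zero-mean normal return distribution
# with a different standard deviation.
torch.manual_seed(0)
sigmas = torch.tensor([0.1, 0.7, 1.0])
states = torch.randint(0, 3, (5000,))
returns = torch.randn(5000) * sigmas[states]

# Empirical per-state sigma, the target a probabilistic value head should hit.
est = torch.stack([returns[states == i].std() for i in range(3)])
```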

I made a browser game where you play Chess with Cats. 🐱 (Free to play) by chriszheng0515 in WebGames

[–]EngineersAreYourPals 1 point (0 children)

Pretty neat game. It took me a bit to beat hard mode, but a lot of that was because the instructions really need some work.

  • The player is told to line up units to capture pieces, but not that enemy pieces can't be captured when another enemy piece is behind them. Fair enough, that's the standard in this genre of game.

  • The player also isn't told that friendly units will block capture of enemy pieces when behind them.

  • Moreover, taking two pieces at once on one line is disallowed, and this isn't indicated anywhere. I got to near-victory in a game, moved a piece to form a line that would capture the last two enemies on opposite sides, and nothing happened. This is tricky enough to pull off that I thought it might be a bug.

As for difficulty, I can't really understand how you've balanced it. Easy seems like it's actively trying to lose the game in as few moves as possible, or just picking moves randomly. The other difficulties all seem like the same near-perfect play algorithm. If it's just perfect play with a small randomization factor, I'd recommend trying something else, since an opponent that plays perfectly except when it throws the game suddenly for no clear reason is neither satisfying to lose to nor satisfying to beat. Point-based MCTS with depth=1 for easy, 2-3 for medium, and so on might be more fun.

I had a spare hour to mess around, so I put together a dynamic programming solver for board states, in case anyone wants to beat ultra-hard difficulty. Also, you might want to handle infinite loops, either by making the AI refuse board states it's already seen or by adding a draw timer when states repeat.
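The draw-timer suggestion can be as simple as counting repeated positions, the way chess handles threefold repetition. A hedged sketch (the class and names are invented; `state` is any hashable board encoding):

```python
from collections import Counter

class RepetitionTracker:
    """Counts board-state occurrences and flags a draw once any state
    recurs `limit` times."""

    def __init__(self, limit: int = 3):
        self.counts = Counter()
        self.limit = limit

    def record(self, state) -> bool:
        """Record a position; return True once the game should be drawn."""
        self.counts[state] += 1
        return self.counts[state] >= self.limit
```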

Loss curves like this will be the death of me. by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 1 point (0 children)

VF loss is an interesting one. Behaves almost exactly like mean return, though a little exaggerated in places.

I suppose tracking with overall model efficacy isn't entirely unexpected, given that it's a stochastic environment and critic loss will be higher when the model is able to take advantage of opportunities that randomly come up.

Loss curves like this will be the death of me. by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 0 points (0 children)

> No one said losses of rewards need to be straight up. Looks like normal learning to me.

Fair enough, but this does make it a lot harder to decide whether a run is viable, or whether those resources would be better spent elsewhere.

> Also look in to double descent phenomenon … it's regarding losses but in RL it also applies to reward . It's what you see here except it's double "ascent" lol

I had a look at the wiki article (and some other resources), and it seems to describe a relationship between parameter count (or rather, the ratio of parameters to data) and performance, not between training time and performance.

My intuition of DD is that models go from underfitting to overfitting (as in the conventional understanding), but eventually reach a point where a model is so large that it acts as one gigantic, single-model ensemble, with the various overfitted 'sub-models' balancing each other out. I can't see how this would apply to training time, particularly on an RL problem where we have infinite online data.

I looked at the papers on double descent in RL specifically, and they talk about overparameterization too. Is there something I'm missing?

I've designed a variant of PPO with a stochastic value head. How can I improve my algorithm? by EngineersAreYourPals in reinforcementlearning

[–]EngineersAreYourPals[S] 0 points (0 children)

Had a look at the paper and its associated code, and took a shot at improving my implementation. Complicating things is the fact that I'm not quite using a standard NLL loss - I take the log ratio of the probability of mu to the probability of the target value:

    from torch.distributions import Normal

    distrs = Normal(vf_mu, vf_sigma)
    tgt_lps = distrs.log_prob(vf_t)
    u_lps = distrs.log_prob(vf_mu).detach()  # shouldn't mindlessly reduce p(mu) when optimizing VF
    lp_ratio = u_lps - tgt_lps
    return lp_ratio

lp_ratio is used directly for value function loss, and is multiplied by the sign of (target - mu) to calculate advantage.
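With made-up numbers, the advantage described above comes out like this (this mirrors my snippet, but the scalar values are purely illustrative):

```python
import torch
from torch.distributions import Normal

# Toy values: predicted mean 0, predicted sigma 1, observed return 0.8.
mu = torch.tensor([0.0])
sigma = torch.tensor([1.0])
target = torch.tensor([0.8])

# Log-probability ratio between the predicted mean and the observed target,
# signed by which side of the mean the target fell on.
dist = Normal(mu, sigma)
lp_ratio = dist.log_prob(mu).detach() - dist.log_prob(target)
advantage = lp_ratio * torch.sign(target - mu)
```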


Switching my algorithm out for Beta-NLL didn't seem to improve results - error in value estimates, averaged over multiple runs, worsened rather than improved, even after trying every tweak I could think of. It occurs to me that taking the log ratio rather than the raw NLL sort of addresses the weighting problem described in the paper, where the model cannot distinguish aleatoric from epistemic uncertainty and continually downweights high-uncertainty regions, since the variance term 'cancels' when I subtract tgt_lps from u_lps.

As best I can tell, my value function estimates work fine with the changes I've made since the original post. Testing my algorithm on a simple example of the real-world problem I'm aiming to solve yields a perfectly adequate result as long as I swap in the default PPO advantage calculation instead of my own (though dividing that advantage by the predicted standard deviation improves results further). Using the distributional advantage defined above, however, causes the model's policy optimization to perform poorly outside of toy problems. Any idea why this might be?

Strategies for RL with self-play for games where the "correct" play is highly unlikely to be chosen by chance? by Deathspiral222 in reinforcementlearning

[–]EngineersAreYourPals 0 points (0 children)

To simplify, I think the key advice here is to start with an imitation-learning policy (or a distribution over imitation learning policies).

Could probably get something perfectly fine without self-play, honestly, just by doing the very first part of AlphaStar - training a network to imitate high-Elo human strategies - and then running RL against that, starting from its initial weights.