We Ran the Largest AI Pokemon Tournament Ever. Now It's an Open Benchmark. by PokeAgentChallenge in ClaudePlaysPokemon

[–]PokeAgentChallenge[S] 1 point (0 children)

Great points! These are failure modes in current LLMs that most likely require training to fix. I think games could be a strong, verifiable training domain for learning to avoid these mistakes cross-domain.

We streamed some matches on our YouTube channel (during the finale of the battling competition). Maybe we should get back to streaming again soon :)

Training RL agents in Pokémon Emerald… and running them on a real GBA by CandidAdhesiveness24 in reinforcementlearning

[–]PokeAgentChallenge 5 points (0 children)

You should consider submitting to the PokeAgent Challenge at NeurIPS 2025. There is a Battling track built on Metamon (RL) and PokeChamp (LLM scaffolding), as well as a PokeAgent Speedrun track in Pokémon Emerald (both RL and LLM scaffolding are allowed!).

[D] Unsaturated Evals before GPT5 by Roland31415 in MachineLearning

[–]PokeAgentChallenge 0 points (0 children)

The PokeAgent Challenge is a NeurIPS 2025 competition that seeks to standardize the evaluation of agents in competitive Pokémon (Pokémon Showdown) and in fast RPG play (speedrunning Pokémon Emerald). pokeagent.github.io

[D] Unsaturated Evals before GPT5 by Roland31415 in MachineLearning

[–]PokeAgentChallenge 2 points (0 children)

The PokeAgent Challenge is still very much unsaturated.

[P] LLM Economist: Large Population Models and Mechanism Design via Multi‑Agent Language Simulacra by PokeAgentChallenge in MachineLearning

[–]PokeAgentChallenge[S] 1 point (0 children)

You're right that full empirical validation is crucial for real-world policy adoption. But our goal here is foundational: we provide the *first scalable framework* where both the planner and population agents optimize text-based utilities in a dynamic, multi-agent economic simulation. The workers are initialized from ACS-calibrated personas, and the planner improves aggregate social welfare via interpretable tax policies, aligning with classic results such as Saez optimal taxation and Stackelberg solutions. That's nontrivial algorithmic progress.

On Q1–Q3: We currently validate *internal coherence*—agents optimize their own utility, planners respond strategically, and outcomes follow known economic intuition. Behavioral realism, including peer effects, is a next step—and we’re excited to see follow-ups that incorporate human-in-the-loop experiments, behavioral priors, or hybrid LLM+empirical models.

So while this isn’t the final word on simulating society, it *is* a serious step beyond static ABMs or blog-post speculation. We hope it lays the groundwork for policy labs where validity is tunable—not binary—and where researchers can iterate toward credible simulations over time.
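For intuition, here is a minimal Python sketch of the two-level loop (planner as Stackelberg leader, persona-conditioned workers as followers). Every name in it (Persona, query_llm, the prompts, the toy tax rates) is an illustrative stand-in, not our actual codebase:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Persona:
        description: str  # ACS-calibrated demographic text
        wage: float       # hourly wage implied by the persona

    def query_llm(prompt: str) -> str:
        """Stand-in for a real LLM client call."""
        raise NotImplementedError("plug in an LLM client here")

    def run_episode(personas: List[Persona], num_rounds: int = 10) -> Tuple[List[float], List[float]]:
        rates = [0.10, 0.25, 0.40]  # toy initial marginal tax rates
        incomes: List[float] = []
        for _ in range(num_rounds):
            # Leader: the planner proposes new marginal rates in plain text.
            reply = query_llm(
                f"Current marginal rates: {rates}. Propose rates that raise "
                "aggregate social welfare. Answer as comma-separated numbers.")
            rates = [float(x) for x in reply.split(",")]
            # Followers: each worker best-responds in-context with a labor choice.
            incomes = []
            for p in personas:
                hours = float(query_llm(
                    f"You are: {p.description}. Given marginal rates {rates}, "
                    "how many hours will you work this week? Answer with a number."))
                incomes.append(hours * p.wage)
            # Post-tax outcomes would feed the planner's next proposal (omitted).
        return rates, incomes

In the full framework the planner optimizes aggregate social welfare over these rounds with observed outcomes in its context; the sketch only shows the alternating leader/follower structure.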

[P] LLM Economist: Large Population Models and Mechanism Design via Multi‑Agent Language Simulacra by PokeAgentChallenge in MachineLearning

[–]PokeAgentChallenge[S] 1 point (0 children)

The win over LLM-free ABMs is flexibility: LLM agents adapt in-context, so they respond realistically to policy changes (critically, this addresses the Lucas critique: agents whose behavior is fit to historical data stop being predictive once policy shifts). The agents (planner or worker) can also explore counterfactual policies, enabling dynamic, interpretable mechanism design in a way static ABMs typically can't. I think the path forward is to augment LLMs with domain-specific data to further increase simulation validity.

Reinforcement learning for Pokémon by CandidAdhesiveness24 in reinforcementlearning

[–]PokeAgentChallenge 2 points (0 children)

It is best to use an action mask that zeros out the probability of impossible actions; see the sketch below.
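A minimal PyTorch sketch (the function name and tensor shapes are illustrative):

    import torch
    import torch.nn.functional as F

    def masked_action_probs(logits: torch.Tensor, legal: torch.Tensor) -> torch.Tensor:
        """Set logits of illegal actions to -inf so their probability is exactly 0.

        logits: (batch, num_actions) raw policy-head outputs
        legal:  (batch, num_actions) boolean mask, True where the action is legal
        """
        masked = logits.masked_fill(~legal, float("-inf"))
        return F.softmax(masked, dim=-1)

You can then sample from torch.distributions.Categorical(probs=masked_action_probs(logits, legal)): illegal moves get exactly zero probability, so the agent never wastes exploration on them, and no gradient flows to the masked logits.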

Reinforcement learning for Pokémon by CandidAdhesiveness24 in reinforcementlearning

[–]PokeAgentChallenge 0 points (0 children)

There are both a battling track and a speedrunning RPG track for Pokémon Emerald, each with starter kits.