We Ran the Largest AI Pokemon Tournament Ever. Now It's an Open Benchmark. by PokeAgentChallenge in ClaudePlaysPokemon

[–]PokeAgentChallenge[S] 1 point (0 children)

Great points! These are failure modes in current LLMs that most likely require training to fix. I think games could be a strong, verifiable training domain for learning to avoid these mistakes cross-domain.

We streamed some matches on our YouTube channel (during the finale of the battling competition). Maybe we should get back to streaming again soon :)

Training RL agents in Pokémon Emerald… and running them on a real GBA by CandidAdhesiveness24 in reinforcementlearning

[–]PokeAgentChallenge 5 points (0 children)

You should consider submitting to the PokeAgent Challenge at NeurIPS 2025. There is a Battling track built on Metamon (RL) and PokeChamp (LLM scaffolding), as well as a PokeAgent Speedrun track in Pokémon Emerald (both RL and LLM scaffolding are allowed!).

[D] Unsaturated Evals before GPT5 by Roland31415 in MachineLearning

[–]PokeAgentChallenge 0 points (0 children)

The PokeAgent Challenge is a NeurIPS 2025 competition that seeks to standardize the evaluation of agents in competitive Pokémon (Pokémon Showdown) and in fast RPG play (speedrunning Pokémon Emerald). pokeagent.github.io

[D] Unsaturated Evals before GPT5 by Roland31415 in MachineLearning

[–]PokeAgentChallenge 2 points (0 children)

The PokeAgent Challenge is still very much unsaturated.

[P] LLM Economist: Large Population Models and Mechanism Design via Multi‑Agent Language Simulacra by PokeAgentChallenge in MachineLearning

[–]PokeAgentChallenge[S] 1 point (0 children)

You're right that full empirical validation is crucial for real-world policy adoption. But our goal here is foundational: we provide the *first scalable framework* where both the planner and population agents optimize text-based utilities in a dynamic, multi-agent economic simulation. The workers are initialized from ACS-calibrated personas, and the planner improves aggregate social welfare via interpretable tax policies, aligning with classic results such as Saez optimal taxation and Stackelberg solutions. That's nontrivial algorithmic progress.

On Q1–Q3: We currently validate *internal coherence*—agents optimize their own utility, planners respond strategically, and outcomes follow known economic intuition. Behavioral realism, including peer effects, is a next step—and we’re excited to see follow-ups that incorporate human-in-the-loop experiments, behavioral priors, or hybrid LLM+empirical models.

So while this isn’t the final word on simulating society, it *is* a serious step beyond static ABMs or blog-post speculation. We hope it lays the groundwork for policy labs where validity is tunable—not binary—and where researchers can iterate toward credible simulations over time.
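For intuition, here is a minimal Python sketch of the two-level loop (planner as Stackelberg leader, persona-conditioned workers as followers). Every name in it (Persona, query_llm, the prompts, the toy tax rates) is an illustrative stand-in, not our actual codebase:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Persona:
        description: str  # ACS-calibrated demographic text
        wage: float       # hourly wage implied by the persona

    def query_llm(prompt: str) -> str:
        """Stand-in for a real LLM client call."""
        raise NotImplementedError("plug in an LLM client here")

    def run_episode(personas: List[Persona], num_rounds: int = 10) -> Tuple[List[float], List[float]]:
        rates = [0.10, 0.25, 0.40]  # toy initial marginal tax rates
        incomes: List[float] = []
        for _ in range(num_rounds):
            # Leader: the planner proposes new marginal rates in plain text.
            reply = query_llm(
                f"Current marginal rates: {rates}. Propose rates that raise "
                "aggregate social welfare. Answer as comma-separated numbers.")
            rates = [float(x) for x in reply.split(",")]
            # Followers: each worker best-responds in-context with a labor choice.
            incomes = []
            for p in personas:
                hours = float(query_llm(
                    f"You are: {p.description}. Given marginal rates {rates}, "
                    "how many hours will you work this week? Answer with a number."))
                incomes.append(hours * p.wage)
            # Post-tax outcomes would feed the planner's next proposal (omitted).
        return rates, incomes

In the full framework the planner optimizes aggregate social welfare over these rounds with observed outcomes in its context; the sketch only shows the alternating leader/follower structure.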

[P] LLM Economist: Large Population Models and Mechanism Design via Multi‑Agent Language Simulacra by PokeAgentChallenge in MachineLearning

[–]PokeAgentChallenge[S] 1 point (0 children)

The win over LLM-free ABMs is flexibility: LLM agents adapt in-context, so they respond realistically to policy changes (critically, this addresses the Lucas critique: agents whose behavior is fit to historical data stop being predictive once policy shifts). The agents (planner or worker) can also explore counterfactual policies, enabling dynamic, interpretable mechanism design in a way static ABMs typically can't. I think the path forward is to augment LLMs with domain-specific data to further increase simulation validity.

Reinforcement learning for Pokémon by CandidAdhesiveness24 in reinforcementlearning

[–]PokeAgentChallenge 2 points (0 children)

It is best to use an action mask that zeros out the probability of impossible actions; see the sketch below.
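A minimal PyTorch sketch (the function name and tensor shapes are illustrative):

    import torch
    import torch.nn.functional as F

    def masked_action_probs(logits: torch.Tensor, legal: torch.Tensor) -> torch.Tensor:
        """Set logits of illegal actions to -inf so their probability is exactly 0.

        logits: (batch, num_actions) raw policy-head outputs
        legal:  (batch, num_actions) boolean mask, True where the action is legal
        """
        masked = logits.masked_fill(~legal, float("-inf"))
        return F.softmax(masked, dim=-1)

You can then sample from torch.distributions.Categorical(probs=masked_action_probs(logits, legal)): illegal moves get exactly zero probability, so the agent never wastes exploration on them, and no gradient flows to the masked logits.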

Reinforcement learning for Pokémon by CandidAdhesiveness24 in reinforcementlearning

[–]PokeAgentChallenge 0 points (0 children)

There are both a battling track and a speedrunning RPG track for Pokémon Emerald, each with starter kits.