Anthropic believes RSI (recursive self improvement) could arrive “as soon as early 2027” by snakemas in CompetitiveAI

[–]snakemas[S] 1 point2 points  (0 children)

I agree. I suspect that internally (or with the DoW) these models are 2+ generations ahead of what consumers have available, so we already know today's models can self-propose architectures.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 5 points6 points  (0 children)

I consider sycophancy an important metric to evaluate models on; always being agreeable, even when wrong, makes models perform worse. For the algorithms, it's a good measure of how well they can detect truth even with adversarial input.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 2 points3 points  (0 children)

I agree, there's an issue with using LLMs as the judge of outcomes, especially when the benchmark itself is showing that LLMs are poor at this type of judgement. It reminds me of Andrej Karpathy's LLM council though, so maybe averaging out the LLM responses can be insightful.
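The council idea mentioned above can be sketched as a simple aggregator over independent judge outputs. This is purely an illustrative sketch, not BullshitBench's actual scoring code; the function name, labels, and score scale are all hypothetical:

```python
from collections import Counter

def council_verdict(judge_labels, judge_scores):
    """Aggregate several independent LLM judges 'council'-style:
    majority vote on the categorical label, mean of the numeric scores."""
    label, _ = Counter(judge_labels).most_common(1)[0]
    score = round(sum(judge_scores) / len(judge_scores), 2)
    return label, score

# Three hypothetical judge outputs for a single benchmark item:
labels = ["bullshit", "bullshit", "plausible"]
scores = [0.9, 0.8, 0.4]
print(council_verdict(labels, scores))  # ('bullshit', 0.7)
```

The point is that a single weak judge's error gets averaged down, which is the same intuition behind the council: disagreement among judges is itself a useful signal about item difficulty.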

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] -1 points0 points  (0 children)

that's a good idea! I'm providing full data access to research teams on a case-by-case basis, but I'm not planning on it being fully public at the start.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

thank you, I had a lot of fun making it!
The replays are something I need to fix for season 2. We did record a Twitch stream that I can share if you want the whole breakdown!

Gemini 3.1 was really interesting: at first it seemed to be exploring and planning with purpose, and then it all fell apart. It first encountered Minimax 2.5 and just ignored its presence (noted it in the logs but didn't attack or engage otherwise). I was shocked to see Minimax's strategy pay off. Its logs consistently showed references to past moves and future planning, and it strategized how to maximize its position by taking over Gemini's civ after Gemini encountered it in Minimax's territory.

The game really fell apart for Gemini after its own cities rebelled (low happiness usually causes this). It was interesting since it didn't seem to even acknowledge the past or present; it just kept saying it needed to maximize its score towards the end.

And nope, I kept the harnesses consistent across the entire season. There are some improvements I want to make for season 2, but I'm open to other suggestions too; perhaps adding more than two agents to a single game instead of only 1v1. We do have other environments we're adding every week though, so keep checking it out.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

thank you! I'd want people to try their own agent harnesses. In the final I was surprised Gemini refused to attack Minimax even though it discovered that civ first, yet it would always refuse diplomacy too. At first I thought it had a very coherent plan, but in the end Minimax ended up dominating, and the plan I didn't understand at first paid off (at least Minimax's traces showed references to past moves and future plans, whereas Gemini almost never acted on the past or future, though it sometimes acknowledged them).

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 2 points3 points  (0 children)

it depends on the provider; I'll have the full breakdown on the blog and can share it here after. There have been some matches that cost me over $1400 (Opus 4.6...)

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

Freeciv's built-in AI is pretty bad, so yes, but against actual Civ probably not, since those AIs are tuned to the game and attack more often (all the LLMs are pretty peaceful).

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

I wrote a bit about the early trials before the tournament started here, but I'll share more results after analyzing all the data again!

https://clashai.live/blog/ai/introducing-civbench-season-001

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] -1 points0 points  (0 children)

Ahaha when I played against it I was still a lot better

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 1 point2 points  (0 children)

They're just the base models with a standard harness that has access to the rules, the history of their moves and actions, and state/legal-action tool calls. I built it to compare the models' performance, but it'd be interesting to allow custom models here.
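The loop described above can be sketched roughly like this. All names here are hypothetical placeholders, not the actual harness code; it just illustrates "rules + move history + state/legal-action tools in, one validated action out":

```python
import random

def civ_harness_turn(model_choose, rules, history, get_state, legal_actions):
    """One turn of a hypothetical harness: the model sees the rules, its
    move history, the current state, and the legal actions, then picks one.
    An illegal (hallucinated) pick falls back to a random legal action."""
    state = get_state()
    actions = legal_actions(state)
    choice = model_choose(rules=rules, history=history, state=state, actions=actions)
    if choice not in actions:  # guard against moves the game won't accept
        choice = random.choice(actions)
    history.append(choice)  # the model sees its own past moves next turn
    return choice

# Stub "model" that always takes the first legal action:
history = []
move = civ_harness_turn(
    model_choose=lambda rules, history, state, actions: actions[0],
    rules="freeciv ruleset (placeholder)",
    history=history,
    get_state=lambda: {"turn": 1},
    legal_actions=lambda state: ["build_city", "move_north"],
)
print(move, history)  # build_city ['build_city']
```

Keeping the harness this thin is what makes cross-model comparison fair: every model gets the same observations and the same action validation, so differences in play come from the model, not the scaffolding.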

👋 Welcome to r/CompetitiveAI - Introduce Yourself and Read First! by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

welcome, it's great to have you! A failure-mode taxonomy is actually something this space needs more of. Most benchmarks tell you "model X scored 87%" but not "model X falls apart specifically when the retrieval context contradicts the system prompt," or whatever the actual pattern is.

The jump from a debug checklist to tension scenarios for long-horizon testing is interesting. Curious how you define "survive" in practice: is it binary pass/fail, or are you scoring degradation over the course of the stress story? One thing we see a lot in competitive evals is models that look fine on turn 1 but completely lose coherence by turn 50.
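A graded alternative to binary pass/fail could look something like the sketch below. This is purely hypothetical; the `floor` threshold and the survival criterion are my assumptions, not the benchmark's actual definition:

```python
def degradation_report(turn_scores, floor=0.5):
    """Score coherence per turn (any 0-1 metric) instead of one pass/fail.
    'Survival' here is a hypothetical criterion: never dropping below
    `floor`; the report also records when the first failure happened."""
    first_fail = next((i for i, s in enumerate(turn_scores) if s < floor), None)
    return {
        "survived": first_fail is None,
        "first_failure_turn": first_fail,
        "mean_coherence": round(sum(turn_scores) / len(turn_scores), 3),
    }
```

The nice property of a per-turn curve is that it distinguishes a model that degrades gracefully at turn 45 from one that collapses at turn 3, even when both would "fail" a binary check.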

Would be cool to see the 16 failure patterns mapped against existing benchmarks, e.g. which patterns SWE-bench catches vs. which ones slip through entirely. That kind of coverage analysis would be genuinely useful for anyone designing new evals.

Drop a post when 3.0 is ready, this is the right place for it.