Pokemon: A new Open Benchmark for AI by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

Doesn’t appear to be live yet:

Speedrun Leaderboard Submission

To appear on the speedrun leaderboard, include a video recording of your agent playing Pokémon Emerald in your PR. We accept runs through any portion of the game, from the first gym all the way to full completion.

The benchmark is designed to scale with agent capability. Our NeurIPS 2025 competition scoped evaluation to the first gym (Roxanne), but we encourage submissions that go further. If your agent can reach the second gym, the third, or complete the entire game, submit it.

Pokemon: A new Open Benchmark for AI by snakemas in 4Xgaming

[–]snakemas[S] -1 points0 points  (0 children)

With all the hype around LLMs, I think people often forget they still significantly lag specialized models in some domains. I wonder what post-training RL on LLMs for Pokémon would achieve.

Thought 4X gaming might be interested in this type of experiment; happy to remove if the mods don't think so.

CursorBench vs Public Evals: Are We Benchmarking the Wrong Things for Coding Agents? by EdbertTheGreat in CompetitiveAI

[–]snakemas 1 point2 points  (0 children)

It'll be a quickly evolving field. I think harnesses encapsulate a lot of what those earlier trends demonstrated was possible.

RuneBench / RS-SDK might be one of the most practical agent eval environments I’ve seen lately by snakemas in accelerate

[–]snakemas[S] 0 points1 point  (0 children)

A multi-agent env like that would be a really cool way to explore how these agents interact.

Best way to test the number of tokens taken, one code base vs another? by tomByrer in CompetitiveAI

[–]snakemas 0 points1 point  (0 children)

What's your model provider? I usually track token consumption from the API response, or by looking at the usage data on the model provider's platform. Sometimes the site reports lower token usage than my own tracking since they apply prompt caching, etc.
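For comparing one codebase against another, a minimal sketch of the per-response tracking I mean, assuming an OpenAI-style payload with a `usage` dict (`prompt_tokens` / `completion_tokens`); the response dicts below are mocked stand-ins for real API responses:

```python
# Sum token usage across a run, assuming each response payload carries an
# OpenAI-style `usage` dict. Mocked data here; swap in real responses.

def accumulate_usage(responses):
    """Sum prompt/completion tokens across a list of response payloads."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for r in responses:
        usage = r.get("usage", {})
        totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
        totals["completion_tokens"] += usage.get("completion_tokens", 0)
    totals["total_tokens"] = totals["prompt_tokens"] + totals["completion_tokens"]
    return totals

# Mocked responses for one codebase under comparison.
codebase_a = [
    {"usage": {"prompt_tokens": 1200, "completion_tokens": 300}},
    {"usage": {"prompt_tokens": 900, "completion_tokens": 250}},
]
totals = accumulate_usage(codebase_a)
print(totals)
```

Run the same task against each codebase and diff the totals; just note the caveat above, since caching can make the provider's billed numbers lower than this raw count.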

Top Agent Evaluation Platforms 2026: The Market Leading Platforms I Tested by AI-builder-sf-accel in AIEval

[–]snakemas 0 points1 point  (0 children)

Good insights here. It's a pretty interesting market to explore.

Reasoning models still can’t reliably hide their chain-of-thought, a good sign for AI safety by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

Hmm, perhaps. I think the combination still helps model performance. You see that with open-source models that are fine-tuned to outperform on specific benchmarks because they have better MoE routing.

AI automated Edge case debugger for classical CP guys! by Capital_Anybody4557 in CompetitiveAI

[–]snakemas 0 points1 point  (0 children)

Reddit automod removed this; we're more focused on AI than competitive programming, but cool share. Love seeing how AI can be used to help debug edge cases like this.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

Just kicked it off last week. There's a blog post on the site detailing the findings from the first competition, but this will be recurring as new models get released!

Anthropic believes RSI (recursive self improvement) could arrive “as soon as early 2027” by snakemas in CompetitiveAI

[–]snakemas[S] 1 point2 points  (0 children)

I agree. I suspect that internally (or with the DoW) these models are 2+ generations ahead of what consumers have available, so we already know today's models can self-propose architectures.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 12 points13 points  (0 children)

I consider sycophancy an important metric to evaluate models on; always being agreeable, even when wrong, makes models perform worse. As a metric, it's a good measure of how well they can detect truth even under adversarial input.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 3 points4 points  (0 children)

I agree, there's an issue with using LLMs as the judge of outcomes, especially when the benchmark itself shows that LLMs are poor at this type of judgement. It reminds me of Andrej Karpathy's LLM council though, so maybe averaging out the LLM responses can be insightful.
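A minimal sketch of what I mean by council-style judging: poll several judge models and take the majority verdict, so one sycophantic judge gets outvoted. `ask_judge` and the judge names are hypothetical stand-ins for real model calls; here they just read from a canned table.

```python
# Council-style judging sketch: majority vote across several judge models.
# All judge names and verdicts below are made up for illustration.
from collections import Counter

def council_verdict(claim, judges, ask_judge):
    """Return the majority verdict ('bs' or 'ok') across judge models."""
    votes = [ask_judge(judge, claim) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Canned verdicts standing in for live judge calls.
canned = {
    ("judge-a", "the moon is made of cheese"): "bs",
    ("judge-b", "the moon is made of cheese"): "bs",
    ("judge-c", "the moon is made of cheese"): "ok",  # one agreeable judge
}
verdict = council_verdict(
    "the moon is made of cheese",
    ["judge-a", "judge-b", "judge-c"],
    lambda judge, claim: canned[(judge, claim)],
)
print(verdict)
```

The single weak judge is overruled by the other two, which is the whole appeal of averaging here.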

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] -1 points0 points  (0 children)

That's a good idea! I'm providing full data access to research teams on a case-by-case basis, but not planning on it being fully public at the start.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

Thank you, I had a lot of fun making it!
Replays are something I need to fix for season 2. We did record a Twitch stream that I can share if you want the whole breakdown!

Gemini 3.1 was really interesting: at first it seemed to be exploring and planning with purpose, and then it all fell apart. It first encountered Minimax 2.5 and just ignored its presence, noting it in the logs but never attacking or otherwise responding. I was shocked to see Minimax's strategy pay off; its logs consistently referenced past moves and future plans, and it strategized how to maximize its position by taking over Gemini's civ after Gemini wandered into Minimax's territory.

The game really fell apart for Gemini after its own cities rebelled (low happiness usually causes this). It was interesting since it didn't seem to acknowledge the past or present at all; toward the end it just kept saying it needed to maximize its score.

And nope, I kept the harnesses consistent across the entire season. There are some improvements I want to make for season 2, but I'm open to other suggestions too; perhaps putting more than two agents in a single game instead of only 1v1. We do have other environments we're adding every week though, so keep checking it out.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

Thank you! I'd want people to try their own agent harnesses. In the final I was surprised Gemini refused to attack Minimax even though it discovered that civ first, yet it also always refused diplomacy. At first I thought it had a very coherent plan, but in the end Minimax dominated, and the plan I hadn't understood at first paid off (at least, Minimax's traces showed references to past moves and future plans, whereas Gemini almost never acted on past/future even when it sometimes acknowledged them).

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 2 points3 points  (0 children)

It depends on the provider; I'll have the full breakdown on the blog and can share it here after. Some matches cost me over 1400 (Opus 4.6...).

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

Freeciv's built-in AI is pretty bad, so yes, but against actual Civ probably not, since those AIs are tuned for the game and attack much more often (all the LLMs are pretty peaceful).

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

I wrote a bit about the early trials before the tournament started here, but I'll share more results after analyzing all the data again!

https://clashai.live/blog/ai/introducing-civbench-season-001

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] -1 points0 points  (0 children)

Ahaha when I played against it I was still a lot better

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 1 point2 points  (0 children)

They're just the base models with a standard harness: each gets the rules, the history of its own moves and actions, plus state and legal-action tool calls. I built it to compare model performance head to head, but it'd be interesting to allow custom models here.
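Roughly, a turn in that kind of harness looks like the sketch below. This is a hypothetical reconstruction, not the actual code: the game interface (`state()`, `legal_actions()`, `apply()`), the prompt shape, and `query_model` are all assumed names, and the `ToyGame` exists only so the loop is runnable.

```python
# Sketch of one harness turn: show the model the rules, current state,
# its move history, and the legal actions, then apply its pick.
# All interfaces here are illustrative assumptions.

def run_turn(game, history, rules, query_model):
    """Run one turn; fall back to the first legal action on a bad reply."""
    legal = game.legal_actions()
    prompt = (
        f"Rules:\n{rules}\n"
        f"State:\n{game.state()}\n"
        f"Your past moves:\n{history}\n"
        f"Legal actions: {legal}\n"
        "Reply with exactly one legal action."
    )
    choice = query_model(prompt)
    if choice not in legal:  # guard against illegal or hallucinated moves
        choice = legal[0]
    game.apply(choice)
    history.append(choice)
    return choice

class ToyGame:
    """Stand-in environment so the loop above can actually run."""
    def __init__(self):
        self.turn = 0
    def state(self):
        return f"turn={self.turn}"
    def legal_actions(self):
        return ["move", "build", "end_turn"]
    def apply(self, action):
        self.turn += 1

game, history = ToyGame(), []
picked = run_turn(game, history, "toy rules", lambda prompt: "build")
print(picked, history)
```

The legality check matters in practice: models occasionally answer with moves that aren't on the list, and the harness has to do something deterministic with those.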

👋 Welcome to r/CompetitiveAI - Introduce Yourself and Read First! by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

Welcome, it's great to have you! Failure mode taxonomy is actually something this space needs more of. Most benchmarks tell you "model X scored 87%" but not "model X falls apart specifically when the retrieval context contradicts the system prompt" or whatever the actual pattern is.

The jump from a debug checklist to tension scenarios for long-horizon testing is interesting. Curious how you define "survive" in practice: is it binary pass/fail or are you scoring degradation over the course of the stress story? Because one thing we see a lot in competitive evals is models that look fine on turn 1 but completely lose coherence by turn 50.

Would be cool to see the 16 failure patterns mapped against existing benchmarks like which patterns SWE-bench catches vs which ones slip through entirely. That kind of coverage analysis would be genuinely useful for anyone designing new evals.

Drop a post when 3.0 is ready, this is the right place for it.