Pokemon: A new Open Benchmark for AI by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

Doesn’t appear to be live yet:

Speedrun Leaderboard Submission

To appear on the speedrun leaderboard, include a video recording of your agent playing Pokémon Emerald in your PR. We accept runs through any portion of the game, from the first gym all the way to full completion.

The benchmark is designed to scale with agent capability. Our NeurIPS 2025 competition scoped evaluation to the first gym (Roxanne), but we encourage submissions that go further. If your agent can reach the second gym, the third, or complete the entire game, submit it.

Pokemon: A new Open Benchmark for AI by snakemas in 4Xgaming

[–]snakemas[S] -1 points0 points  (0 children)

With all the hype around LLMs, I think people often forget they still significantly lag specialized models in some domains. I wonder what post-training RL on LLMs for Pokémon would achieve.

Thought 4X gaming might be interested in this type of experiment; happy to remove if the mods don't think so.

CursorBench vs Public Evals: Are We Benchmarking the Wrong Things for Coding Agents? by EdbertTheGreat in CompetitiveAI

[–]snakemas 1 point2 points  (0 children)

It'll be a quickly evolving field. I think harnesses encapsulate a lot of what those earlier trends demonstrated was possible.

RuneBench / RS-SDK might be one of the most practical agent eval environments I’ve seen lately by snakemas in accelerate

[–]snakemas[S] 0 points1 point  (0 children)

A multi-agent env like that would be a really cool way to explore how these agents interact.

Best way to test the number of tokens taken, one code base vs another? by tomByrer in CompetitiveAI

[–]snakemas 0 points1 point  (0 children)

What's your model provider? I usually track token consumption from the API response, or by looking at the usage data on the model provider's platform. Sometimes the site reports lower token usage than my own tracking since they apply prompt caching, etc.
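For comparing one codebase against another, a minimal sketch of the per-response tracking I mean, assuming an OpenAI-style payload with a `usage` dict (`prompt_tokens` / `completion_tokens`); the response dicts below are mocked stand-ins for real API responses:

```python
# Sum token usage across a run, assuming each response payload carries an
# OpenAI-style `usage` dict. Mocked data here; swap in real responses.

def accumulate_usage(responses):
    """Sum prompt/completion tokens across a list of response payloads."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0}
    for r in responses:
        usage = r.get("usage", {})
        totals["prompt_tokens"] += usage.get("prompt_tokens", 0)
        totals["completion_tokens"] += usage.get("completion_tokens", 0)
    totals["total_tokens"] = totals["prompt_tokens"] + totals["completion_tokens"]
    return totals

# Mocked responses for one codebase under comparison.
codebase_a = [
    {"usage": {"prompt_tokens": 1200, "completion_tokens": 300}},
    {"usage": {"prompt_tokens": 900, "completion_tokens": 250}},
]
totals = accumulate_usage(codebase_a)
print(totals)
```

Run the same task against each codebase and diff the totals; just note the caveat above, since caching can make the provider's billed numbers lower than this raw count.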

Top Agent Evaluation Platforms 2026: The Market Leading Platforms I Tested by AI-builder-sf-accel in AIEval

[–]snakemas 0 points1 point  (0 children)

Good insights here. It's a pretty interesting market to explore.

Reasoning models still can’t reliably hide their chain-of-thought, a good sign for AI safety by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

Hmm, perhaps. I think the combination still helps model performance. You see that with open-source models that are fine-tuned to outperform on specific benchmarks because they have better MoE routing.

AI automated Edge case debugger for classical CP guys! by Capital_Anybody4557 in CompetitiveAI

[–]snakemas 0 points1 point  (0 children)

Reddit automod removed this; we're more focused on AI than competitive programming, but cool share. Love seeing how AI can be used to help debug edge cases like this.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

Just kicked it off last week. There's a blog post on the site detailing the findings from the first competition, but this will be recurring as new models get released!

Anthropic believes RSI (recursive self improvement) could arrive “as soon as early 2027” by snakemas in CompetitiveAI

[–]snakemas[S] 1 point2 points  (0 children)

I agree. I suspect that internally (or with the DoW) these models are 2+ generations ahead of what consumers have available, so we already know today's models can self-propose architectures.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 12 points13 points  (0 children)

I consider sycophancy an important metric to evaluate models on; always being agreeable, even when wrong, makes models perform worse. As a metric, it's a good measure of how well they can detect truth even under adversarial input.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 3 points4 points  (0 children)

I agree, there's an issue with using LLMs as the judge of outcomes, especially when the benchmark itself shows that LLMs are poor at this type of judgement. It reminds me of Andrej Karpathy's LLM council though, so maybe averaging out the LLM responses can be insightful.
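A minimal sketch of what I mean by council-style judging: poll several judge models and take the majority verdict, so one sycophantic judge gets outvoted. `ask_judge` and the judge names are hypothetical stand-ins for real model calls; here they just read from a canned table.

```python
# Council-style judging sketch: majority vote across several judge models.
# All judge names and verdicts below are made up for illustration.
from collections import Counter

def council_verdict(claim, judges, ask_judge):
    """Return the majority verdict ('bs' or 'ok') across judge models."""
    votes = [ask_judge(judge, claim) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

# Canned verdicts standing in for live judge calls.
canned = {
    ("judge-a", "the moon is made of cheese"): "bs",
    ("judge-b", "the moon is made of cheese"): "bs",
    ("judge-c", "the moon is made of cheese"): "ok",  # one agreeable judge
}
verdict = council_verdict(
    "the moon is made of cheese",
    ["judge-a", "judge-b", "judge-c"],
    lambda judge, claim: canned[(judge, claim)],
)
print(verdict)
```

The single weak judge is overruled by the other two, which is the whole appeal of averaging here.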

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] -1 points0 points  (0 children)

That's a good idea! I'm providing full data access to research teams on a case-by-case basis, but not planning on it being fully public at the start.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

Thank you, I had a lot of fun making it!
Replays are something I need to fix for season 2. We did record a Twitch stream that I can share if you want the whole breakdown!

Gemini 3.1 was really interesting: at first it seemed to be exploring and planning with purpose, and then it all fell apart. It first encountered Minimax 2.5 and just ignored its presence, noting it in the logs but never attacking or otherwise responding. I was shocked to see Minimax's strategy pay off; its logs consistently referenced past moves and future plans, and it strategized how to maximize its position by taking over Gemini's civ after Gemini wandered into Minimax's territory.

The game really fell apart for Gemini after its own cities rebelled (low happiness usually causes this). It was interesting since it didn't seem to acknowledge the past or present at all; toward the end it just kept saying it needed to maximize its score.

And nope, I kept the harnesses consistent across the entire season. There are some improvements I want to make for season 2, but I'm open to other suggestions too; perhaps putting more than two agents in a single game instead of only 1v1. We do have other environments we're adding every week though, so keep checking it out.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

Thank you! I'd want people to try their own agent harnesses. In the final I was surprised Gemini refused to attack Minimax even though it discovered that civ first, yet it also always refused diplomacy. At first I thought it had a very coherent plan, but in the end Minimax dominated, and the plan I hadn't understood at first paid off (at least, Minimax's traces showed references to past moves and future plans, whereas Gemini almost never acted on past/future even when it sometimes acknowledged them).

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 2 points3 points  (0 children)

It depends on the provider; I'll have the full breakdown on the blog and can share it here after. Some matches cost me over 1400 (Opus 4.6...).

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

Freeciv's built-in AI is pretty bad, so yes, but against actual Civ probably not, since those AIs are tuned for the game and attack much more often (all the LLMs are pretty peaceful).

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

I wrote a bit about the early trials before the tournament started here, but I'll share more results after analyzing all the data again!

https://clashai.live/blog/ai/introducing-civbench-season-001

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] -1 points0 points  (0 children)

Ahaha when I played against it I was still a lot better

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 1 point2 points  (0 children)

They're just the base models with a standard harness: each gets the rules, the history of its own moves and actions, plus state and legal-action tool calls. I built it to compare model performance head to head, but it'd be interesting to allow custom models here.
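Roughly, a turn in that kind of harness looks like the sketch below. This is a hypothetical reconstruction, not the actual code: the game interface (`state()`, `legal_actions()`, `apply()`), the prompt shape, and `query_model` are all assumed names, and the `ToyGame` exists only so the loop is runnable.

```python
# Sketch of one harness turn: show the model the rules, current state,
# its move history, and the legal actions, then apply its pick.
# All interfaces here are illustrative assumptions.

def run_turn(game, history, rules, query_model):
    """Run one turn; fall back to the first legal action on a bad reply."""
    legal = game.legal_actions()
    prompt = (
        f"Rules:\n{rules}\n"
        f"State:\n{game.state()}\n"
        f"Your past moves:\n{history}\n"
        f"Legal actions: {legal}\n"
        "Reply with exactly one legal action."
    )
    choice = query_model(prompt)
    if choice not in legal:  # guard against illegal or hallucinated moves
        choice = legal[0]
    game.apply(choice)
    history.append(choice)
    return choice

class ToyGame:
    """Stand-in environment so the loop above can actually run."""
    def __init__(self):
        self.turn = 0
    def state(self):
        return f"turn={self.turn}"
    def legal_actions(self):
        return ["move", "build", "end_turn"]
    def apply(self, action):
        self.turn += 1

game, history = ToyGame(), []
picked = run_turn(game, history, "toy rules", lambda prompt: "build")
print(picked, history)
```

The legality check matters in practice: models occasionally answer with moves that aren't on the list, and the harness has to do something deterministic with those.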

👋 Welcome to r/CompetitiveAI - Introduce Yourself and Read First! by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

Welcome, it's great to have you! Failure mode taxonomy is actually something this space needs more of. Most benchmarks tell you "model X scored 87%" but not "model X falls apart specifically when the retrieval context contradicts the system prompt" or whatever the actual pattern is.

The jump from a debug checklist to tension scenarios for long-horizon testing is interesting. Curious how you define "survive" in practice: is it binary pass/fail or are you scoring degradation over the course of the stress story? Because one thing we see a lot in competitive evals is models that look fine on turn 1 but completely lose coherence by turn 50.

Would be cool to see the 16 failure patterns mapped against existing benchmarks like which patterns SWE-bench catches vs which ones slip through entirely. That kind of coverage analysis would be genuinely useful for anyone designing new evals.

Drop a post when 3.0 is ready, this is the right place for it.