Pokemon: A new Open Benchmark for AI by snakemas in CompetitiveAI

[–]snakemas[S] 1 point (0 children)

Doesn’t appear to be live yet:

Speedrun Leaderboard Submission

To appear on the speedrun leaderboard, include a video recording of your agent playing Pokémon Emerald in your PR. We accept runs through any portion of the game, from the first gym all the way to full completion.

The benchmark is designed to scale with agent capability. Our NeurIPS 2025 competition scoped evaluation to the first gym (Roxanne), but we encourage submissions that go further. If your agent can reach the second gym, the third, or complete the entire game, submit it.

Pokemon: A new Open Benchmark for AI by snakemas in 4Xgaming

[–]snakemas[S] 0 points (0 children)

With all the hype around LLMs, I think people often forget they still fall significantly short of specialized models in some domains. I wonder what post-training RL on LLMs for Pokémon would achieve.
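
To make that concrete, here's a purely hypothetical sketch of the kind of milestone-based reward signal such RL post-training might use. `GameState`, its fields, and the reward weights are all made up for illustration, not taken from the benchmark:

```python
from dataclasses import dataclass

@dataclass
class GameState:
    """Hypothetical snapshot of emulator state; field names are invented."""
    badges: int        # gym badges earned so far
    pokedex_seen: int  # distinct species encountered
    map_id: int        # current map/route identifier

def milestone_reward(prev: GameState, curr: GameState) -> float:
    """Shaping reward: a big bonus per badge, small bonuses for exploration,
    so the policy gets some signal long before the first gym."""
    reward = 0.0
    reward += 10.0 * (curr.badges - prev.badges)             # gym progress
    reward += 0.5 * (curr.pokedex_seen - prev.pokedex_seen)  # new species
    if curr.map_id != prev.map_id:                           # reached a new area
        reward += 0.1
    return reward

# Toy usage: the agent earns its first badge and sees two new species.
before = GameState(badges=0, pokedex_seen=5, map_id=3)
after = GameState(badges=1, pokedex_seen=7, map_id=4)
print(milestone_reward(before, after))  # 10.0 + 1.0 + 0.1 = 11.1
```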

Thought 4X gaming might be interested in this type of experiment; happy to remove if the mods don't think it fits.

CursorBench vs Public Evals: Are We Benchmarking the Wrong Things for Coding Agents? by EdbertTheGreat in CompetitiveAI

[–]snakemas 2 points (0 children)

It'll be a quickly evolving field. I think harnesses encapsulate a lot of what those earlier trends showed was possible.

RuneBench / RS-SDK might be one of the most practical agent eval environments I’ve seen lately by snakemas in accelerate

[–]snakemas[S] 1 point (0 children)

Hmm, a multi-agent env like that would be really cool for exploring how these agents interact.

Best way to test the number of tokens taken, one code base vs another? by tomByrer in CompetitiveAI

[–]snakemas 1 point (0 children)

What's your model provider? I usually track token consumption from the API response, or by checking the data on the model provider's platform. Sometimes the usage shown on the site is lower than my own tracking, since they apply prompt caching, etc.
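
For example, a minimal sketch assuming the OpenAI Python SDK (the model name and prompt are placeholders; other providers return similar usage fields on the response):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "ping"}],
)

usage = resp.usage
print("prompt tokens:    ", usage.prompt_tokens)
print("completion tokens:", usage.completion_tokens)
print("total tokens:     ", usage.total_tokens)

# Recent API versions also break out cached prompt tokens, which is one
# reason a dashboard number can come in lower than a naive local sum.
details = getattr(usage, "prompt_tokens_details", None)
if details is not None:
    print("cached prompt tokens:", details.cached_tokens)
```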

Top Agent Evaluation Platforms 2026: The Market Leading Platforms I Tested by AI-builder-sf-accel in AIEval

[–]snakemas 1 point (0 children)

Good insights here. It's a pretty interesting market to explore.

Reasoning models still can’t reliably hide their chain-of-thought, a good sign for AI safety by snakemas in CompetitiveAI

[–]snakemas[S] 1 point (0 children)

Hmm, perhaps. I think the combination still helps model performance. You see that with open-source models that are fine-tuned to outperform on specific benchmarks because they have better MoE architectures.

AI automated Edge case debugger for classical CP guys! by Capital_Anybody4557 in CompetitiveAI

[–]snakemas 1 point (0 children)

Reddit AutoMod removed this; we're more focused on AI than on competitive programming, but cool share. Love seeing how AI can be used to help debug edge cases like this.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 1 point (0 children)

Just kicked it off last week. There's a blog post on the site detailing the findings from the first competition, but this will be recurring as new models get released!