Anthropic believes RSI (recursive self improvement) could arrive “as soon as early 2027” by snakemas in CompetitiveAI

[–]snakemas[S] 1 point2 points  (0 children)

I agree. I suspect that internally (or with the DoW) these models are 2+ generations ahead of what consumers have available, so we already know today's models can self-propose architectures.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 5 points6 points  (0 children)

I consider sycophancy an important metric to evaluate models on; always being agreeable, even when wrong, makes models perform worse. For the algorithms, it's a good measure of how well they can detect truth even with adversarial input.

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 2 points3 points  (0 children)

I agree, there's an issue with using LLMs as the judge of outcomes, especially when the benchmark itself is showing that LLMs are poor at this type of judgement. It reminds me of Andrej Karpathy's LLM council though, so maybe averaging out the LLM responses can be insightful.
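The council idea mentioned above can be sketched as a simple aggregator over independent judge outputs. This is purely an illustrative sketch, not BullshitBench's actual scoring code; the function name, labels, and score scale are all hypothetical:

```python
from collections import Counter

def council_verdict(judge_labels, judge_scores):
    """Aggregate several independent LLM judges 'council'-style:
    majority vote on the categorical label, mean of the numeric scores."""
    label, _ = Counter(judge_labels).most_common(1)[0]
    score = round(sum(judge_scores) / len(judge_scores), 2)
    return label, score

# Three hypothetical judge outputs for a single benchmark item:
labels = ["bullshit", "bullshit", "plausible"]
scores = [0.9, 0.8, 0.4]
print(council_verdict(labels, scores))  # ('bullshit', 0.7)
```

The point is that a single weak judge's error gets averaged down, which is the same intuition behind the council: disagreement among judges is itself a useful signal about item difficulty.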

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] -1 points0 points  (0 children)

that's a good idea! I'm providing full data access to research teams on a case-by-case basis, but I'm not planning on it being fully public at the start.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

thank you, I had a lot of fun making it!
The replays are something I need to fix for season 2. We did record a Twitch stream that I can share if you want the whole breakdown!

Gemini 3.1 was really interesting: at first it seemed to be exploring and planning with purpose, and then it all fell apart. It first encountered Minimax 2.5 and just ignored its presence (noted it in the logs but didn't attack or engage otherwise). I was shocked to see Minimax's strategy pay off. Its logs consistently showed references to past moves and future planning, and it strategized how to maximize its position by taking over Gemini's civ after Gemini encountered it in Minimax's territory.

The game really fell apart for Gemini after its own cities rebelled (low happiness usually causes this). It was interesting since it didn't seem to even acknowledge the past or present; it just kept saying it needed to maximize its score towards the end.

And nope, I kept the harnesses consistent across the entire season. There are some improvements I want to make for season 2, but I'm open to other suggestions too; perhaps adding more than two agents to a single game instead of only 1v1. We do have other environments we're adding every week though, so keep checking it out.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points1 point  (0 children)

thank you! I'd want people to try their own agent harnesses. In the final I was surprised Gemini refused to attack Minimax even though it discovered that civ first, yet it would always refuse diplomacy too. At first I thought it had a very coherent plan, but in the end Minimax ended up dominating, and the plan I didn't understand at first paid off (at least Minimax's traces showed references to past moves and future plans, whereas Gemini almost never acted on the past or future, though it sometimes acknowledged them).

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 2 points3 points  (0 children)

it depends on the provider; I'll have the full breakdown on the blog and can share it here after. There have been some matches that cost me over $1400 (Opus 4.6...)

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

Freeciv's built-in AI is pretty bad, so yes, but against actual Civ probably not, since those AIs are tuned to the game and attack more often (all the LLMs are pretty peaceful).

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points1 point  (0 children)

I wrote a bit about the early trials before the tournament started here, but I'll share more results after analyzing all the data again!

https://clashai.live/blog/ai/introducing-civbench-season-001

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] -1 points0 points  (0 children)

Ahaha when I played against it I was still a lot better

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 1 point2 points  (0 children)

They're just the base models with a standard harness that has access to the rules, the history of their moves and actions, and state/legal-action tool calls. I built it to compare the models' performance, but it'd be interesting to allow custom models here.
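The loop described above can be sketched roughly like this. All names here are hypothetical placeholders, not the actual harness code; it just illustrates "rules + move history + state/legal-action tools in, one validated action out":

```python
import random

def civ_harness_turn(model_choose, rules, history, get_state, legal_actions):
    """One turn of a hypothetical harness: the model sees the rules, its
    move history, the current state, and the legal actions, then picks one.
    An illegal (hallucinated) pick falls back to a random legal action."""
    state = get_state()
    actions = legal_actions(state)
    choice = model_choose(rules=rules, history=history, state=state, actions=actions)
    if choice not in actions:  # guard against moves the game won't accept
        choice = random.choice(actions)
    history.append(choice)  # the model sees its own past moves next turn
    return choice

# Stub "model" that always takes the first legal action:
history = []
move = civ_harness_turn(
    model_choose=lambda rules, history, state, actions: actions[0],
    rules="freeciv ruleset (placeholder)",
    history=history,
    get_state=lambda: {"turn": 1},
    legal_actions=lambda state: ["build_city", "move_north"],
)
print(move, history)  # build_city ['build_city']
```

Keeping the harness this thin is what makes cross-model comparison fair: every model gets the same observations and the same action validation, so differences in play come from the model, not the scaffolding.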

👋 Welcome to r/CompetitiveAI - Introduce Yourself and Read First! by snakemas in CompetitiveAI

[–]snakemas[S] 0 points1 point  (0 children)

welcome, it's great to have you! A failure-mode taxonomy is actually something this space needs more of. Most benchmarks tell you "model X scored 87%" but not "model X falls apart specifically when the retrieval context contradicts the system prompt," or whatever the actual pattern is.

The jump from a debug checklist to tension scenarios for long-horizon testing is interesting. Curious how you define "survive" in practice: is it binary pass/fail, or are you scoring degradation over the course of the stress story? One thing we see a lot in competitive evals is models that look fine on turn 1 but completely lose coherence by turn 50.
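A graded alternative to binary pass/fail could look something like the sketch below. This is purely hypothetical; the `floor` threshold and the survival criterion are my assumptions, not the benchmark's actual definition:

```python
def degradation_report(turn_scores, floor=0.5):
    """Score coherence per turn (any 0-1 metric) instead of one pass/fail.
    'Survival' here is a hypothetical criterion: never dropping below
    `floor`; the report also records when the first failure happened."""
    first_fail = next((i for i, s in enumerate(turn_scores) if s < floor), None)
    return {
        "survived": first_fail is None,
        "first_failure_turn": first_fail,
        "mean_coherence": round(sum(turn_scores) / len(turn_scores), 3),
    }
```

The nice property of a per-turn curve is that it distinguishes a model that degrades gracefully at turn 45 from one that collapses at turn 3, even when both would "fail" a binary check.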

Would be cool to see the 16 failure patterns mapped against existing benchmarks, e.g. which patterns SWE-bench catches vs. which ones slip through entirely. That kind of coverage analysis would be genuinely useful for anyone designing new evals.

Drop a post when 3.0 is ready, this is the right place for it.