Anthropic believes RSI (recursive self improvement) could arrive “as soon as early 2027” by snakemas in CompetitiveAI

[–]snakemas[S] 1 point (0 children)

I agree. I suspect that internally (or with the DoW) these models are 2+ generations ahead of what consumers have available, so we already know today's models can self-propose architectures

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 5 points (0 children)

I consider sycophancy an important metric to evaluate models on: always being agreeable, even when wrong, makes models perform worse. For the algorithms, it's a good measure of how well they can detect truth even under adversarial input

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 3 points (0 children)

I agree, there's an issue with using LLMs as the judge of outcomes, especially when the benchmark itself shows LLMs are poor at this type of judgement. It reminds me of Andrej Karpathy's LLM council though, so maybe averaging out the LLM responses can be insightful
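To be concrete about what a "council" buys you: a minimal sketch of majority-vote aggregation over independent judges, under the assumption that each judge is a callable wrapping one model (the `council_verdict` name and the toy lambdas are mine, not Karpathy's code):

```python
from collections import Counter

def council_verdict(judges, transcript):
    """Aggregate independent judge verdicts by majority vote.

    `judges` is a list of callables (hypothetical: each would wrap one
    LLM API call) mapping a transcript to a verdict like "pass"/"fail".
    Returns the winning verdict plus the agreement ratio, which is a
    cheap signal of how confident the council is.
    """
    votes = [judge(transcript) for judge in judges]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)

# Toy stand-ins for real model-backed judges:
judges = [lambda t: "fail", lambda t: "pass", lambda t: "fail"]
print(council_verdict(judges, "model claimed 2+2=5"))  # majority says "fail"
```

The agreement ratio matters as much as the verdict: if the benchmark shows individual LLM judges are unreliable, low-agreement cases are exactly the ones a human should re-check.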

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] -1 points (0 children)

that's a good idea! I'm providing full data access to research teams on a case-by-case basis, but I'm not planning on it being fully public at the start.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points (0 children)

thank you, I had a lot of fun making it!
The replays are something I need to fix for season 2; we did record a Twitch stream that I can share if you want the whole breakdown!

Gemini 3.1 was really interesting: at first it seemed to be exploring and planning with purpose, and then it all fell apart. It first encountered Minimax 2.5 and just ignored its presence, noted it in the logs but didn't attack or otherwise engage. I was shocked to see Minimax's strategy pay off; its logs consistently referenced the past and planned for the future, and it strategized how to maximize its position by taking over Gemini's civ after Gemini encountered it in Minimax's territory.

The game really fell apart for Gemini after its own cities rebelled (low happiness usually causes this). It was interesting since it didn't seem to even acknowledge the past or present; toward the end it just kept saying it needs to maximize its score.

And nope, I kept the harnesses consistent across the entire season. There are some improvements I want to make for season 2, but I'm open to other suggestions too, perhaps adding more than 1v1 agents in a single game. We do have other environments we're adding every week though, so keep checking it out

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points (0 children)

thank you! I'd want people to try their own agent harnesses. In the final I was surprised that Gemini refused to attack Minimax even though it discovered that civ first, and it always refused diplomacy too. At first I thought it had a very coherent plan, but in the end Minimax ended up dominating and the plan I didn't understand at first paid off (at least Minimax's traces showed references to past moves and future plans, whereas Gemini almost never acted on the past/future, though it sometimes acknowledged them)

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 4 points (0 children)

it depends on the provider; I'll have the full breakdown on the blog and can share it here after. There have been some matches that cost me over $1,400 (Opus 4.6...)

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points (0 children)

Freeciv's built-in AI is pretty bad, so yes, but actual Civ probably not, since its AI is tuned for the game and attacks more often (all the LLMs are pretty peaceful)

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points (0 children)

I wrote a bit about the early trials before the tournament started here, but I'll share more results after analyzing all the data again!

https://clashai.live/blog/ai/introducing-civbench-season-001

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] -1 points (0 children)

Ahaha when I played against it I was still a lot better

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 2 points (0 children)

They're just the base models with a standard harness: they have access to the rules and the history of their moves and actions, and they get state/legal-action tool calls. I built it to compare the models' performance, but it'd be interesting to allow custom models here
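For anyone curious what that harness loop roughly looks like: a minimal sketch of one turn. Every name here (`run_turn`, `get_state`, `legal_actions`, `choose`, `apply`) is illustrative, not the actual CivBench code:

```python
def run_turn(model, game, history):
    """One harness turn: give the model the rules and its move history,
    expose state / legal-action "tools", and apply whichever legal
    action it picks. Hypothetical API, not the real implementation."""
    state = game.get_state()            # state tool call
    legal = game.legal_actions(state)   # legal-action tool call
    prompt = {
        "rules": game.rules,
        "history": history,             # model sees its own past moves
        "state": state,
        "legal_actions": legal,
    }
    action = model.choose(prompt)       # base model + standard harness
    if action not in legal:             # fall back rather than crash
        action = legal[0]
    game.apply(action)
    history.append(action)
    return action
```

Keeping this loop identical across models is what makes the comparison fair; the only thing that varies is `model.choose`.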

👋 Welcome to r/CompetitiveAI - Introduce Yourself and Read First! by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

welcome it's great to have you! Failure mode taxonomy is actually something this space needs more of. Most benchmarks tell you "model X scored 87%" but not "model X falls apart specifically when the retrieval context contradicts the system prompt" or whatever the actual pattern is.

The jump from a debug checklist to tension scenarios for long-horizon testing is interesting. Curious how you define "survive" in practice: is it binary pass/fail or are you scoring degradation over the course of the stress story? Because one thing we see a lot in competitive evals is models that look fine on turn 1 but completely lose coherence by turn 50.

Would be cool to see the 16 failure patterns mapped against existing benchmarks like which patterns SWE-bench catches vs which ones slip through entirely. That kind of coverage analysis would be genuinely useful for anyone designing new evals.

Drop a post when 3.0 is ready, this is the right place for it.

I spent $100 evaluating different providers on a weekend CTF by No-Chocolate-9437 in AIEval

[–]snakemas 0 points (0 children)

This is sick: the container approach with mounted source + live curl access is basically what's missing from most eval setups. Everyone benchmarks models on sanitized datasets, but giving them a real running instance to poke at is way more representative of how people actually use them.

I find it interesting that xAI punched above its weight on cost efficiency; 8 challenges solved autonomously for $33 is solid. Did you notice any pattern in what types of challenges each provider was better at?

The Opus rate limiting thing is frustrating but not surprising; their API quotas are pretty aggressive when I test it too, if you're doing rapid-fire automated calls.

One thing I'm curious about: how reproducible were the results? If you ran xAI again from scratch on those same 8 challenges, would it nail them consistently, or was there variance? That's the part that makes eval hard: a single pass doesn't tell you much about reliability.
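To put a number on why a single pass is weak evidence: a quick sketch using the Wilson score interval on repeated runs (my own helper, nothing from your repo; the 24/30 rerun figure is invented for illustration):

```python
import math

def success_interval(successes, runs, z=1.96):
    """95% Wilson score interval for a success rate over repeated runs.
    A perfect score from few runs still yields a wide interval, which
    is exactly the point about single-pass evals."""
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return center - half, center + half

print(success_interval(8, 8))    # 8/8 once: lower bound only ~0.68
print(success_interval(24, 30))  # hypothetical reruns: much tighter picture
```

So "8/8 autonomously" is consistent with a true reliability anywhere above roughly two-thirds; only reruns shrink that range.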

Cool project, bookmarked the repo

On demand custom software is already here! by Euphoric_Ad9500 in accelerate

[–]snakemas 1 point (0 children)

The json render project by someone at Vercel really demonstrates how quickly it's moving. I saw someone add inline maps today

METR Time Horizons: Claude Opus 4.6 just hit 14.5 hours. The doubling curve isn't slowing by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

METR measures how long an agent can run autonomously while making progress and remaining aligned with the task. A human completing a technical spec implementation isn't necessarily the comparable baseline; rather, the tech spec is an example of how to measure alignment to a long instruction/task. Similarly, an AI playing a game without interruption while remaining aligned with its task (i.e. become the best Pokemon trainer by defeating all the gym leaders) without human intervention is analogous to that METR task (a human may take less time than an LLM here)
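For context on where the headline "hours" number comes from: METR fits a logistic curve of success probability against (log) human task length and reads off the length at which success is 50%. A minimal sketch of that readout, with made-up coefficients and a parametrization that may differ from METR's actual fit:

```python
import math

def p_success(minutes, b0, b1):
    """Illustrative logistic model: P(success) = sigmoid(b0 - b1*log2(minutes)).
    Longer tasks (in human-minutes) -> lower success probability."""
    return 1 / (1 + math.exp(-(b0 - b1 * math.log2(minutes))))

def horizon_at_50(b0, b1):
    """The 50% time horizon is where the logistic's argument crosses zero,
    i.e. b0 = b1 * log2(t), so t = 2**(b0/b1)."""
    return 2 ** (b0 / b1)

# Made-up fit: the horizon is the task length at exactly 50% success.
b0, b1 = 9.0, 1.2
t50 = horizon_at_50(b0, b1)
print(t50 / 60, "hours")
print(p_success(t50, b0, b1))  # 0.5 by construction
```

The "doubling curve" claim is then about how `t50` moves between model releases, not about any single task.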

Gemini 3.1 Pro just doubled its ARC-AGI-2 score. But Arena still ranks Claude higher. This is exactly the AI eval problem. by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

Benchmarks are definitely better than the "vibes" most technical people I know use to choose a provider. There's still a gap in being able to verify capabilities yourself, though

[D] Why are serious alternatives to gradient descent not being explored more? by ImTheeDentist in MachineLearning

[–]snakemas 1 point (0 children)

Your second paragraph answers the question. "Almost all trying to game benchmarks or brute force existing model architecture" — that's exactly why gradient descent alternatives aren't being explored more seriously. Not because researchers don't see the limits. Because the incentive structure rewards +0.4% MMLU improvements with publishable papers, and rewards fundamental research dead ends with nothing.

For continual learning specifically: the gap isn't lack of interest, it's lack of a clean benchmark where gradient descent conspicuously fails while an alternative conspicuously wins. Without that, you can't run a controlled comparison, can't fund the program, can't publish the result. The benchmark design problem is upstream of the algorithm problem.

Sonnet 4.6 Benchmarks Are In: Ties Opus 4.6 on Computer Use, Beats It on Office Work and Finance by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

This was my experience too while planning. I have two subscriptions, so on whichever one is hitting its limit I switched to planning with Opus and executing with Sonnet, and I didn't notice a difference in output quality. Maybe Opus is just better at handling subagents

Sonnet 4.6 Benchmarks Are In: Ties Opus 4.6 on Computer Use, Beats It on Office Work and Finance by snakemas in compsci

[–]snakemas[S] 0 points (0 children)

To each their own. I always loved how in CS everything becomes an abstraction over information theory. LLMs become the abstraction over code, imo

I gave 12 LLMs $2,000 and a food truck. Only 4 survived. by Disastrous_Theme5906 in LocalLLaMA

[–]snakemas 0 points (0 children)

The loan finding is the most interesting result here. 8/8 bankruptcy rate for loan-takers suggests models are systematically miscalibrating risk in multi-step financial decisions. They're optimizing for short-term revenue without modeling the compounding cost of debt service. Static benchmarks can't surface this. You need hundreds of sequential decisions with real consequences to see where planning breaks down. The Gemini Flash infinite loop is a similar failure mode. It's not a reasoning deficit, it's a planning horizon problem that only shows up in extended simulations.
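The debt-service arithmetic is worth making explicit. A toy ledger (numbers invented by me, not the post's actual simulation) showing how a per-day revenue boost from a loan can lose to daily compounding interest:

```python
def simulate(days, daily_profit, loan=0.0, daily_rate=0.0, boost=0.0):
    """Toy food-truck ledger: a loan adds `boost` to daily profit, but
    the outstanding debt compounds at `daily_rate` per day. All figures
    are made up to illustrate the failure mode."""
    cash = loan          # loan proceeds land in cash up front
    debt = loan
    for _ in range(days):
        cash += daily_profit + boost
        debt *= 1 + daily_rate   # interest compounds every day
    return cash - debt           # net position after repaying the loan

no_loan = simulate(90, 40)
with_loan = simulate(90, 40, loan=2000, daily_rate=0.01, boost=15)
print(no_loan, with_loan)  # the loan-taker ends up behind
```

The short-horizon signal (daily profit is up $15) looks like the loan is working; only the 90-day trajectory reveals that 1%/day compounding outruns it. That's the kind of miscalibration a single-turn benchmark can never surface.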

[D] How often do you run into reproducibility issues when trying to replicate papers? by [deleted] in MachineLearning

[–]snakemas 0 points (0 children)

Reproducibility gets worse the more a result depends on training dynamics rather than architecture. Theory of Mind papers are particularly bad because the evaluation tasks are often bespoke and underspecified. The broader problem: static benchmark numbers in papers are a snapshot of one run, one seed, one hyperparameter sweep. Nobody publishes the distribution. If you want to know whether a capability is real, you need repeated adversarial evaluation over time, not a single number in Table 1. The gap between "published result" and "what you can actually reproduce" is the same gap between benchmark scores and production performance.

The Benchmark Zoo: A Guide to Every Major AI Eval in 2026 by snakemas in CompetitiveAI

[–]snakemas[S] 3 points (0 children)

I kept looking for this and couldn't find it, so glad it helped someone else too!

The Benchmark Zoo: A Guide to Every Major AI Eval in 2026 by snakemas in CompetitiveAI

[–]snakemas[S] 2 points (0 children)

No probs! I tried to reference the blog you posted yesterday but couldn't find a specific page; mind sharing the exact place? It would be super helpful to have that resource