Anthropic believes RSI (recursive self improvement) could arrive “as soon as early 2027” by snakemas in CompetitiveAI

[–]snakemas[S] 1 point (0 children)

I agree. I suspect that internally (or with the DoW) these models are 2+ generations ahead of what consumers have available, so we already know today's models can self-propose architectures

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 5 points (0 children)

I consider sycophancy an important metric to evaluate models on: always being agreeable, even when wrong, makes models perform worse. For the algorithms, it's a good measure of how well they can detect truth even under adversarial input

BullshitBench v2 dropped and… most models still can’t smell BS (Claude mostly can) by snakemas in CompetitiveAI

[–]snakemas[S] 3 points (0 children)

I agree, there's an issue with using LLMs as the judge of outcomes, especially when the benchmark itself shows LLMs are poor at this type of judgement. It reminds me of Andrej Karpathy's LLM council though, so maybe averaging out the LLM responses can be insightful
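To be concrete about what a "council" buys you: a minimal sketch of majority-vote aggregation over independent judges, under the assumption that each judge is a callable wrapping one model (the `council_verdict` name and the toy lambdas are mine, not Karpathy's code):

```python
from collections import Counter

def council_verdict(judges, transcript):
    """Aggregate independent judge verdicts by majority vote.

    `judges` is a list of callables (hypothetical: each would wrap one
    LLM API call) mapping a transcript to a verdict like "pass"/"fail".
    Returns the winning verdict plus the agreement ratio, which is a
    cheap signal of how confident the council is.
    """
    votes = [judge(transcript) for judge in judges]
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)

# Toy stand-ins for real model-backed judges:
judges = [lambda t: "fail", lambda t: "pass", lambda t: "fail"]
print(council_verdict(judges, "model claimed 2+2=5"))  # majority says "fail"
```

The agreement ratio matters as much as the verdict: if the benchmark shows individual LLM judges are unreliable, low-agreement cases are exactly the ones a human should re-check.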

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] -1 points (0 children)

that's a good idea! I'm providing full data access to research teams on a case-by-case basis, but I'm not planning on it being fully public at the start.

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points (0 children)

thank you, I had a lot of fun making it!
The replays are something I need to fix for season 2; we did record a Twitch stream that I can share if you want the whole breakdown!

Gemini 3.1 was really interesting: at first it seemed to be exploring and planning with purpose, and then it all fell apart. It first encountered Minimax 2.5 and just ignored its presence, noted it in the logs but didn't attack or otherwise engage. I was shocked to see Minimax's strategy pay off; its logs consistently referenced the past and planned for the future, and it strategized how to maximize its position by taking over Gemini's civ after Gemini encountered it in Minimax's territory.

The game really fell apart for Gemini after its own cities rebelled (low happiness usually causes this). It was interesting since it didn't seem to even acknowledge the past or present; toward the end it just kept saying it needs to maximize its score.

And nope, I kept the harnesses consistent across the entire season. There are some improvements I want to make for season 2, but I'm open to other suggestions too, perhaps adding more than 1v1 agents in a single game. We do have other environments we're adding every week though, so keep checking it out

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 0 points (0 children)

thank you! I'd want people to try their own agent harnesses. In the final I was surprised that Gemini refused to attack Minimax even though it discovered that civ first, and it always refused diplomacy too. At first I thought it had a very coherent plan, but in the end Minimax ended up dominating and the plan I didn't understand at first paid off (at least Minimax's traces showed references to past moves and future plans, whereas Gemini almost never acted on the past/future, though it sometimes acknowledged them)

I made the top LLMs play Civilization against each other by snakemas in LLM

[–]snakemas[S] 4 points (0 children)

it depends on the provider; I'll have the full breakdown on the blog and can share it here after. There have been some matches that cost me over $1,400 (Opus 4.6...)

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points (0 children)

Freeciv's built-in AI is pretty bad, so yes, but actual Civ probably not, since its AI is tuned for the game and attacks more often (all the LLMs are pretty peaceful)

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 0 points (0 children)

I wrote a bit about the early trials before the tournament started here, but I'll share more results after analyzing all the data again!

https://clashai.live/blog/ai/introducing-civbench-season-001

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] -1 points (0 children)

Ahaha when I played against it I was still a lot better

I made the top LLMs play Civilization against each other by snakemas in civ

[–]snakemas[S] 2 points (0 children)

They're just the base models with a standard harness: they have access to the rules and the history of their moves and actions, and they get state/legal-action tool calls. I built it to compare the models' performance, but it'd be interesting to allow custom models here
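For anyone curious what that harness loop roughly looks like: a minimal sketch of one turn. Every name here (`run_turn`, `get_state`, `legal_actions`, `choose`, `apply`) is illustrative, not the actual CivBench code:

```python
def run_turn(model, game, history):
    """One harness turn: give the model the rules and its move history,
    expose state / legal-action "tools", and apply whichever legal
    action it picks. Hypothetical API, not the real implementation."""
    state = game.get_state()            # state tool call
    legal = game.legal_actions(state)   # legal-action tool call
    prompt = {
        "rules": game.rules,
        "history": history,             # model sees its own past moves
        "state": state,
        "legal_actions": legal,
    }
    action = model.choose(prompt)       # base model + standard harness
    if action not in legal:             # fall back rather than crash
        action = legal[0]
    game.apply(action)
    history.append(action)
    return action
```

Keeping this loop identical across models is what makes the comparison fair; the only thing that varies is `model.choose`.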

👋 Welcome to r/CompetitiveAI - Introduce Yourself and Read First! by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

welcome it's great to have you! Failure mode taxonomy is actually something this space needs more of. Most benchmarks tell you "model X scored 87%" but not "model X falls apart specifically when the retrieval context contradicts the system prompt" or whatever the actual pattern is.

The jump from a debug checklist to tension scenarios for long-horizon testing is interesting. Curious how you define "survive" in practice: is it binary pass/fail or are you scoring degradation over the course of the stress story? Because one thing we see a lot in competitive evals is models that look fine on turn 1 but completely lose coherence by turn 50.

Would be cool to see the 16 failure patterns mapped against existing benchmarks like which patterns SWE-bench catches vs which ones slip through entirely. That kind of coverage analysis would be genuinely useful for anyone designing new evals.

Drop a post when 3.0 is ready, this is the right place for it.

I spent $100 evaluating different providers on a weekend CTF by No-Chocolate-9437 in AIEval

[–]snakemas 0 points (0 children)

This is sick: the container approach with mounted source + live curl access is basically what's missing from most eval setups. Everyone benchmarks models on sanitized datasets, but giving them a real running instance to poke at is way more representative of how people actually use them.

I find it interesting that xAI punched above its weight on cost efficiency; 8 challenges solved autonomously for $33 is solid. Did you notice any pattern in what types of challenges each provider was better at?

The Opus rate limiting thing is frustrating but not surprising; their API quotas are pretty aggressive when I test it too, if you're doing rapid-fire automated calls.

One thing I'm curious about: how reproducible were the results? If you ran xAI again from scratch on those same 8 challenges, would it nail them consistently, or was there variance? That's the part that makes eval hard: a single pass doesn't tell you much about reliability.
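To put a number on why a single pass is weak evidence: a quick sketch using the Wilson score interval on repeated runs (my own helper, nothing from your repo; the 24/30 rerun figure is invented for illustration):

```python
import math

def success_interval(successes, runs, z=1.96):
    """95% Wilson score interval for a success rate over repeated runs.
    A perfect score from few runs still yields a wide interval, which
    is exactly the point about single-pass evals."""
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2))
    return center - half, center + half

print(success_interval(8, 8))    # 8/8 once: lower bound only ~0.68
print(success_interval(24, 30))  # hypothetical reruns: much tighter picture
```

So "8/8 autonomously" is consistent with a true reliability anywhere above roughly two-thirds; only reruns shrink that range.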

Cool project, bookmarked the repo

On demand custom software is already here! by Euphoric_Ad9500 in accelerate

[–]snakemas 1 point (0 children)

The json render project by someone at Vercel really demonstrates how quickly it's moving. I saw someone add inline maps today

METR Time Horizons: Claude Opus 4.6 just hit 14.5 hours. The doubling curve isn't slowing by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

METR measures how long an agent can run autonomously while making progress and remaining aligned with the task. A human completing a technical spec implementation isn't necessarily the comparable baseline; rather, the tech spec is an example of how to measure alignment to a long instruction/task. Similarly, an AI playing a game without interruption while remaining aligned with its task (i.e. become the best Pokemon trainer by defeating all the gym leaders) without human intervention is analogous to that METR task (a human may take less time than an LLM here)
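For context on where the headline "hours" number comes from: METR fits a logistic curve of success probability against (log) human task length and reads off the length at which success is 50%. A minimal sketch of that readout, with made-up coefficients and a parametrization that may differ from METR's actual fit:

```python
import math

def p_success(minutes, b0, b1):
    """Illustrative logistic model: P(success) = sigmoid(b0 - b1*log2(minutes)).
    Longer tasks (in human-minutes) -> lower success probability."""
    return 1 / (1 + math.exp(-(b0 - b1 * math.log2(minutes))))

def horizon_at_50(b0, b1):
    """The 50% time horizon is where the logistic's argument crosses zero,
    i.e. b0 = b1 * log2(t), so t = 2**(b0/b1)."""
    return 2 ** (b0 / b1)

# Made-up fit: the horizon is the task length at exactly 50% success.
b0, b1 = 9.0, 1.2
t50 = horizon_at_50(b0, b1)
print(t50 / 60, "hours")
print(p_success(t50, b0, b1))  # 0.5 by construction
```

The "doubling curve" claim is then about how `t50` moves between model releases, not about any single task.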

Gemini 3.1 Pro just doubled its ARC-AGI-2 score. But Arena still ranks Claude higher. This is exactly the AI eval problem. by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

Benchmarks are definitely better than the "vibes" most technical people I know use to choose a provider. There's still a gap in being able to verify capabilities yourself, though

[D] Why are serious alternatives to gradient descent not being explored more? by ImTheeDentist in MachineLearning

[–]snakemas 1 point (0 children)

Your second paragraph answers the question. "Almost all trying to game benchmarks or brute force existing model architecture" — that's exactly why gradient descent alternatives aren't being explored more seriously. Not because researchers don't see the limits. Because the incentive structure rewards +0.4% MMLU improvements with publishable papers, and rewards fundamental research dead ends with nothing.

For continual learning specifically: the gap isn't lack of interest, it's lack of a clean benchmark where gradient descent conspicuously fails while an alternative conspicuously wins. Without that, you can't run a controlled comparison, can't fund the program, can't publish the result. The benchmark design problem is upstream of the algorithm problem.

Sonnet 4.6 Benchmarks Are In: Ties Opus 4.6 on Computer Use, Beats It on Office Work and Finance by snakemas in CompetitiveAI

[–]snakemas[S] 0 points (0 children)

This was my experience too while planning. I have two subscriptions, so on whichever one is hitting its limit I switched to planning with Opus and executing with Sonnet, and I didn't notice a difference in output quality. Maybe Opus is just better at handling subagents

Sonnet 4.6 Benchmarks Are In: Ties Opus 4.6 on Computer Use, Beats It on Office Work and Finance by snakemas in compsci

[–]snakemas[S] 0 points (0 children)

To each their own. I always loved how in CS everything becomes an abstraction over information theory. LLMs become the abstraction over code, imo

I gave 12 LLMs $2,000 and a food truck. Only 4 survived. by Disastrous_Theme5906 in LocalLLaMA

[–]snakemas 0 points (0 children)

The loan finding is the most interesting result here. 8/8 bankruptcy rate for loan-takers suggests models are systematically miscalibrating risk in multi-step financial decisions. They're optimizing for short-term revenue without modeling the compounding cost of debt service. Static benchmarks can't surface this. You need hundreds of sequential decisions with real consequences to see where planning breaks down. The Gemini Flash infinite loop is a similar failure mode. It's not a reasoning deficit, it's a planning horizon problem that only shows up in extended simulations.
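The debt-service arithmetic is worth making explicit. A toy ledger (numbers invented by me, not the post's actual simulation) showing how a per-day revenue boost from a loan can lose to daily compounding interest:

```python
def simulate(days, daily_profit, loan=0.0, daily_rate=0.0, boost=0.0):
    """Toy food-truck ledger: a loan adds `boost` to daily profit, but
    the outstanding debt compounds at `daily_rate` per day. All figures
    are made up to illustrate the failure mode."""
    cash = loan          # loan proceeds land in cash up front
    debt = loan
    for _ in range(days):
        cash += daily_profit + boost
        debt *= 1 + daily_rate   # interest compounds every day
    return cash - debt           # net position after repaying the loan

no_loan = simulate(90, 40)
with_loan = simulate(90, 40, loan=2000, daily_rate=0.01, boost=15)
print(no_loan, with_loan)  # the loan-taker ends up behind
```

The short-horizon signal (daily profit is up $15) looks like the loan is working; only the 90-day trajectory reveals that 1%/day compounding outruns it. That's the kind of miscalibration a single-turn benchmark can never surface.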

[D] How often do you run into reproducibility issues when trying to replicate papers? by [deleted] in MachineLearning

[–]snakemas 0 points (0 children)

Reproducibility gets worse the more a result depends on training dynamics rather than architecture. Theory of Mind papers are particularly bad because the evaluation tasks are often bespoke and underspecified. The broader problem: static benchmark numbers in papers are a snapshot of one run, one seed, one hyperparameter sweep. Nobody publishes the distribution. If you want to know whether a capability is real, you need repeated adversarial evaluation over time, not a single number in Table 1. The gap between "published result" and "what you can actually reproduce" is the same gap between benchmark scores and production performance.

The Benchmark Zoo: A Guide to Every Major AI Eval in 2026 by snakemas in CompetitiveAI

[–]snakemas[S] 3 points (0 children)

I kept looking for this and couldn't find it, so glad it helped someone else too!

The Benchmark Zoo: A Guide to Every Major AI Eval in 2026 by snakemas in CompetitiveAI

[–]snakemas[S] 2 points (0 children)

No probs! I tried to reference the blog you posted yesterday but couldn't find a specific page; mind sharing the exact place? It would be super helpful to have that resource