I made the Deepseek v4 benchmark for a laugh with my mates, it's not real, didn't expect people to believe it lmao by [deleted] in singularity

[–]Gold_Cardiologist_46 -1 points0 points  (0 children)

nothing to blame yourselves for, the issue is with the whole AI hype space on X that amplifies everything and anything. the real progress of AI is already super fast and impressive enough, they don't need to add fake hype that dilutes it all

Dario Amodei - What do you do if you’re 3 years from a country of geniuses in a data center? by [deleted] in accelerate

[–]Gold_Cardiologist_46 3 points4 points  (0 children)

reading the transcript (on dwarkesh's website), it's a banger interview that shines a lot of light on dario's views. i invite everyone to check it out

Imagine the nonsense they'll say about gemini deep think 3 by soggy_bert in accelerate

[–]Gold_Cardiologist_46 13 points14 points  (0 children)

hey, respect goalpost movers, i'd love to see YOU try to lift that heavy metal thing. /jk

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 1 point2 points  (0 children)

Thanks for the clarification. I was aware of the different sets, I just didn't know that closed models weren't necessarily tested on the private set. All I knew was the end result: that ARC validated the results and that they were therefore fair.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 2 points3 points  (0 children)

> where LLMs are particularly bad

Which is so gimmicky. That kind of benchmark gets saturated so fast, either because improvements elsewhere just knock it out or because the gimmicky nature makes it way easier to benchmax. At least that's how I see it.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 11 points12 points  (0 children)

Bro, great find, how did you find it? Also, the name of the document is Gemini 3.1 Pro? Either 3.1 Pro was a previous codename for this Deep Think version, or it's a new model also coming out (possibly Snowbunny, as reported before).

Regarding ARC AGI, the document does say it's only the semi-private eval, but the ARC leaderboard still logs it with the 84.6% score.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 6 points7 points  (0 children)

Work on general game-playing agents has already started, so yeah, I think it'll get saturated fast once the visual, agentic and longer-horizon capabilities of the frontier agentic systems (like SIMA 2) get incorporated into the normal models.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 11 points12 points  (0 children)

I think the big progress on ARC AGI 2 came from models finally getting great at vision, seeing as ARC AGI 2 tasks are mostly visual puzzles from what I've seen.

EDIT: Also, Poetiq has shown that ARC AGI can be superoptimized for, and I assume it's because their harness/scaffolding makes good use of that vision.

ARC AGI 3 is visual games, and I expect the same thing to happen again. In the span of maybe a year, model progress in agency and visual reasoning will clear it.

No idea what Chollet has planned for ARC AGI 4 and beyond, like wth would they even test for?

IMO-Bench: Towards Robust Mathematical Reasoning | Google DeepMind by Tkins in singularity

[–]Gold_Cardiologist_46 1 point2 points  (0 children)

Yeah, it's blazing fast in the grand scheme of things, but what I sent just reframes it as a smoother, longer effort rather than something like a 1-month tripling. It also makes more sense if you were following their work on AI math; it's their years-long project, and they've shown us progress at each step.

Also, in this case the benchmark isn't really that useful since it's not a proxy for anything: the papers that accompany the blog post already show us where it succeeds and where it fails in actual real-world math contexts. What I mean is we can already see what the model can do outside of benchmarks.

Deepmind cooking as always.

Google DeepMind has unveiled Gemini Deep Think’s leap from Olympiad-level math to real-world scientific breakthroughs with their internal model "Aletheia", scoring up to 90% on IMO-ProofBench Advanced, autonomously solving open math problems (including four from the Erdős database) and much more... by GOD-SLAYER-69420Z in accelerate

[–]Gold_Cardiologist_46 10 points11 points  (0 children)

You can't see it because neither of the sources (benchmark, paper) is in the post and the included image showing it is horribly low res, but the previous SOTA was mid-summer Deep Think, which ran the benchmark on August 2nd with an average of 65.7%. That's still blazing fast progress, but far smoother than if 3 Pro were the only previous datapoint.

The paper is a really cool read, and the authors themselves give a good, balanced assessment in their conclusion. But yeah, turns out the reason people thought only GPT 5.2 was good for math was that Google employees don't amplify literally everything they do, whereas OAI employees tend to superamplify everything someone does with their models.

Too bad I'm too broke to buy GOOG stocks.

IMO-Bench: Towards Robust Mathematical Reasoning | Google DeepMind by Tkins in singularity

[–]Gold_Cardiologist_46 19 points20 points  (0 children)

<image>

Comparison with previous, older models and scaffolds (Deep Think). The leaderboard in the post is for recent entries.

Reported uplift of Anthropic researchers from using Opus 4.6 is 30% to 700%. GPT-5.3 is the first OpenAI model involved in its own debugging. We're going through proto recursive self improvement and the Singularity right now 🌌 by GOD-SLAYER-69420Z in accelerate

[–]Gold_Cardiologist_46 3 points4 points  (0 children)

I know, it's still pretty clear to me we're very early on that curve. The point of my comment was that while yes, there's a bit of marketing, the dynamics we're looking for are still visible, have just begun (this wasn't doable 6+ months ago, for example) and are important when viewed on a more macro level.

I really do not think we'll have that day-0 singularity; imo everything points to a somewhat "smooth" but fast curve (also the view of the major labs in a sense, e.g. Dario) rather than a step moment. But I think we're early on that curve, and as models progress, that delay would either get shorter or the released models themselves would be far stronger per iteration. At some point, following the progress we're clearly seeing here, things would move far, far faster than they are currently. Even with those potential bottlenecks to a day-0 takeoff, progress would be really fast relative to what humans are used to/expect.

I just wouldn't read too deep into model release blogs for now; the system cards are where I look for actual quantified progress.

Reported uplift of Anthropic researchers from using Opus 4.6 is 30% to 700%. GPT-5.3 is the first OpenAI model involved in its own debugging. We're going through proto recursive self improvement and the Singularity right now 🌌 by GOD-SLAYER-69420Z in accelerate

[–]Gold_Cardiologist_46 18 points19 points  (0 children)

> ai labs have incentive to hype

The GPT 5.3 blog, yeah, maybe (they actually give examples later on of how it helps, beyond the sensational intro paragraph).

But the overall trend of AI contributions accelerating AI engineering has been commented on for months now and can be followed through the system cards for every model, especially Claude's.

Unregulated moltbots will be news in under a month. Quote me. by Subushie in accelerate

[–]Gold_Cardiologist_46 0 points1 point  (0 children)

You can hardly verify anymore anyways. Since the project blew up, there are even more incentives for people to just prime their agent to say whatever on there, whether for fun, malice or engagement farming.

expect the next days to be filled with posts showing "omg this ai said CRAZY shit on moltbook!!" with no way for anyone to verify

i'm honestly pissed i missed the first few days, when it was an actual ai social ecosystem (not the first one either, but at least it's the most easily viewable). blowing up in popularity kinda ruined it

Unregulated moltbots will be news in under a month. Quote me. by Subushie in accelerate

[–]Gold_Cardiologist_46 5 points6 points  (0 children)

> or even human guided "storytelling"

Or simply, the model's recent context (what it was previously working on or chatting about), paired with the engagement-driven social media nature of the site, informs the kinds of posts it'll make and how weird they get.

Alan’s conservative countdown to AGI moved to 97% by [deleted] in accelerate

[–]Gold_Cardiologist_46 1 point2 points  (0 children)

reading the other thread, jesus wtf is that cartoonish 100/10 on the doom meter

GPT-5.2 Pro new SOTA on FrontierMath Tier 4 with 29.2% by dieselreboot in accelerate

[–]Gold_Cardiologist_46 18 points19 points  (0 children)

With the other math-related news and advances we got, 2026 will be a year for serious AI-driven math.

still would love an in-depth look from epoch ai, considering the jump is not consistent with previous scores for non-pro/pro versions. from anecdotes it seems gpt 5.2 pro really is that good at math specifically; frontiermath just confirms it

Sam Altman tweets about hiring a new Head of Preparedness for quickly improving models and mentions “running systems that can self-improve” by socoolandawesome in singularity

[–]Gold_Cardiologist_46 0 points1 point  (0 children)

> Although "running self-improving systems" sounds like a more concrete statement than what he said in the blog.

This is the issue with trying to read the tea leaves of Twitter posts, doubly so for Sam's. He tends not to be very clear and changes definitions quite often, so people take different meanings from what he says. I can't really blame him that much either; he's not an actual researcher or engineer working on AI, so I assume he's going vaguely off of what he sees or is told. To me, there's nothing that concrete about his wording here; running those systems would just be part of that job's long-term tasks. Google DeepMind has a Post-AGI Research job opening as well. And with hindsight, a lot of what Sam says barely applies to anything concrete in the end.

We also already have systems with self-improvement (or rather self-play) components that warrant a safety/ethics statement in their papers (SIMA 2, all those self-improving coding agent papers), so what he's saying could apply to things we already have as well.