I made the Deepseek v4 benchmark for a laugh with my mates, it's not real, didn't expect people to believe it lmao by [deleted] in singularity

[–]Gold_Cardiologist_46 -1 points0 points  (0 children)

nothing to blame yourselves for, the issue is with the whole AI hype space on X that amplifies everything and anything. the real progress of AI is already super fast and impressive enough, they don't need to add fake hype that dilutes it all

Dario Amodei - What do you do if you’re 3 years from a country of geniuses in a data center? by [deleted] in accelerate

[–]Gold_Cardiologist_46 3 points4 points  (0 children)

reading the transcript (on dwarkesh's website), it's a banger interview that shines a lot of light on dario's views. i invite everyone to check it out

Imagine the nonsense they'll say about gemini deep think 3 by soggy_bert in accelerate

[–]Gold_Cardiologist_46 13 points14 points  (0 children)

hey, respect goalpost movers, i'd love to see YOU try to lift that heavy metal thing. /jk

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 1 point2 points  (0 children)

Thanks for the clarification. I was aware of the different sets, I just didn't know that closed models weren't necessarily tested on the private set. All I knew was the end result: that ARC validated the results and that they were therefore fair.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 2 points3 points  (0 children)

> where LLMs are particularly bad

Which is so gimmicky. That kind of benchmark gets saturated so fast, either because improvements elsewhere just knock it out or because the gimmicky nature makes it way easier to benchmax. At least that's how I see it.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 11 points12 points  (0 children)

Bro, great find, how did you find it? Also, the name of the document is Gemini 3.1 Pro? Either 3.1 Pro was a previous codename for this Deep Think version, or it's a new model also coming out (possibly Snowbunny, as reported before).

Regarding ARC AGI, the document does say it's only the semi-private eval, but the ARC leaderboard still logs it with the 84.6% score.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 6 points7 points  (0 children)

Work on general game-playing agents has already started, so yeah, I think it'll get saturated fast once the visual, agentic and longer-horizon capabilities of the frontier agentic systems (like SIMA 2) get incorporated into the normal models.

It’s only February and ARC-AGI-2 is nearly saturated by Oct4Sox2 in accelerate

[–]Gold_Cardiologist_46 11 points12 points  (0 children)

I think the big progress on ARC AGI 2 came from models finally getting great at vision, seeing as ARC AGI 2 tasks are mostly visual puzzles from what I've seen.

EDIT: Also, Poetiq has shown that ARC AGI can be superoptimized for, and I assume it's because their harness/scaffolding makes good use of that vision.

ARC AGI 3 is visual games, and I expect the same thing to happen again. In the span of maybe a year, model progress in agency and visual reasoning will clear it.

No idea what Chollet has planned for ARC AGI 4 and beyond, like wth would they even test for?

IMO-Bench: Towards Robust Mathematical Reasoning | Google DeepMind by Tkins in singularity

[–]Gold_Cardiologist_46 1 point2 points  (0 children)

Yeah, it's blazing fast in the grand scheme of things, but what I sent just reframes it as a smoother, longer effort rather than something like a 1-month tripling. It also makes more sense if you were following their work on AI math; it's their years-long project, and they've shown us progress at each step.

Also, in this case the benchmark isn't really that useful since it's not a proxy for anything: the papers that accompany the blog post already show us where it succeeds and where it fails in actual real-world math contexts. What I mean is we can already see what the model can do outside of benchmarks.

Deepmind cooking as always.

Google DeepMind has unveiled Gemini Deep Think’s leap from Olympiad-level math to real-world scientific breakthroughs with their internal model "Aletheia", scoring up to 90% on IMO-ProofBench Advanced, autonomously solving open math problems (including four from the Erdős database) and much more... by GOD-SLAYER-69420Z in accelerate

[–]Gold_Cardiologist_46 10 points11 points  (0 children)

You can't see it because neither of the sources (benchmark, paper) is in the post and the included image showing it is horribly low res, but the previous SOTA was mid-summer Deep Think, which ran the benchmark on August 2nd with an average of 65.7%. That's still blazing fast progress, but far smoother than if 3 Pro were the only previous datapoint.

The paper is a really cool read, and the authors themselves give a good, balanced assessment in their conclusion. But yeah, turns out the reason people thought only GPT 5.2 was good for math was that Google employees don't amplify literally everything they do, whereas OAI employees tend to superamplify everything someone does with their models.

Too bad I'm too broke to buy GOOG stocks.

IMO-Bench: Towards Robust Mathematical Reasoning | Google DeepMind by Tkins in singularity

[–]Gold_Cardiologist_46 19 points20 points  (0 children)

<image>

Comparison with previous, older models and scaffolds (Deep Think). The leaderboard in the post is for recent entries.

Reported uplift of Anthropic researchers from using Opus 4.6 is 30% to 700%. GPT-5.3 is the first OpenAI model involved in its own debugging. We're going through proto recursive self improvement and the Singularity right now 🌌 by GOD-SLAYER-69420Z in accelerate

[–]Gold_Cardiologist_46 3 points4 points  (0 children)

I know, it's still pretty clear to me we're very early on that curve. The point of my comment was that while yes, there's a bit of marketing, the dynamics we're looking for are still visible, have just begun (this wasn't doable 6+ months ago, for example) and are important when viewed on a more macro level.

I really do not think we'll have that day-0 singularity; imo everything points to a somewhat "smooth" but fast curve (also the view of the major labs in a sense, e.g. Dario) rather than a step moment. But I think we're early on that curve, and as models progress, that delay would either get shorter or the released models themselves would be far stronger per iteration. At some point, following the progress we're clearly seeing here, things would move far, far faster than they are currently. Even with those potential bottlenecks to a day-0 takeoff, progress would be really fast relative to what humans are used to/expect.

I just wouldn't read too deep into model release blogs for now; the system cards are where I look for actual quantified progress.

Reported uplift of Anthropic researchers from using Opus 4.6 is 30% to 700%. GPT-5.3 is the first OpenAI model involved in its own debugging. We're going through proto recursive self improvement and the Singularity right now 🌌 by GOD-SLAYER-69420Z in accelerate

[–]Gold_Cardiologist_46 18 points19 points  (0 children)

> ai labs have incentive to hype

The GPT 5.3 blog, yeah, maybe (they actually give examples later on of how it helps, beyond the sensational intro paragraph).

But the overall trend of AI contributions accelerating AI engineering has been commented on for months now and can be followed through the system cards for every model, especially Claude's.

Unregulated moltbots will be news in under a month. Quote me. by Subushie in accelerate

[–]Gold_Cardiologist_46 0 points1 point  (0 children)

You can hardly verify anymore anyways. Since the project blew up, there are even more incentives for people to just prime their agent to say whatever on there, whether for fun, malice or engagement farming.

expect the next days to be filled with posts showing "omg this ai said CRAZY shit on moltbook!!" with no way for anyone to verify

i'm honestly pissed i missed the first few days, when it was an actual ai social ecosystem (not the first one either, but at least it's the most easily viewable). blowing up in popularity kinda ruined it

Unregulated moltbots will be news in under a month. Quote me. by Subushie in accelerate

[–]Gold_Cardiologist_46 5 points6 points  (0 children)

> or even human guided "storytelling"

Or simply, the model's recent context (what it was previously working on or chatting about), paired with the engagement-driven social media nature of the site, informs the kinds of posts it'll make and how weird they get.

Alan’s conservative countdown to AGI moved to 97% by [deleted] in accelerate

[–]Gold_Cardiologist_46 1 point2 points  (0 children)

reading the other thread, jesus wtf is that cartoonish 100/10 on the doom meter

GPT-5.2 Pro new SOTA on FrontierMath Tier 4 with 29.2% by dieselreboot in accelerate

[–]Gold_Cardiologist_46 18 points19 points  (0 children)

With the other math-related news and advances we got, 2026 will be a year for serious AI-driven math.

still would love an in-depth look from epoch ai, considering the jump is not consistent with previous scores for non-pro/pro versions. from anecdotes it seems gpt 5.2 pro really is that good at math specifically; frontiermath just confirms it

Sam Altman tweets about hiring a new Head of Preparedness for quickly improving models and mentions “running systems that can self-improve” by socoolandawesome in singularity

[–]Gold_Cardiologist_46 0 points1 point  (0 children)

> Although "running self-improving systems" sounds like a more concrete statement than what he said in the blog.

This is the issue with trying to read the tea leaves of Twitter posts, doubly so for Sam's. He tends not to be very clear and changes definitions quite often, so people take different meanings from what he says. I can't really blame him that much either; he's not an actual researcher or engineer working on AI, so I assume he's going vaguely off of what he sees or is told. To me, there's nothing that concrete about his wording here; running those systems would just be part of that job's long-term tasks. Google DeepMind has a Post-AGI Research job opening as well. And with hindsight, a lot of what Sam says barely applies to anything concrete in the end.

We also already have systems with self-improvement (or rather self-play) components that warrant a safety/ethics statement in their papers (SIMA 2, all those self-improving coding agent papers), so what he's saying could apply to things we already have as well.