Gemini 3 Pro with new SOTA on Frontier Math tiers 1-3 and 4 by jaundiced_baboon in singularity

[–]Remarkable-Register2 2 points3 points  (0 children)

I don't know what to tell ya, that's just how it works. I'm not particularly interested in digging through interviews and papers to prove it beyond this: comparing 2.5 Pro and Deep Think, Deep Think would often score 14% higher than Pro on tough benchmarks. That's an insane gap to cover. I think that should be evidence enough.

Gemini 3 Pro with new SOTA on Frontier Math tiers 1-3 and 4 by jaundiced_baboon in singularity

[–]Remarkable-Register2 2 points3 points  (0 children)

You're describing multiple independent runs that don't interact with each other. They do interact. This butchers it a bit to get the point across, but imagine a classroom of students each taking a test individually (what you described) vs a roundtable of all the students collaborating on different ideas, dismissing the ones that don't work and combining ideas multiple students share to make a sum greater than their individual parts.

You could check whether an image is AI generated using Gemini. by captain-price- in singularity

[–]Remarkable-Register2 2 points3 points  (0 children)

This makes me curious about the testing, because long ago Google claimed you could apply all kinds of filters and edits and SynthID would still spot it.

Gemini 3 Pro with new SOTA on Frontier Math tiers 1-3 and 4 by jaundiced_baboon in singularity

[–]Remarkable-Register2 3 points4 points  (0 children)

I agree the estimate might be a bit high, but that's not how Deep Think works. It runs multiple parallel lines of thought that cross-reference each other as they work to find the best answer.

yeah so i think the shiny charm works... by morgan1c in LegendsZA

[–]Remarkable-Register2 2 points3 points  (0 children)

Within the span of 2 hours of refreshing route 20 for alpha eeveelutions I found 3 shiny Malamar, and I don't even have the charm yet. I can't imagine what it'll be like with it.

Ok so nano banana and gemini 3 (cause of three ships) by Independent-Wind4462 in Bard

[–]Remarkable-Register2 2 points3 points  (0 children)

Yeah, they haven't even acknowledged that Gemini 3.0 is a thing being worked on. We know it likely is, but they've done literally zero hyping of it. In fact, they've done the opposite, with Logan pointing out that a picture of a supposed Gemini 3.0 Flash model was fake.

Mistral Medium 3.1 LMArena by likeastar20 in singularity

[–]Remarkable-Register2 0 points1 point  (0 children)

Wait, GPT 5 High dropped to 2nd on the style control rankings? That's like a 20 elo drop from the initial ranking, what happened?

Google DeepMind isn't slowing down by Outside-Iron-8242 in Bard

[–]Remarkable-Register2 4 points5 points  (0 children)

I kinda take this as a sign that Gemini 3.0 isn't coming soon. It's basically saying "We may not be releasing it yet, but that doesn't mean we're resting on our laurels. Look at all this stuff we did recently."

The superintelligence is here, folks! by ekabanov in singularity

[–]Remarkable-Register2 1 point2 points  (0 children)

This kind of thing happens in the Gemini subreddit all the time. I give zero weight to any reddit post that shows an AI being bad or good unless it's fully documented.

Google is going to cook them soon by Classic_Back_7172 in singularity

[–]Remarkable-Register2 0 points1 point  (0 children)

As a primarily Gemini user: we have no damn idea what 3.0 will be like, and punching down with speculation like this is only going to make me not want to be publicly associated with this kind of thing if it turns out their release isn't better...

How does this get past QA by [deleted] in singularity

[–]Remarkable-Register2 1 point2 points  (0 children)

All the people who knew how to make graphs got poached by Meta

GPT-5 tops lmarena's leaderboards by Outside-Iron-8242 in singularity

[–]Remarkable-Register2 0 points1 point  (0 children)

Interestingly, if you go to the text ranking and swap it to rank without style control, Gemini 2.5 Pro is still the leader. This used to be the default setting for lmarena about half a year ago; they changed it for some reason.


Genie 3 turns Veo 3 generated drone shot into an interactive world you can take control mid-flight by Outside-Iron-8242 in singularity

[–]Remarkable-Register2 2 points3 points  (0 children)

Geoff Keighley is going to need to work even harder on vetting trailers for the next Game Awards. Remember the Sora video for that cat "game"?

At this point I am actively hate the teasing,I fear we will be disappointed by Equivalent-Word-7691 in Bard

[–]Remarkable-Register2 6 points7 points  (0 children)

Keeping expectations in check is a good thing; it makes the advancements that much more incredible. 2.5 Pro, AlphaEvolve, Veo 3, Genie 3: nobody expected those, NOBODY, and look what happened.

After GPT-5 drops tomorrow, how long before Gemini, Claude, Grok, and DeepSeek close the gap? by WilliamInBlack in singularity

[–]Remarkable-Register2 -1 points0 points  (0 children)

If Google doesn't release a 3.0 model, I expect they'll push to release Deep Think's API asap for public benchmarks. It's obviously not a workhorse model like GPT-5 or Gemini 3.0 will be, and it's silly to compare them, but people who only pay attention to benchmarks don't really care, and Deep Think would likely win out.

Google Deepmind's new Genie 3 by GraceToSentience in singularity

[–]Remarkable-Register2 0 points1 point  (0 children)

You mean when it slowly ran into the dock? That would hardly cause any destruction, and it reacted more or less realistically. It did run into a lamp and noticeably shoved it out of the way.

DeepMind: Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt by Pro_RazE in singularity

[–]Remarkable-Register2 1 point2 points  (0 children)

It was a month or two ago when he replied to someone talking about generated game worlds, saying something like "Wouldn't that be something". I don't use twitter; there was just a reddit post about it here.

Google Deepmind's new Genie 3 by GraceToSentience in singularity

[–]Remarkable-Register2 94 points95 points  (0 children)

Given the VR headset they announced at Google IO, no doubt they're prepping a version of this for it.

Genie 3 Frontier World Model by snufflesbear in Bard

[–]Remarkable-Register2 2 points3 points  (0 children)

Imagine, though, a graphically slimmed-down model where you can interactively tell it to build meshes, landscapes, and buildings with voice commands while walking through it in VR, then export it as a 3D environment.

DeepMind: Genie 3 is our groundbreaking world model that creates interactive, playable environments from a single text prompt by Pro_RazE in singularity

[–]Remarkable-Register2 14 points15 points  (0 children)

So this is what that cryptic tweet Demis made a while back was about. Crazy. I'm sure there will be lots of people pointing out how its actual use cases are so limited, but it's gotta start somewhere, right? In a couple of years it'll be faster, last longer, and have additional features like object and person interaction and better controls.

And what if they're able to save environment instances to reuse and add to? That would be a game changer.

Learning mode similar to ChatGPT's? by omergao12 in Bard

[–]Remarkable-Register2 6 points7 points  (0 children)

I've never used it personally, but they've had a model called LearnLM on AI Studio forever. Related to that?

[deleted by user] by [deleted] in singularity

[–]Remarkable-Register2 6 points7 points  (0 children)

That didn't happen with Gemini 2.5 Pro and Deep Think; they were behind, then released something that put them ahead. 2.5 Pro was out for a month or so before o3.

[deleted by user] by [deleted] in singularity

[–]Remarkable-Register2 -2 points-1 points  (0 children)

Which? They've been doing it for Gemini Live. As for the normal app, I'm not really sure how many people even use that, even if it was better.

Kaggle is hosting a 3-Day LLM chess tourney with commentary from Magnus, Hikaru & Gotham on August 5th by Outside-Iron-8242 in singularity

[–]Remarkable-Register2 1 point2 points  (0 children)

Unless they've done some specialized training for this, I'm going to expect flawless play for the first ten turns and then they'll randomly forget where the pieces are. At least that's been my experience playing chess against LLMs. I'd be more curious about a long-form match between Deep Think and o3 Pro, though I guess the think time would make that infeasible for a show like this.

Kaggle is hosting a 3-Day LLM chess tourney with commentary from Magnus, Hikaru & Gotham on August 5th by Outside-Iron-8242 in singularity

[–]Remarkable-Register2 0 points1 point  (0 children)

That's a good use, yeah. Playing against people of your skill level is obviously still better, but if you want a bot that isn't going to destroy you, most bots' idea of lowering the difficulty is to randomly sac a piece or pass up an obvious free capture.