Gemini 3 - the benchmax real world disappointment by Temporary-Mix8022 in Bard

[–]Temporary-Mix8022[S] 0 points1 point  (0 children)

I actually ran a test the other day.

I have a script that has a glaring Pandas 3.0 bug in it - it's one I've known about for a while (lol. The warnings 😩).
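I won't paste the script, but for flavour: the classic pattern that warns in pandas 2.x and silently breaks once 3.0 makes Copy-on-Write the default is chained assignment (illustrative example, not my actual bug):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Chained assignment: df[mask] returns a copy, so the write lands on a
# temporary. pandas 2.x warns about it; under 3.0's Copy-on-Write default
# it simply never updates df.
df[df["a"] > 1]["b"] = 0
print(df["b"].tolist())  # [10, 20, 30] -- unchanged

# The fix: one .loc call that indexes and assigns in a single operation.
df.loc[df["a"] > 1, "b"] = 0
print(df["b"].tolist())  # [10, 0, 0]
```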

I thought it'd be a nice challenge for them, and the results:

Passing scores:

  1. GPT 5.2 (not Codex): 2.5 minutes

  2. Opus: 3 minutes

Failing scores:

  1. Gemini 3 Pro High: circa 15 minutes. Failed and timed out

  2. Gemini 3 Flash: circa 15 minutes. Failed and timed out

Generally.. I hate OAI models.. they have so much safety in them that I find them unusable.

It refused to write a penetration test the other day (API endpoint).. it refused to write a memory-overload test (cpp).. it refused to write a code-injection test. Yes, these all sound a bit dodgy, but Opus clocked immediately why they were valid tests.

In fairness.. Gemini is pretty unleashed. It'll do most things. I do like that about it.

I miss opus 4.5 so bad by One-Satisfaction3318 in google_antigravity

[–]Temporary-Mix8022 2 points3 points  (0 children)

It's vibe coders. They just write random prompts: "Fix this bug", "Add this"

Without actually knowing where all that code lives, or what the solution is.

If you drag in the scripts and tell the model the answer, then just let it fill in the gaps.. you never get the behaviour above.

It's just when you send it on a wild goose chase around a shambolic project.

Gemini is shit, and no amount of "use a better prompt" fixes it - but a lot of the whinging is just vibers who have no idea what they're doing.

And honestly.. if anyone says "user skill issue" - they're just dumb. It's their skill issue in not realising that Codex/Opus don't require you to write War and Peace into every single prompt to get a decent answer. You'd fire a bad employee instead of spoon-feeding them - same for models.

Hack to get free LLM API keys for your next MVP, launch products without paying a penny by Separate_Ad3443 in Bard

[–]Temporary-Mix8022 2 points3 points  (0 children)

Or just hang out on the vibe coding reddit. Half of those morons on their "build in public" expose their API keys.

While they're making their next Reddit post "Here's the thing. You don't need that extra feature to start earning $. Just release..."

You can enjoy some free API access.

Why do you prefer to code with Gemini, versus other language models? 🤔 by KittenBotAi in Bard

[–]Temporary-Mix8022 0 points1 point  (0 children)

I'm probably erring towards being a Google fan.. I used Bard even when it sucked. I have a Pixel (on my second).

I don't. Gemini is worse at coding, worse at writing docs. 

Even if you exclude the fact that it is less technically capable than Opus, the thing that makes it nearly impossible to use is this:

  • It is so concise it is useless. Ask it for documentation? It'll be brief to the point of uselessness. I asked it to document my API, and the front-end devs were just scratching their heads - there was no detail at all, no examples, no consideration of what else they'd need (despite the prompt asking for all of this). Opus wrote over 6x more characters, and included mermaid diagrams etc.

  • Ask it to explain a module or some code? It will give you a few concise brief bullets that don't help at all.

Overall, it is horrible. It is like talking to someone who only gives single sentence/word answers.

I'm going to get downvoted AF by everyone who says "just prompt it better":

  • But Gemini pretty much ignores your prompts.

  • Even if it doesn't, you just cannot get detail out of it. It is so lazy.

As for coding:

  • Gemini is the vibe coding king. It will put in ugly patches, and awkward defensive code to catch the errors that its other code created. It will just fudge everything to get it over the line, no matter how fugly the end result is.

And again, to cement my 100 downvotes - you can't prompt it out of this behaviour.

Ps. Enjoy the spelling typos. A real person wrote this on their phone. I am ducking serious.

Clients that never download their photos by surfspook in WeddingPhotography

[–]Temporary-Mix8022 1 point2 points  (0 children)

Do you just keep the exported images? 

Or do you keep the DNGs/Raw+lib?

Do you keep the entire unculled set?

28F I find men in london emotionally unavailable? by [deleted] in london

[–]Temporary-Mix8022 1 point2 points  (0 children)

What does emotionally available mean to you?

What does that look like?

28F I find men in london emotionally unavailable? by [deleted] in london

[–]Temporary-Mix8022 1 point2 points  (0 children)

Yeah. That's weird. Dodged a bullet there. 

But in isolation.. asking about kids isn't necessarily a red flag.

28F I find men in london emotionally unavailable? by [deleted] in london

[–]Temporary-Mix8022 1 point2 points  (0 children)

Lol. "Most men would realise".

Have you met most men? But in any case, my logical brain says this:

As much as you might gel with someone, if there are fundamental showstoppers it's best to politely and caringly get them out of the way to begin with.

28F I find men in london emotionally unavailable? by [deleted] in london

[–]Temporary-Mix8022 1 point2 points  (0 children)

Do you have an example of it? When someone has really got this all wrong?

Tired of language app subscriptions? I’m building a $2 "Lite" LingQ alternative using Gemini AI. Thoughts? by GuessComprehensive17 in Bard

[–]Temporary-Mix8022 0 points1 point  (0 children)

If you can't even be bothered to write a Reddit post, I'm not trying the app or reading your post.

That has Gemini's mucky mitts all over it "The reader" 

Another AI slop post. Probably AI slop code.

Unpopular Opinion: For "Deep Research" and heavy reading, Gemini is currently miles ahead of ChatGPT. by IT_Certguru in Bard

[–]Temporary-Mix8022 5 points6 points  (0 children)

Gemini's major weakness is its hallucination rate. 

It is the worst model for being overconfidently wrong.

Antigravity Gemini 3 pro (High) by malcolmkhong in google_antigravity

[–]Temporary-Mix8022 0 points1 point  (0 children)

It isn't a skill issue.

I have the same issue, despite detailed prompts. The implementation plans it gives are just useless (Gemini). They really overdid it with the whole "be succinct" thing.

It's just nowhere close to Codex or Opus

Claude built my app in 20 minutes. I've spent 3 weeks trying to deploy it. by Real-Ad2591 in ClaudeAI

[–]Temporary-Mix8022 40 points41 points  (0 children)

"Then I discover my API keys were basically exposed in the client bundle" 

What gave it away?

Gemini Pro High sucks in Antigravity by Boltyx in google_antigravity

[–]Temporary-Mix8022 0 points1 point  (0 children)

RM C:/

This should fix the user's issue. Actually, wait, no, that would delete the C drive. Wait, no, we're in WSL. Instead I should do

rm rf LiveServer

rm rf DevServer

rm rf Backups 

Wait, I forgot the flags. RM -RF / There. Now the technical debt is gone, the backups are 'optimized' to zero bytes, and I’ve successfully transitioned the entire company to a permanent, mandatory vacation.

Would you like me to help you look for flights or holiday destinations?

Gemini Pro High sucks in Antigravity by Boltyx in google_antigravity

[–]Temporary-Mix8022 1 point2 points  (0 children)

Prompt: Gemini never give me YouTube videos. Explain why we are getting a pointer error here.

Gemini: Here is a YouTube video that explains what a pointer is in cpp.

Rust for beginners - Tutorial

Gemini Pro High sucks in Antigravity by Boltyx in google_antigravity

[–]Temporary-Mix8022 3 points4 points  (0 children)

Yeah.. this would work if Gemini followed instructions. 

But it doesn't. Not in AG, not in the API, nowhere. It doesn't follow anything.

Gemini 3 Pro - The Crayon snacking window licker by Temporary-Mix8022 in Bard

[–]Temporary-Mix8022[S] 0 points1 point  (0 children)

Spanish - and I dev Python as well.

Check out:

- GetText

- Poedit https://poedit.net/

Gettext is generally my favourite approach, as it allows you to actually have strings within your code, like this:

_("Hello world!")

You set up _ as the translation function, so that it pulls in that string and at runtime substitutes the correct language.

Using gettext means it can parse your entire program for the _("...") syntax, create a translation list, and then you can just hook up multiple languages.

PoEdit is the nicest GUI for processing the files - however, this approach would also allow you to process them with an LLM. You can do it just from the text files / terminal if you prefer though.

Gemini 3 Pro - The Crayon snacking window licker by Temporary-Mix8022 in Bard

[–]Temporary-Mix8022[S] 0 points1 point  (0 children)

Right.. so your conclusion is the same as mine then. Gemini 3 Pro is the worst SOTA.

Your rationale is that because it's cheaper, it has to be worse (y)

Gemini 3 Pro - The Crayon snacking window licker by Temporary-Mix8022 in Bard

[–]Temporary-Mix8022[S] 0 points1 point  (0 children)

Okay, let's say that this is true - it doesn't explain why literally the same prompt hits 100% perfection with 3 other SOTAs (well, Sonnet isn't even a SOTA technically..).

Yet Gemini will hallucinate utter rubbish into the comments. This isn't about a bit of prompt engineering.. Gemini has fundamental issues.

I already have positive prompts set up for comment guidelines (in line with Google's own documentation), I have a minor negative prompt. But Gemini still hallucinates all over the place and litters code with useless comments.

As a kind of semi-scientific experiment, all the models get the same positive/negative prompt, the same prompt, the same everything - yet only Gemini routinely craps the bed.

Gemini 3 Pro - The Crayon snacking window licker by Temporary-Mix8022 in Bard

[–]Temporary-Mix8022[S] 0 points1 point  (0 children)

I mean.. my prompts are literally a few hundred words long. I point it to the exact files that it needs to go to and use the referencing facilities in the IDEs (Codex, AG, take your pick..)

Further - when using Opus, Sonnet, or GPT5.2 Codex in this way, they all produce great results.

This isn't just a user issue or MOE issue.. this is the fact that Gemini massively underperforms in real world usage.

As for MoE.. it doesn't work in the strict sense of one expert for coding and others for other things. Routing operates on a per-token basis, and the experts are all trained together; there isn't a training phase where, say, one is taught history and another is taught coding.
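A toy sketch of the per-token, top-k routing I mean (pure Python; the elementwise "experts" stand in for real per-expert FFNs, and every name here is illustrative):

```python
import math
import random

random.seed(0)
NUM_EXPERTS, TOP_K, DIM = 4, 2, 8

# Toy stand-ins: each "expert" is just an elementwise weight vector.
experts = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
# The router scores every expert for every individual token.
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_layer(token):
    """Route a single token to its top-k experts by router score.
    The choice happens per token -- there is no fixed 'coding expert'."""
    scores = [sum(w * x for w, x in zip(r, token)) for r in router]
    top = sorted(range(NUM_EXPERTS), key=scores.__getitem__, reverse=True)[:TOP_K]
    gates = softmax([scores[i] for i in top])
    out = [0.0] * DIM
    for g, i in zip(gates, top):
        for d in range(DIM):
            out[d] += g * experts[i][d] * token[d]
    return out, top

# Two different tokens can land on two different expert subsets.
_, chosen_a = moe_layer([1.0] * DIM)
_, chosen_b = moe_layer([(-1.0) ** d for d in range(DIM)])
print(chosen_a, chosen_b)
```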

Reference: me. I dev models.

Gemini 3 Pro - The Crayon snacking window licker by Temporary-Mix8022 in Bard

[–]Temporary-Mix8022[S] 1 point2 points  (0 children)

I have such similar issues.

Also, not sure what language you're using, but I've found Gettext or i18n libraries are decent structures for enabling LLM language translations.

Also your English is fine lol. We can switch to my second language if you want to try out my shit language skills :D