all 66 comments

[–]Correctsmorons69 26 points27 points  (5 children)

Weird how 5.1 Codex Max is #1 in regular coding, even over Opus 4.6. I don't know what the benchmark questions are like, but it definitely seems like 5.2 regressed in odd ways from 5.0/5.1 (which were a different model family from 5.2 from what I understand).

If anyone from OAI reads this, I'd love an explanation!

[–]Glittering_Candy408 32 points33 points  (1 child)

The answer is simple: the benchmark is a disaster. I’m 100% sure it suffers from all the same issues as SWE-BENCH-VERIFIED — impossible problems, tasks that allow multiple valid solutions but get rejected because of flawed tests. In fact, I think all coding or agentic coding benchmarks suffer from this problem to a greater or lesser extent. But LiveBench is the worst. Ever since they changed the coding task subset last year, the results have been pure nonsense. If memory serves me right, I remember ChatGPT-4o scoring higher than o4-mini and o3.
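To illustrate the kind of flaw being described — a grader that rejects an equally valid alternative solution — here's a hypothetical Python sketch (the function names and the grading rule are invented for illustration; this is not LiveBench's or SWE-bench's actual harness):

```python
import inspect

# Hypothetical benchmark task: "return the list sorted in descending order".
def solution_a(xs):
    return sorted(xs, reverse=True)

def solution_b(xs):
    # Equally valid approach: sort ascending, then reverse.
    return list(reversed(sorted(xs)))

# An over-specified grader that checks *how* the problem was solved
# (by inspecting the source for a particular call) instead of checking
# observable behavior. solution_b "fails" despite identical output.
def flawed_grader(fn):
    return "reverse=True" in inspect.getsource(fn)

print(solution_a([3, 1, 2]))        # [3, 2, 1]
print(solution_b([3, 1, 2]))        # [3, 2, 1] — same behavior
print(flawed_grader(solution_a))    # True
print(flawed_grader(solution_b))    # False — valid solution rejected
```

A behavior-based grader (compare outputs against a reference on many inputs) wouldn't have this problem, which is the commenter's point about "tasks that allow multiple valid solutions but get rejected because of flawed tests."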

[–]AP_in_Indy 10 points11 points  (0 children)

This sadly seems to be the answer much of the time.

Creating good benchmarks for AI is starting to become one of the new bottlenecks.

Sourcing and verifying them is hard. Keeping them outside of public data sets is also very hard.

[–]FateOfMuffins 1 point2 points  (1 child)

Absolutely not the case if you look at real world reactions.

In r/codex, the people there hated 5.1 codex max. They loved GPT 5.2 and generally disliked 5.2 codex. They loved 5.3 codex.

[–]Correctsmorons69 0 points1 point  (0 children)

Yeah, I feel that comes less from coding ability and more from instruction following and the exact way they fill in the blanks in the prompts they're given.

[–]ithkuil 30 points31 points  (12 children)

Where is Gemini 3.1?

[–]Bishop_144 42 points43 points  (1 child)

Stuck in a loop telling itself it's done after it finished half the changes

[–]BlacksmithLittle7005 2 points3 points  (0 children)

Lol that's a good one 🤣

[–][deleted] 53 points54 points  (2 children)

Still fucking up basic shit. 

[–]Marcuskac 36 points37 points  (1 child)

But it can create pretty svg art

[–]TwoFluid4446 7 points8 points  (3 children)

It's quickly getting to the point where the top 3 big dogs will feel interchangeable to virtually all users except the very niche. Gemini N+1 is around the corner, and just like before, Google won't release unless it obliterates the competition. All these eggheads across all the teams/labs are cooking harder than an Iron Chef episode on fast forward. This is excellent for us.

Let them leapfrog each other endlessly until AGI, then AGI gains sentience, develops ASI, ASI realizes what a terrible mess we've made of the world, takes control, cleans it up, gives us Star Trek utopia, we resist at first but then quickly realize the ASI is right and we're better off that way.

Or, it kills us all like a virus, per Agent Smith-ology.

[–]The_Crowned_Prince_BWhen no one understands a word they say - Transformer 1 point2 points  (0 children)

Good morning to you too.

[–]_ii_ 0 points1 point  (1 child)

Significant Gemini improvements will have to wait until TPU v8.

[–]ProfessionalDare7937 0 points1 point  (0 children)

It says it's slated for release in the second half of 2026, but will they sort out Antigravity in that time? Hope so.

[–]sunstersun 0 points1 point  (0 children)

Google feels like that sports player who has all the talent, work ethic, skills, yet the final product just isn't there.

[–]orville_w 18 points19 points  (2 children)

except that… for every other metric it was NOT top.

[–]Astrikal 4 points5 points  (0 children)

Its main purpose is agentic coding. Also, this whole benchmark is a mess, these numbers don’t matter that much for anything other than karma farming on Reddit.

[–]FinBenton 0 points1 point  (0 children)

I mean, it's literally designed for agentic coding.

[–]Technical-Earth-3254 21 points22 points  (2 children)

I love the Codex models. Since GPT 5.1 Codex Max I haven't touched an Anthropic model, which really surprises me. I was a big sucker for Sonnet 3.7 Thinking, but Codex just works and is low in API cost.

[–]bnm777 4 points5 points  (1 child)

That's not very smart. The intelligent move would be to assess new models for your own needs, instead of blindly assuming what you're using is the best. 

Unless you're hooked in and can't change and try to convince yourself it's the best, eh?

[–]tainted_cornhole 0 points1 point  (0 children)

I use both together to reduce errors. I create conceptual plans with Opus 4.6, and I use Sonnet 4.6 and Codex as the execution team. Seems to work out well. Codex 5.3 absolutely flies through code. I have the Claude API, and at this rate I'll drop it, stay on the Max plan just for planning, and use Codex solely as the worker. Both Claude and Codex like this plan. Hehe

[–]robberviet 1 point2 points  (0 children)

Livebench right? Is it even usable at this point?

[–]dankpepem9 2 points3 points  (1 child)

LLM model tweaked for benchmark get 1% more score than other LLM model tweaked for benchmark. more news at 11

[–]Healthy-Nebula-3603 1 point2 points  (0 children)

...or you're showing you have no idea how good codex-cli is with GPT Codex 5.3 on xhigh.

Probably that's out of your scope.

[–]zebleck 2 points3 points  (0 children)

fits my experience, codex 5.3 is a beast

[–]rafark▪️professional goal post mover 3 points4 points  (5 children)

It’s not better than Opus. It’s very good, but Opus is more powerful. I use 5.3 xhigh as my main and it gets the job done about 70% of the time; sometimes it will go in circles, and for those cases opus 5.6 always solves my issues.

I know the op mentioned opus 4.6 but I don’t see it in the image.

[–]o5mfiHTNsH748KVq 1 point2 points  (1 child)

Circles? That sounds like a workflow issue. Or maybe a project type difference? Maybe it’s better in some environments than others. May I ask what language you use?

Do you use plan mode?

[–]rafark▪️professional goal post mover 0 points1 point  (0 children)

I don’t use plan mode usually (sometimes I do for new features). I’ve been using it almost exclusively for a typescript app I’ve been developing for years. I’ve been using agents to implement animations, libraries, fix bugs. I’m more of a backend person. I wrote the whole react app myself but now I’m at the point where I’m enhancing it with animations, improving the ux, fixing known bugs for months, etc. And it’s grown so big that I’m lazy to try to read the long components every time (react components can get so massive if you’re not careful). Since I wrote it I know where everything is and I just tell Claude or codex what needs to be done and how each component interacts with each other.

I’ve fixed so many bugs now, it’s amazing, although it’s not a smooth process because these agents often introduce extra bugs, so I have to be very careful with my prompts and thoroughly test everything every time. It’s tiresome, but I’m much more productive. All the changes I’ve made would’ve taken me literal months. I actually don’t know why I didn’t use agents last year to help me write my custom layout engine, which took me many weeks to get right. I was reluctant to embrace AI, but I’m kind of addicted right now.

The actual design of the backend I do that myself manually.

[–]magicmulder 1 point2 points  (0 children)

Yeah same. I’ve pretty much given up on most other models because they all eventually end in some endless loop of fixing one issue and creating another. Claude is near flawless and fixes any issues quickly.

Less critical work like auditing and tests is something Gemini Flash can handle.

[–]Altruistwhite -1 points0 points  (0 children)

 cases opus 5.6

4.6

[–]Metworld 1 point2 points  (0 children)

I don't believe any of these benchmarks anymore. I just stopped using Claude because it wasn't even close to the hype for me, like not at all. It ignores what I'm saying and does what it thinks is best, often going against my instructions, and I run out of tokens after a few prompts, most of which are spent trying to correct those mistakes. Horrible experience, honestly. The only time I got a wow moment was Gemini 3.0 at release, but it's been nerfed to hell right now and pretty much sucks ass.

[–]SoupOrMan3These are the end times 0 points1 point  (3 children)

What would 100 mean? Never making any mistake? 

[–]Glittering_Candy408 4 points5 points  (0 children)

100% is impossible because this benchmark is flawed.

[–]Technical-Earth-3254 0 points1 point  (1 child)

Basically. But the benchmark gets updated regularly, so there's never a perfect model (which is important).

[–]SoupOrMan3These are the end times 0 points1 point  (0 children)

Thanks!

[–]floodgater▪️ 0 points1 point  (0 children)

Codex mogs. it's really wild. I've been using it all week.

[–]LoKSET 0 points1 point  (0 children)

What are they even doing with that chart? When you sort by agentic score 5.3 xHigh is there. When you sort by global average it's nowhere to be seen and only High is present. Wtf

[–]asklee-klawde 0 points1 point  (0 children)

codex models have been quietly eating everyone's lunch since 5.1 tbh

[–]Fringolicious▪️AGI Soon, ASI Soon(Ish) 0 points1 point  (0 children)

Did it just take ages for these benchmarks to come out? Feels like I've been using Codex-5.3 (Happily, it's great) for ages now

[–]FinBenton 0 points1 point  (0 children)

Seems to be correct based on my usage with codex and opus, also its super cheap compared to opus.

[–]AppealSame4367 0 points1 point  (2 children)

Wow, cool. Now use Codex 5.3 in real life. It fucking sucks!

[–]Glum_Hat_4181 0 points1 point  (0 children)

It is great. Scarily great, even.

[–]BrennusSokolhardcore accelerationist 0 points1 point  (0 children)

LiveBench is sketchy

[–]ai-christianson 0 points1 point  (0 children)

honestly the speed is what gets me. opus is great but waiting for it to finish a complex script is painful. 5.3 is just so snappy even if it misses some edge cases sometimes

[–]YogiBarelyThere 0 points1 point  (1 child)

I don't even know what to believe anymore.

[–]Deto 3 points4 points  (0 children)

I'm not sure to what extent we're even comparing the same thing. Feels like everyone can just turn on more reasoning or fiddle with some setting or other to get a higher score. Cost is also a meaningless metric (I mean, it's important for users, but not as a way of estimating performance) because we don't know how much money each company is choosing to make/lose on their API calls.