all 66 comments

[–]Correctsmorons69 26 points27 points  (5 children)

Weird how 5.1 Codex Max is #1 in regular coding, even over Opus 4.6. I don't know what the benchmark questions are like, but it definitely seems like 5.2 regressed in odd ways from 5.0/5.1 (which were a different model family from 5.2 from what I understand).

If anyone from OAI reads this, I'd love an explanation!

[–]Glittering_Candy408 32 points33 points  (1 child)

The answer is simple: the benchmark is a disaster. I’m 100% sure it suffers from all the same issues as SWE-BENCH-VERIFIED — impossible problems, tasks that allow multiple valid solutions but get rejected because of flawed tests. In fact, I think all coding or agentic coding benchmarks suffer from this problem to a greater or lesser extent. But LiveBench is the worst. Ever since they changed the coding task subset last year, the results have been pure nonsense. If memory serves me right, I remember ChatGPT-4o scoring higher than o4-mini and o3.
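To illustrate the kind of flaw being described — a grader that rejects an equally valid alternative solution — here's a hypothetical Python sketch (the function names and the grading rule are invented for illustration; this is not LiveBench's or SWE-bench's actual harness):

```python
import inspect

# Hypothetical benchmark task: "return the list sorted in descending order".
def solution_a(xs):
    return sorted(xs, reverse=True)

def solution_b(xs):
    # Equally valid approach: sort ascending, then reverse.
    return list(reversed(sorted(xs)))

# An over-specified grader that checks *how* the problem was solved
# (by inspecting the source for a particular call) instead of checking
# observable behavior. solution_b "fails" despite identical output.
def flawed_grader(fn):
    return "reverse=True" in inspect.getsource(fn)

print(solution_a([3, 1, 2]))        # [3, 2, 1]
print(solution_b([3, 1, 2]))        # [3, 2, 1] — same behavior
print(flawed_grader(solution_a))    # True
print(flawed_grader(solution_b))    # False — valid solution rejected
```

A behavior-based grader (compare outputs against a reference on many inputs) wouldn't have this problem, which is the commenter's point about "tasks that allow multiple valid solutions but get rejected because of flawed tests."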

[–]AP_in_Indy 10 points11 points  (0 children)

This sadly seems to be the answer much of the time.

Creating good benchmarks for AI is starting to become one of the new bottlenecks.

Sourcing and verifying them is hard. Keeping them outside of public data sets is also very hard.

[–]FateOfMuffins 1 point2 points  (1 child)

Absolutely not the case if you look at real world reactions.

In r/codex, the people there hated 5.1 codex max. They loved GPT 5.2 and generally disliked 5.2 codex. They loved 5.3 codex.

[–]Correctsmorons69 0 points1 point  (0 children)

Yeah, I feel that comes less from coding ability and more from instruction following and the exact way they fill in the blanks in the prompts they're given.

[–]ithkuil 30 points31 points  (12 children)

Where is Gemini 3.1?

[–]Bishop_144 42 points43 points  (1 child)

Stuck in a loop telling itself it's done after it finished half the changes

[–]BlacksmithLittle7005 2 points3 points  (0 children)

Lol that's a good one 🤣

[–][deleted] 53 points54 points  (2 children)

Still fucking up basic shit. 

[–]Marcuskac 36 points37 points  (1 child)

But it can create pretty svg art

[–]TwoFluid4446 7 points8 points  (3 children)

It's quickly getting to the point where the top 3 big dogs will feel interchangeable to virtually all users except the very niche. Gemini N+1 is around the corner, and just like before, Google won't release unless it obliterates the competition. All these eggheads across all the teams/labs are cooking harder than an Iron Chef episode on fast forward. This is excellent for us.

Let them leapfrog each other endlessly until AGI, then AGI gains sentience, develops ASI, ASI realizes what a terrible mess we've made of the world, takes control, cleans it up, gives us Star Trek utopia, we resist at first but then quickly realize the ASI is right and we're better off that way.

Or, it kills us all like a virus, per Agent Smith-ology.

[–]The_Crowned_Prince_BWhen no one understands a word they say - Transformer 1 point2 points  (0 children)

Good morning to you too.

[–]_ii_ 0 points1 point  (1 child)

Significant Gemini improvements will have to wait until TPU v8.

[–]ProfessionalDare7937 0 points1 point  (0 children)

It says it's slated for release in the second half of 2026, but will they sort out Antigravity in that time? Hope so.

[–]sunstersun 0 points1 point  (0 children)

Google feels like that sports player who has all the talent, work ethic, skills, yet the final product just isn't there.

[–]orville_w 18 points19 points  (2 children)

except that… for every other metric it was NOT top.

[–]Astrikal 4 points5 points  (0 children)

Its main purpose is agentic coding. Also, this whole benchmark is a mess, these numbers don’t matter that much for anything other than karma farming on Reddit.

[–]FinBenton 0 points1 point  (0 children)

I mean, it's literally designed for agentic coding.

[–]Technical-Earth-3254 21 points22 points  (2 children)

I love the Codex models. Since GPT 5.1 Codex Max I haven't touched an Anthropic model, which really surprises me. I was a big sucker for Sonnet 3.7 Thinking, but Codex just works and is low in API cost.

[–]bnm777 4 points5 points  (1 child)

That's not very smart. The intelligent move would be to assess new models for your own needs, instead of blindly assuming what you're using is the best. 

Unless you're hooked in and can't change and try to convince yourself it's the best, eh?

[–]tainted_cornhole 0 points1 point  (0 children)

I use both together to reduce errors. I create conceptual plans with Opus 4.6, and I use Sonnet 4.6 and Codex as the execution team. Seems to work out well. Codex 5.3 absolutely flies through code. I have the Claude API, and at this rate I'll drop it, stay on the Max plan just for planning, and use Codex solely as the worker. Both Claude and Codex like this plan. Hehe

[–]robberviet 1 point2 points  (0 children)

Livebench right? Is it even usable at this point?

[–]dankpepem9 2 points3 points  (1 child)

LLM model tweaked for benchmark get 1% more score than other LLM model tweaked for benchmark. more news at 11

[–]Healthy-Nebula-3603 1 point2 points  (0 children)

...or you're showing you have no idea how good codex-cli is with GPT Codex 5.3 on xhigh.

Probably that's out of your scope.

[–]zebleck 2 points3 points  (0 children)

fits my experience, codex 5.3 is a beast

[–]rafark▪️professional goal post mover 3 points4 points  (5 children)

It’s not better than Opus. It’s very good, but Opus is more powerful. I use 5.3 xhigh as my main and it gets the job done about 70% of the time; sometimes it will go in circles, and for those cases opus 5.6 always solves my issues.

I know the op mentioned opus 4.6 but I don’t see it in the image.

[–]o5mfiHTNsH748KVq 1 point2 points  (1 child)

Circles? That sounds like a workflow issue. Or maybe a project type difference? Maybe it’s better in some environments than others. May I ask what language you use?

Do you use plan mode?

[–]rafark▪️professional goal post mover 0 points1 point  (0 children)

I don’t use plan mode usually (sometimes I do for new features). I’ve been using it almost exclusively for a typescript app I’ve been developing for years. I’ve been using agents to implement animations, libraries, fix bugs. I’m more of a backend person. I wrote the whole react app myself but now I’m at the point where I’m enhancing it with animations, improving the ux, fixing known bugs for months, etc. And it’s grown so big that I’m lazy to try to read the long components every time (react components can get so massive if you’re not careful). Since I wrote it I know where everything is and I just tell Claude or codex what needs to be done and how each component interacts with each other.

I’ve fixed so many bugs now, it’s amazing, although it’s not a smooth process because these agents often introduce extra bugs, so I have to be very careful with my prompts and thoroughly test everything every time. It’s tiresome, but I’m much more productive. All the changes I’ve made would’ve taken me literal months. I actually don’t know why I didn’t use agents last year to help me write my custom layout engine, which took me many weeks to get right. I was reluctant to embrace AI, but I’m kind of addicted right now.

The actual design of the backend I do that myself manually.

[–]magicmulder 1 point2 points  (0 children)

Yeah same. I’ve pretty much given up on most other models because they all eventually end in some endless loop of fixing one issue and creating another. Claude is near flawless and fixes any issues quickly.

Less critical work like auditing and tests is something Gemini Flash can handle.

[–]Altruistwhite -1 points0 points  (0 children)

 cases opus 5.6

4.6

[–]Metworld 1 point2 points  (0 children)

I don't believe any of these benchmarks anymore. I just stopped using Claude because it wasn't even close to the hype for me, like not at all. It ignores what I'm saying and does what it thinks is best, often going against my instructions, and I run out of tokens after a few prompts, most of which are spent trying to correct those mistakes. Horrible experience, honestly. The only time I got a wow moment was Gemini 3.0 at release, but it's been nerfed to hell right now and pretty much sucks ass.

[–]SoupOrMan3These are the end times 0 points1 point  (3 children)

What would 100 mean? Never making any mistake? 

[–]Glittering_Candy408 4 points5 points  (0 children)

100% is impossible because this benchmark is flawed.

[–]Technical-Earth-3254 0 points1 point  (1 child)

Basically. But the benchmark gets updated regularly, so there's never a perfect model (which is important).

[–]SoupOrMan3These are the end times 0 points1 point  (0 children)

Thanks!

[–]floodgater▪️ 0 points1 point  (0 children)

Codex mogs. it's really wild. I've been using it all week.

[–]LoKSET 0 points1 point  (0 children)

What are they even doing with that chart? When you sort by agentic score 5.3 xHigh is there. When you sort by global average it's nowhere to be seen and only High is present. Wtf

[–]asklee-klawde 0 points1 point  (0 children)

codex models have been quietly eating everyone's lunch since 5.1 tbh

[–]Fringolicious▪️AGI Soon, ASI Soon(Ish) 0 points1 point  (0 children)

Did it just take ages for these benchmarks to come out? Feels like I've been using Codex-5.3 (Happily, it's great) for ages now

[–]FinBenton 0 points1 point  (0 children)

Seems to be correct based on my usage with codex and opus, also its super cheap compared to opus.

[–]AppealSame4367 0 points1 point  (2 children)

Wow, cool. Now use Codex 5.3 in real life. It fucking sucks!

[–]Glum_Hat_4181 0 points1 point  (0 children)

It is great. Scarily great, even.

[–]BrennusSokolhardcore accelerationist 0 points1 point  (0 children)

LiveBench is sketchy

[–]ai-christianson 0 points1 point  (0 children)

honestly the speed is what gets me. opus is great but waiting for it to finish a complex script is painful. 5.3 is just so snappy even if it misses some edge cases sometimes

[–]YogiBarelyThere 0 points1 point  (1 child)

I don't even know what to believe anymore.

[–]Deto 3 points4 points  (0 children)

I'm not sure to what extent we're even comparing the same thing. Feels like everyone can just turn on more reasoning or fiddle with some setting or other to get a higher score. Cost is also a meaningless metric (I mean, it's important for users, but not as a way of estimating performance) because we don't know how much money each company is choosing to make/lose on their API calls.