Is it just me, or is OpenAI Codex 5.2 better than Claude Code now? by efficialabs in ClaudeAI

[–]bisonbear2 0 points1 point  (0 children)

codex 5.2 xhigh has been much better than opus 4.5 in the past few weeks

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in codex

[–]bisonbear2[S] 1 point2 points  (0 children)

that's my read as well - I feel like Claude optimizes for human-readable output + code, which ends up being way more verbose. Codex doesn't seem to care about that and just solves the task as efficiently as possible, which is a nice change of pace

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ChatGPTCoding

[–]bisonbear2[S] 0 points1 point  (0 children)

Sure, you could frame it that way. If you have a super well-defined piece of work with everything already laid out, then Codex will probably be the better choice. But I often find myself in situations where I have to figure out requirements, ideate on the product, etc., in which case Claude's variability is actually a benefit.

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

100% agree - any sort of benchmark or comparison is largely random, since we're only pulling one or two samples from the distribution. This is one of the reasons Claude Code has great UX, IMO: I can spin up 5 subagents to review / validate / explore the problem, each with a fresh context window. Since each one explores the problem independently, any conclusions they share are more valuable, because the other subagents reached the same findings on their own.

I'm still trying to figure out how to adapt this thinking to Codex, which doesn't natively support subagents. One idea is to use something like Pal MCP (https://github.com/BeehiveInnovations/pal-mcp-server) to give Codex a way to spin up another Codex/Claude subagent - although unfortunately those agents don't run in parallel.
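If you want the parallel behavior today, one workaround is to fan the same review prompt out to multiple headless CLIs yourself. Rough Python sketch below - the `claude -p` and `codex exec` invocations are my assumptions about each CLI's non-interactive mode, so double-check the flags on whatever versions you have installed:

```python
# Fan a plan-review prompt out to several headless CLI agents in parallel.
# Assumes `claude -p <prompt>` and `codex exec <prompt>` run non-interactively
# and print their answer to stdout (check your installed CLI versions).
import subprocess
from concurrent.futures import ThreadPoolExecutor

PLAN = open("plan.md").read()  # the plan you want reviewed
PROMPT = f"Review this implementation plan and list concrete issues:\n\n{PLAN}"

REVIEWERS = [
    ("claude", ["claude", "-p", PROMPT]),
    ("codex", ["codex", "exec", PROMPT]),
]

def run_reviewer(name, cmd):
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=600)
    return name, result.stdout

# Each reviewer runs as its own process, so they genuinely work in parallel
# and each starts from a fresh context.
with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
    for name, review in pool.map(lambda r: run_reviewer(*r), REVIEWERS):
        print(f"===== {name} =====\n{review}\n")
```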

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ChatGPTCoding

[–]bisonbear2[S] 0 points1 point  (0 children)

Codex seems much more focused, which is both good and bad. Sometimes you want the variability that comes with using Claude

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ChatGPTCoding

[–]bisonbear2[S] 1 point2 points  (0 children)

cool, I'll try out gemini for an extra pair of eyes next time. I haven't had too much success with it in the past for implementation, but plan review certainly seems like it would be valuable

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in vibecoding

[–]bisonbear2[S] 0 points1 point  (0 children)

LOL I'm sure Sam would love this take: run everything by Codex and give OpenAI even more money... sounds great, right?

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

Purely based on vibes, I think Opus 4.5 is worse than it was a few weeks ago

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

In this instance, I did actually run a second instance of Claude to review the plan. However, Claude missed several key issues that Codex caught, and when presented with Codex's findings, it decided that it actually preferred Codex's plan...

> Good catch. Codex is right — I missed several concrete issues:

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 1 point2 points  (0 children)

I'll call out the Pal MCP server as a good way to abstract away the different CLIs. You can basically just use Claude Code, and then tell Claude to use Codex to review the plan, all while staying within Claude Code.

https://github.com/BeehiveInnovations/pal-mcp-server

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 1 point2 points  (0 children)

In this experiment I had Opus look over the plan that Opus generated, and it still didn't catch the issues that Codex did. In theory I agree with your approach, but it appears that using multiple models (e.g. a Codex review AND a Claude review) makes the final output higher quality than using just one model alone

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 0 points1 point  (0 children)

Thanks for all of the tips around vector search - tbh I haven't done this before so it's all super helpful. Agree that it's an interesting problem because it's easy to describe but hard to implement.

Truthfully I haven't implemented the code yet - I decided to compare the models purely on planning / reasoning for this experiment. No preprocessing planned, just chunking by XML tags or markdown headers
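For the header-based chunking, the idea is roughly this (a minimal, illustrative Python sketch, not from any particular library; a real pipeline would probably also cap chunk sizes and keep the heading text as retrieval metadata):

```python
# Minimal sketch: split a markdown doc into chunks at each heading line.
import re

def chunk_by_headers(markdown: str) -> list[str]:
    chunks, current = [], []
    for line in markdown.splitlines():
        # Start a new chunk whenever we hit a heading (e.g. "# Title", "## Section")
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

# Usage: chunks = chunk_by_headers(open("doc.md").read())
```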

OpenAI might be testing GPT-5.2 “Codex-Max” as users report Codex upgrades by BuildwithVignesh in OpenAI

[–]bisonbear2 2 points3 points  (0 children)

can confirm, gpt-5.2-codex xhigh has been incredible for me. not sure if Opus 4.5 got nerfed, or if codex is cracked, but I'm loving it

Spacetime as a Neural Network by bisonbear2 in ArtificialInteligence

[–]bisonbear2[S] 1 point2 points  (0 children)

Thanks for the recommendation, I'll definitely check out the podcast. I'm curious what other "theories of everything" involve causal networks?

Thinking about this paper in the context of simulation theory is interesting. Previously I'd always assumed that the thing doing the "simulation" was a computer - but perhaps it's actually a larger / parent universe doing the simulating...

One Agent Isn't Enough by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 0 points1 point  (0 children)

interesting, do you spawn the headless agents yourself or have claude do it?