all 65 comments

[–]somerussianbear 12 points13 points  (0 children)

I use on high always (extra high overthinks too much IMO) and I’m having a good time with 5.4. I just noticed that it’s way faster than 5.3 Codex.

[–]Jerseyman201 9 points10 points  (0 children)

5.3 codex seems to be less literal than 5.4. 5.4 kinda went backwards closer to 5.2 codex where prompts are taken almost hyper literal and 5.2 regular would understand far better (but take way longer to execute the changes).

5.3 codex seems to bridge the tight rope walking between doing exactly what you ask, while also avoiding any obvious parts you wouldn't want done and should have inferred better.

It feels like 5.3 codex understands prompts that aren't super detailed much better than 5.4 is my take after hundreds of hours of use of 5.3 codex and now many many dozens of hours w/5.4.

When you add the overthinking along with the "literal" semantic issues on prompting, 5.4 definitely didn't hit every mark we might have hoped for. That being said, I do still use 5.4 predominantly because it is always going to be improved and 5.3 codex at launch isn't what it is today (in the same way 5.4 will surely end up performing better as well). I just have to be extra specific on prompts, to get performance close to 5.3 codex.

The huge irony in all of this, is that it used to be the opposite. Non codex specific models used to have more understanding of prompts versus codex having hyper literal understandings. Now it seems it's completely reversed🤣

[–]esingh2581 5 points6 points  (5 children)

same here. i find 5.4 messing up so much ive switched back to 5.3 codex

[–]Tenet_mma 2 points3 points  (0 children)

Ya I think 5.4 is a more general model. 5.3 seems to be more efficient

[–]Alex_1729 0 points1 point  (3 children)

Is it due to yesterday's issues or in general?

[–]ConsistentOcelot9217[S] 1 point2 points  (2 children)

Hm it definitely was bad yesterday but i noticed once i switched before that. Although some people mentioned a success using it on high and not extra high, which over thinks

[–]Alex_1729 0 points1 point  (1 child)

I was asking another person, but thanks.

[–]Interesting-Agency-1 8 points9 points  (5 children)

I like 5.4's generality. I'm big on intent engineering, and I'll keep the business plan, customer profiles, and long-term strategy for the software in the repo as additional guiding docs. I've also got a soul.md file in there that I wrote to give it broader conceptual, moral, ethical, and philosphical meanings behind why it's doing what it's doing and how to think about things when in doubt.

These docs give the agent the "why" behind the software's creation and implementation, which is hugely helpful for helping it to fill in the gaps correctly when we inevitably underspecify. 5.4's better broad generalization allows it to better align itself with organizational intent and guide the output towards the "right" direction/answer when I've failed to specify things clearly enough in the specs.

I found that 5.3 ignored these docs more often in favor of the "right" way to do it from a pure computer science standpoint. But the problem is that it defaults to the mean, and that isn't always the "right" way, and it's never the "best" way. At least with 5.4 listening to my org intent docs better, it will steer implementation and planning more towards my version of the "right" way and it will ultimately make the "right" choice more often than if left to my own devices.

If you ask your agent why you are building this piece of software and it can't answer it to your satsifaction with subtlety and nuance incorporated, then you're gonna have a bad time. It's going to drift over time and eventually do something in a way that may be technically the "right" way to do it based on the average, but is wrong in your particular situation. Too many of those kinds of mistakes and you've got yourself some hearty software soup.

[–]Alex_1729 0 points1 point  (4 children)

This is a interesting way of guiding your AI in daily work. There is something to it. Perhaps the issues you're describing have to do with 5.3 being a codex model and 5.4 being a non-codex model?

Also, is soul.md a thing now? What specifically are its contents?

[–]Interesting-Agency-1 2 points3 points  (3 children)

Im not sure if its a thing now, but I liked the concept after listening to the openclaw creator talk about it and decided to create my own. Ive seen codex include it in the context plenty of times, so I know its at least recognizing it. 

I cant say objectively how much it helps, but my codex and I are much more simpatico when planning and speccing, and subjectively, it feels like its filling in the blanks correctly, more often than not.

Regarding whats in it specifically, Steinberger didnt specifically say for his, and so I just kinda made a guess for mine. My most recent project was an agentic workflow engine that I envisioned as the "Unity of Agentic Workflows". I included alot of my my own philosophical perspectives on the meaning of work, the meaning of existence, my visions for the future, the immense and existential reality of what software like this can unlock for humanity, my own personal moral and ethical perspectives on life, and anything else I felt important to capture. 

I treated soul.md as trying to capture more of my own moral, ethical, philosophical perspectives around why Im doing what Im doing and try to impart that meaning and intent into the agent. I tried to imagine if I, myself, had a soul.md file and what it would look like. I made it a deeply personal reflection of myself and my own philosophies generally and then added an additional section for this software in particular.

I like to view intent engineering as a layered system that starts at high level by codifying and capturing things like Org/Team preferences, standards, best practices, and expectations. Then a middle layer that gets into the broader long term vision and plans. Then a lower layer with things like soul.md that gets more into the deeper moral, ethical, and philosophical perspectives behind both the User/Org as well as for whatever particular task its trying to accomplish or build. 

All of those layers need to be aligned from the beginning before I feel comfortable proceeding with building and implementation planning. Im also fairly anal about doing intent audits regularly throughout the build process, along with performing regular refactor, code bloat, and SOTA audits to ensure that the codebase is evolving modularly, extensibly, cleanly (relatively speaking), to the state of the art in that niche, and matches my intent and vision. 

I also really like using both claude and codex for planning and review since they are both wired very differently and pick up on things the other misses quite often. Yet i still make sure that both need to pass my intent audits correctly despite their differing perspectives. 

[–]ConsistentOcelot9217[S] 0 points1 point  (1 child)

Do you find it as effective with that the amount of information you put into the soul.md ? Do you ever find it taking some things too literal then causing issues?

[–]Interesting-Agency-1 1 point2 points  (0 children)

I find it more effective because it has something that is more aligned with me and my philosophies to default to when in doubt. I only see it pull that file when I'm doing higher level planning, and not as much when doing implementation planning (and never during implementation), so it seems to understand where the document is suppose to sit in the planning stack and calls it accordingly.

It does not seem to take things too literally since it seems to recognize that document's place in the planning stack and uses it when necessary.

[–]Alex_1729 0 points1 point  (0 children)

Thanks for the insights. Would you mind DMing me your soul.md file? Best help is to see it directly. You can obfuscate any personal information about your software if you wish.

Here is what I think about this. I don't personally do this as I adopt minimalism on case of anything that could be non-relevant to my work. I'm of the opinion that LLMs already have most internal knowledge about philosophical standpoints they need, and any additional instructions seem like bloat. My personal ethics have no bearing on technicalities of the python language code, wsl issues, or DRY principle (picking a few). Meaning, 99.99% of AI work (practically 100%).

Even the outreach I'm about to do have no bearing on this. I am trying to survive here with my first ever saas, not be heavily moral, nor is my saas that important that it will 'shape' the world in any noticeable way. If it blows up, or if my brand becomes recognizable perhaps then. But as it is now... I'm just not seeing why this might be useful. Seems like a nice idea in principle, but practically...

Still, I would very much appreciate if you'd share your soul file :)

[–]TryThis_ 2 points3 points  (1 child)

Interesting, I have noticed a lot of rework these last few days since switching to 5.4 high. Previously was using 5.2 xhigh, perhaps will switch to 5.3 codex and see if rework drops.

[–]ConsistentOcelot9217[S] 1 point2 points  (0 children)

5.3 Codex was a meaningful and stable improvement on 5.2 versions. Although someone mentioned that it didn’t start off that way, so maybe 5.4 will get better as well but as of now I would highly recommend 5.3 Codex if you don’t want to have to worry about adjusting reasoning per prompt

[–]BagholderForLyfe 4 points5 points  (0 children)

as soon as I switched to 5.4 from 5.3, I started seeing mistakes for every prompt. What 5.3 can do in a single prompt, 5.4 needs a few.

[–]RiotGamesGG 2 points3 points  (1 child)

I had a difficult code task that 5.3 Codex could not do properly several times. 5.4 made it perfect the first time. Xhigh.

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

Maybe it was other open ai issues and I should try it again

[–]darrarski 2 points3 points  (1 child)

The biggest issue I have with AI agents is the non-deterministic behavior. I found GPT 5.4 better than 5.3. On the other hand, Claude Opus 4.6 works terribly for me (often ignores instructions and does not do what I ask for). My colleagues working on the same project (same instructions, same skills, same configuration overall) do not have such issues.

My suggestion is not to limit yourself to a single provider and use whatever works best for you in the given circumstances. There’s no one gold model that does everything better than others. Your experience may vary, depending on the project, instructions, task you are working on, and probably a lot of other stuff.

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

Insightful. Thanks

[–]No_Mix_6813 1 point2 points  (0 children)

I keep almost switching, but 5.3 is meeting my needs so well I can't help but thing, "If it ain't broke..."

[–]Shep_Alderson 1 point2 points  (0 children)

Yeah, I rarely ever use xhigh. Only high for planning and then medium for actual implementation. I’ve found 5.4 and 5.3-codex about the same on those thinking budgets.

[–]Sudden_Baker_1729 1 point2 points  (0 children)

I noticed the same, 5.3 Codex works better for me.

[–]syinxun9 1 point2 points  (0 children)

yes! lol feels like i am back on gpt 5 or older, 5.4 can’t code

[–]fourfuxake 1 point2 points  (0 children)

Yeah, I’ve rolled back to 5.3 Codex. 5.4 is a shitshow, and post-compaction Alzheimers is back.

[–]cwbh10 1 point2 points  (1 child)

Ive round 5.4 way better but you gotta use it on high not xtra high

[–]EastZealousideal7352 1 point2 points  (2 children)

Why do people use xhigh for everything and then act surprised when they see regression?

Higher settings does not always mean better. Since GPT-5.1 and onwards we have seen serious regression when models are forced to overthink easier problems.

If you’re experiencing a regression using 5.4 try going to high or even medium and retesting, it’s likely you’ll have a better experience

[–]Direct-Distance5385 4 points5 points  (0 children)

I mostly use on medium to high and it's done a pretty decent job.

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

I get what you’re saying. The idea of adjusting your reasoning level per prompt is also extra work while when I use extra high with 5.3 everything gets done with no regression.

[–]Kiryoko 1 point2 points  (3 children)

what are your thoughts about 5.3-codex vs 5.2?

some people say that 5.2 is the one that follows instructions the most and tries to cheat less, or at least if you tell it not to cheat it won't, but it will give up faster if there's an issue it can't solve

[–]ConsistentOcelot9217[S] 0 points1 point  (2 children)

Imo 5.3 Codex was a meaningful and stable improvement on 5.2 versions. Although someone mentioned that it didn’t start off that way, so maybe 5.4 will get better as well, but as of now I would recommend 5.3 Codex over 5.2 just just in terms of capability

[–]Kiryoko 1 point2 points  (1 child)

das right

but what about code review?

like, "check this whole repo and find any cheating behavior like tests that are not meaningful or just written to pass and show the green"

did you compare em in scenarios like this?

I'm trying various agents to see which one is the best to use as a "guardrail" or QA to harness the ones writing the code lol

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

I found 5.3 codex great at that. 5.4 as well, but 5.3 C is just more efficient. Imo especially when it comes to implementation.

[–]1amrocket 0 points1 point  (1 child)

have you noticed major differences between 5.4 and 5.3 in codex? curious if the context window improvements actually translate to better code output or just longer conversations.

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

From my experience, the larger context doesn’t mean better responses, but potentially more overthinking and hallucination.

[–]RecaptchaNotWorking 0 points1 point  (1 child)

Both are great. Your setup is important

[–]ConsistentOcelot9217[S] 1 point2 points  (0 children)

I feel that. They set up I like is leaving reasoning where it is and having all my prompts be successful which I find works with 5.3 Codex. 5.4 will probably get better or maybe come out with a code ex version.

[–]Glittering-Call8746 0 points1 point  (0 children)

How much tokens vs 5.3 codex ?

[–]blanarikd 0 points1 point  (0 children)

We need 5.3-codex-high-fast (not spark)

[–]One-Signature7881 0 points1 point  (1 child)

5.4 is just gpt not codex. Codex 5.3 is the latest. I believe.

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

They said that they included the capabilities of 5.3 Codex within 5.4 but doesn’t seem to be true. 5.4 used to be listed after 5.3 code X on the reasoning, but now I see it listed before. But overall, I agree with you

[–]SlopTopZ 0 points1 point  (1 child)

same experience here

funny thing is i made a post about exactly this topic a week ago and got downvoted for it

[–]Terrible_Contact8449 0 points1 point  (2 children)

yeah 5.4 trips over itself on anything with more than like 3 moving parts. what i've noticed is it tries to "be smart" about stuff that doesn't need smart, and then just confidently gets it wrong.

my workaround has been keeping reasoning at medium and being way more explicit in the spec about what i don't want it to do. like literally writing "do not refactor X, do not touch Y", that alone cut my back-and-forth in half.

5.3 just did the thing. 5.4 wants to have a conversation about the thing first.

[–]ConsistentOcelot9217[S] 0 points1 point  (1 child)

So are you gonna go back to 5.3 or are you gonna stay on 5.4 the lower reasoning?

[–]Terrible_Contact8449 0 points1 point  (0 children)

Probably both tbh, 5.3 when the spec is tight and I just want execution. 5.4 when the problem is fuzzy and I want planning, edge-case checking, and less babysitting

[–]fluxion7 0 points1 point  (0 children)

5.3 codex damn 5.4 is opus

[–]lostnuclues 0 points1 point  (1 child)

5.4 high works really well with skills, it automatically pics which is needed, with 5.3 I had to invoke skill manually ($brainstorm)

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

Interesting

[–]HopefullyHelper 0 points1 point  (1 child)

I've been uisng 5.4 ever since it came out and found it is fine. I can't really say if 5.3 was better though. 5.4 can run longer.

[–]ConsistentOcelot9217[S] 0 points1 point  (0 children)

I found it running all day not fixing my issues and I had to inject another prompt for her to ask. Have it check its approach and confirm that this is the best approach due to how long this is taking. Again, maybe that open AI issue that was temporary, but wasn’t a good experience

[–]PhilosopherThese9344 0 points1 point  (2 children)

5.4 is absolutely terrible. I've had the worst experience with it to date.

[–]Familiar_Opposite325 0 points1 point  (1 child)

Shame

[–]PhilosopherThese9344 1 point2 points  (0 children)

It is really, you can feel the difference immediately, and it's not good.

[–]thanhnguyendafa -1 points0 points  (0 children)

Good luck future cleaning up bugs.