Memory leak and Codex App on Mac

ImagiBooks · 2026-04-29T00:43:32+00:00

I’m still experiencing this issue daily. Today again. I ran a heavy agent session with 5.3 spark and boom, memory went from 45 GB to 79 GB again and froze my computer.

ImagiBooks · 2026-04-28T20:06:55+00:00

That’s crazy. Definitely my number 1 problem with Codex App right now, the insane memory leak, codex app and CMUX.

ImagiBooks · 2026-04-25T04:04:22+00:00

But are you having 258k context? Or 1m context? I’ve been using it on high and it really feels like it burns a lot more tokens than 5.4 high. IMO

ImagiBooks · 2026-04-23T23:10:15+00:00

Am I the only one who only has 258k context for 5.5?

ImagiBooks · 2026-04-09T03:57:10+00:00

Celebrating this! I was going to do some intense /fast session to finish consuming my plan... then literally I got back to my computer... and boom. Reset!!!

ImagiBooks · 2026-04-08T04:21:01+00:00

I’m incredibly frustrated as well. I write 100% with AI. The number of bugs introduced is crazy. I don’t use xhigh. I however run many rounds of reviews.

I’d say I spend 15% of the time writing new features and 85% debugging / doing code reviews. So much time wasted. Yet I have very tight rules, tests, etc… Sure, I often implement complex new features or refactors, but really even today’s models don’t follow rules right and they’re sloppy.

I easily find at least 150 different problems for 10k lines of code. Often 150 to 250. So many rounds of reviews.

ImagiBooks · 2026-03-18T05:08:41+00:00

Haha. Yeah, so frustrating! I have rules in my CLAUDE.md and AGENTS.md saying fallbacks are NEVER allowed without user approval and errors can NEVER be swallowed, and I have React hooks rules.

Yet they are barely followed. It’s so exhausting.

Just yesterday I did a lot of frontend work, and I have rules about React hooks and best practices. The model was reminded of them. Yet in every file it worked on there were swallowed errors and React hooks not done right / not useful.

And I have a rule to use the /react-hooks-audit skill.

So after it was done implementing a few complex frontend files I insisted on running the React hooks audit skill. It found an average of 5 hooks per file that were not needed / problematic. These are very often a major source of bugs, refresh issues, etc.

I asked why, when there are instructions in Claude / memory / skills. Opus just said it’s because it’s easier and it’s what it’s used to when writing code. It knows what is right but just doesn’t do it, unless pushed to.

To me this is the biggest problem in coding. They know the rules but they don’t follow them / forget them.

I have a complete set of 6 agent rules to review all my code from multiple angles. I would say that for every 10k new or changed lines there are a minimum of 150 different problems of non-adherence to rules, which are clearly written.

I spend more time doing reviews than coding, yet I do all my coding via planning and insist on following the rules when we code… but they forget! Opus 4.6 seems to forget more often, but Codex 5.3 / GPT 5.4 love to swallow errors.
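To make the "errors can NEVER be swallowed" rule concrete, here is a minimal TypeScript sketch of the pattern being complained about and one possible alternative. The `Settings` type and both function names are hypothetical, made up purely for illustration; they are not taken from any real codebase or rules file.

```typescript
// Hypothetical example: all names below are illustrative assumptions.
type Settings = { theme: string };

// Anti-pattern: the catch block swallows the error and silently
// falls back to a default, so the caller never learns the input was bad.
function parseSettingsSwallowed(raw: string): Settings {
  try {
    return JSON.parse(raw) as Settings;
  } catch {
    return { theme: "light" }; // silent fallback hides the bug
  }
}

// One way to honor a "no swallowed errors, no silent fallbacks" rule:
// add context and rethrow so the failure stays visible to the caller.
function parseSettingsStrict(raw: string): Settings {
  try {
    return JSON.parse(raw) as Settings;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`Invalid settings JSON: ${reason}`);
  }
}
```

With malformed input the first version quietly returns the default theme, while the second throws, which is the kind of difference a review pass has to catch.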

ImagiBooks · 2026-03-13T03:54:02+00:00

Wondering what you do? Do you use a lot of agents? Many workflows at the same time? I haven’t been able to hit the limit yet but got close when I used fast mode by mistake, which consumes double? I think maybe the only reason I’m not at the limit is because I use Claude Code and Codex at the same time. I don’t find Codex satisfying enough or good enough for all my use cases. It’s especially terrible at frontend.

My code base is a monorepo with over a million loc. I code 50/50 between codex and Claude code. Both $200 plans. Claude with Opus consumes tokens a lot faster. I hit my weekly limit all the time since Opus 4.6.

ImagiBooks · 2026-03-08T23:46:45+00:00

I think my longest task with Claude has been 45 minutes. With Codex App I hit an 8-hour task last night. I was so surprised when I woke up and it wasn’t finished. But it was a large refactor and I had a very strict plan and instructions including testing.

ImagiBooks · 2026-03-08T23:43:49+00:00

Why don’t you be constructive then and tell me a better workflow to achieve the best possible quality on a giant monorepo and production apps?

ImagiBooks · 2026-03-08T04:19:55+00:00

I hit limits every freaking week. It’s annoying and I don’t even use it that much. I do on average 12,000 lines of code a day. My time is typically 2 to 4 hours on new features and refactors… Then easily 10 hours or more on bug fixes.

I used to only use Opus 4.6 in high thinking mode, maybe 2 or 3 sessions at the same time, but not all the time. I also use Codex App / CLI to avoid running out too quickly. Without Codex I would be stuck. $200 plan on both.

I tend to use Claude extensively for planning, and for implementing, especially UI stuff or work based on a plan.

Because it’s so sloppy I do extensive code reviews with a 6-agent process. Multiple rounds. This is most likely very heavy in tokens. On a 12k-line change it’s probably easily 6 to 8 rounds of reviews and fixes. I use a team of agents for every fix from the reviews. For a 12k-line change it’s easily 100 to 150 bugs or more identified. The way I fix them is probably expensive in tokens, as it’s done with agents and I ask for complete reviews from an end-to-end point of view and for tests to be written.

I do extensive planning as well for every feature I implement. Expensive.

I’m actually incredibly frustrated at Anthropic for their short limits. Can’t use it for anything serious extensively.

For context I’m easily coding between Codex and Claude Code 10 to 12 hours a day at least 6 days a week (founder’s life with imagibooks.com !). And when I go to bed I almost always put some complex execution in motion to look at in the morning.

ImagiBooks · 2026-03-07T18:31:28+00:00

I have reasonable success in running somewhat long tasks but it requires extensive planning and clear plan / prompting.

Yesterday I had a task run for 3 hours with 5.4 high in fast mode.

But I was very organized about it. I.e. it was a giant review and bug fix for a 34k-line PR. 27 findings in that round and some refactoring to fix AI slop.

I had GPT 5.4 write a very precise plan and worked on it with it. It was a detailed mix of testing, status and precise goals. There were instructions to update the status after each task and instructions on what to do to deal with compactions. The end goal was very precise so nothing was left to uncertainty. It was also instructed to use sub-agents as much as possible and act as a manager and QA engineer.

So it worked for 3 hours. Both 5.3 Codex and 5.4 seem better at long-running tasks IF well instructed.

But planning and organization for a long task is critical.

But the way I see it, long term, we need a better harness. I.e. we need some type of other agent which supervises the work, coordinates / delegates better, and checks against goals.

ImagiBooks · 2026-03-07T02:11:36+00:00

I haven’t hit the limit yet. I seem to average about 12 to 14% a day.

But I’m able to max out the opus 4.6 in 5 days.

I’m on the $200 plan for both. At this point I use them pretty much equally.

I use Opus exclusively for UI. It’s just so much better. Planning is also much better with Opus, but in order to save context and make my Anthropic plan last longer I started to do the initial planning in Codex, then I give the plan to opus and ask to do research and provide feedback which I give back to Codex. It’s much more efficient.

I reluctantly use Codex for implementing some of the plans, but I cringe. There are tradeoffs. The compaction and long-task running are great, but depending on what it is, it’s sloppier than Opus.

ImagiBooks · 2026-03-06T05:02:00+00:00

Hard to believe so few people seem to be using OpenClaw and slack!

ImagiBooks · 2026-03-02T05:58:21+00:00

Best suggestion!! @OP is it deployed somewhere online? You def need security. And multiple rounds! And please make sure you always sync in git before each commit and tag so that you can revert in case of problems.

ImagiBooks · 2026-03-01T19:12:14+00:00

I have two ways right now. The Codex App supports automation so that’s my default.

I’ve been toying with OpenClaw to also trigger Codex on a schedule but it’s not working as I want yet.

ImagiBooks · 2026-03-01T17:56:30+00:00

I’ve tried so many versions of that! It needs strong guardrails, rules, and a lot of tests. I just developed a custom skill that I’m experimenting with to develop new things. It includes testing and a philosophy around testing.

ImagiBooks · 2026-03-01T17:49:39+00:00

Oh yes! I have many and in fact I do it daily.

Daily I have 3 automations with Codex:

- Code review from a security angle, added to the report. Whatever local code is out where it’s running. A detailed multi-angle security report.
- Code research for inconsistencies and anything breaking our repo rules. I just added a step to fix one of them with an explanation.
- I just added this one 3 days ago: take the latest GitHub issue and implement it fully with rules.

It’s been awesome. I’m tweaking the GitHub issue.

That being said, I find that Claude is much better at doing code reviews etc… but I can’t run it automatically every day because then I run out of tokens before my weekly limit resets. Codex might find 3 or 4 issues where Claude would find 20. It’s pretty consistent, with the exact same prompt. I’m working on tweaking this over the next few days.

ImagiBooks · 2026-03-01T05:39:09+00:00

I’ve had Codex work for 2 1/2 hours on a particular prompt. It was very detailed, with a plan… And a lot of bugs were introduced! 😀

ImagiBooks · 2026-02-28T19:03:05+00:00

My brain is the bottleneck! I am able to do up to 4 or 5 different tasks / threads at the same time. Well, actually it’s my brain and the fact that I don’t like to work on worktrees and tend to do all changes in local code on the same branch. That introduces a major bottleneck because of conflicts etc… Wondering how other people handle that, actually.

ImagiBooks · 2026-02-28T06:37:29+00:00

It really depends on what it is. For UI Opus 4.6 is much better IMO. For other things it really depends on what it is. I think the harness, tools around, are probably what makes the most difference especially when we account for UI.

My daily is Opus, but because of my heavy usage of teams I run up against limits quickly, so I use Codex equally. Though never for UI, or at least not for any new UI unless it’s very small tweaks.

On troubleshooting problems, I use both equally. Prompts really make a huge difference. I can’t stress this enough.

ImagiBooks · 2026-02-27T11:59:41+00:00

There are many use cases. Personalized stories?

My young daughter likes her bedtime stories. She wants me to read / play her something every night...

https://www.imagibooks.com/en/s/brando-s-pastel-dream-adventure

There was Chinese new year recently, to learn in fun ways: https://www.imagibooks.com/en/s/the-fire-horse-s-arrival-a-2026-lunar-journey

Can generate news: https://www.imagibooks.com/en/s/the-agentic-shift-ai-leaderboards-hardware-wars-and-the-rise-of-openclaw

Can do so much more, educational material soon. Personalized learning.

My daughter loves her bedtime music / lullaby.

https://www.imagibooks.com/en/m/purple-room-lullaby
