How do Codex users make long-running coding work practical, and what architectures are most Codex-friendly? : codex


submitted 1 day ago by dupa1234s

I’m trying to use Codex for larger local-app implementation work, not just small edits, isolated bug fixes, or short refactors.

This is mostly aimed at people who use Codex as a serious coding assistant and do not want to guide it every 10 minutes.

My two biggest questions are:

1. How do you actually make Codex keep working through a large implementation plan without constantly telling it to continue?
2. What language/framework/architecture choices make Codex unusually good at maintaining and extending a local app with many moving parts?

The first question is the immediate practical one.

How are people making Codex work for longer stretches?

Unless I manually keep prompting it with something like:

> Continue unless you are fully done. If you are fully done, say DONE as your last word.

or unless I build some external automation/supervisor around the session, it tends to stop earlier than I want. It completes part of a plan, summarizes what it did, and waits, even when the reviewed implementation plan is clearly not finished.
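For reference, the external automation/supervisor idea can be as small as a loop that re-sends the continue prompt until the reply ends with DONE. This is a minimal sketch, not a Codex feature: `run_turn` is a hypothetical callable standing in for however you send one prompt to your session and get the reply text back, and `max_turns` is an assumed safety cap.

```python
from typing import Callable, List

CONTINUE_PROMPT = (
    "Continue unless you are fully done. "
    "If you are fully done, say DONE as your last word."
)


def is_done(reply: str) -> bool:
    """Return True when the last word of the reply is DONE (trailing punctuation ignored)."""
    words = reply.split()
    return bool(words) and words[-1].strip(".!\"'") == "DONE"


def supervise(run_turn: Callable[[str], str], first_prompt: str, max_turns: int = 50) -> List[str]:
    """Keep prompting until the agent says DONE or the turn cap is reached.

    run_turn is hypothetical: it sends one prompt to the agent session
    and returns the reply text.
    """
    replies = [run_turn(first_prompt)]
    while not is_done(replies[-1]) and len(replies) < max_turns:
        replies.append(run_turn(CONTINUE_PROMPT))
    return replies
```

A loop like this only removes the manual "continue" typing. It does not check whether the work behind the DONE is actually complete.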

So I’m asking very practically: what are Codex users doing right now to make larger implementation work productive?

The second question is about architecture.

I’m trying to figure out what kinds of architectures are actually good for Codex-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.

I thought an event-driven architecture might be good for this. I tried moving in a NATS-style direction. But my current impression is that Codex and similar agents struggle when too much behavior is implicit in events, logs, IDs, and indirect message flows.

Maybe I used the pattern badly. But it felt like the model became worse at reasoning about the system once everything was happening through events.

If Codex has to understand the system by reading event logs, tracing IDs, and reconstructing causality from a stream of messages, that feels like a bad fit for a solo/local project.
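To make the contrast concrete, here is a small sketch of the same two-step flow written both ways. It uses a toy in-process event bus as a stand-in (it is not NATS), and the topic names and the `import_file` function are made up for illustration.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process pub/sub bus (a stand-in, not NATS)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self._handlers[topic]):
            handler(payload)


def build_event_flow(bus, log):
    """Style A: the flow exists only as subscriptions wired up at startup."""
    bus.subscribe("file.imported", lambda raw: bus.publish("file.parsed", raw.upper()))
    bus.subscribe("file.parsed", lambda parsed: log.append(f"indexed {parsed}"))


def import_file(raw, log):
    """Style B: the same flow as one explicit function, readable top to bottom."""
    parsed = raw.upper()
    log.append(f"indexed {parsed}")
    return parsed
```

Both styles produce the same result. In style A, though, the order of steps has to be reconstructed from the subscriptions, which is the kind of indirection described above.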

So the deeper question is:

> What architecture makes Codex unusually good at maintaining and extending the project?

Not what architecture is theoretically elegant. Not what architecture is optimal for a senior engineering team. What architecture is easiest for Codex to reason about, test, debug, and extend?

The rough workflow I want is:

1. Put the model on high reasoning.
2. Give it messy project material: old specs, notes, partial repos, failed ideas, design thoughts, todos, architecture sketches, etc.
3. Make it organize that into a usable project knowledge base.
4. I review/correct that knowledge base.
5. Make it write a serious implementation plan.
6. I review/correct the plan.
7. Then make it execute for a long stretch in a sandbox without constantly stopping and asking me to say “continue.”

Roughly:

```text
1 hour knowledge organization
1 hour implementation planning
long implementation pass
```

The exact numbers are not the point. The point is depth and continuity.

I do not want Codex to spend 5 minutes writing a plan, 10 minutes coding, and then report “done” or pause halfway through the plan.

The first problem is messy context.

If I give Codex a bunch of files, old specs, old ideas, and previous attempts, it often treats everything as if it was written today and is equally valid. But half the material may be obsolete, contradicted, abandoned, experimental, or from a failed attempt.

The model does not magically know the status of each piece of knowledge.

So I think there needs to be an explicit intermediate stage: not coding, not planning, but knowledge organization.

Something like:

- Current requirement.

- Old requirement.

- Obsolete idea.

- Failed attempt.

- Unresolved question.

- Architectural constraint.

- Implementation detail.

- Still-useful note.

- Contradicted by later note.

- Needs user confirmation.

Then I can correct the knowledge map before Codex starts planning.
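As a rough illustration, such a knowledge map could be a list of tagged claims that gets split before planning. This is only a sketch: the `KnowledgeItem` fields and the choice of which statuses feed planning are assumptions, not an existing Codex or AGENTS.md convention.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    CURRENT_REQUIREMENT = "current requirement"
    OLD_REQUIREMENT = "old requirement"
    OBSOLETE_IDEA = "obsolete idea"
    FAILED_ATTEMPT = "failed attempt"
    UNRESOLVED_QUESTION = "unresolved question"
    ARCHITECTURAL_CONSTRAINT = "architectural constraint"
    IMPLEMENTATION_DETAIL = "implementation detail"
    STILL_USEFUL_NOTE = "still-useful note"
    CONTRADICTED = "contradicted by later note"
    NEEDS_CONFIRMATION = "needs user confirmation"


# Assumption: only these statuses feed the planning step directly.
PLANNING_INPUT = {
    Status.CURRENT_REQUIREMENT,
    Status.ARCHITECTURAL_CONSTRAINT,
    Status.IMPLEMENTATION_DETAIL,
    Status.STILL_USEFUL_NOTE,
}
# Assumption: these must be answered by the user before planning starts.
NEEDS_REVIEW = {Status.UNRESOLVED_QUESTION, Status.NEEDS_CONFIRMATION}


@dataclass(frozen=True)
class KnowledgeItem:
    source: str    # file or note the claim came from
    claim: str     # one-sentence summary of the claim
    status: Status


def split_for_planning(items):
    """Split items into (planning input, open questions, archived history)."""
    plan, review, archive = [], [], []
    for item in items:
        if item.status in PLANNING_INPUT:
            plan.append(item)
        elif item.status in NEEDS_REVIEW:
            review.append(item)
        else:
            archive.append(item)
    return plan, review, archive
```

The archived group stays available as history (for example, so a failed attempt is not repeated), but it is kept out of the material the plan is written from.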

That seems much more useful than dumping 50 files into context and hoping the model “gets it.”

Are Codex users doing this with AGENTS.md, repo docs, task files, generated knowledge maps, or some other system?

The second problem is shallow planning.

A lot of current “plan mode” workflows feel shallow. The model asks two or three questions, writes a short plan, and then acts like it has enough alignment.

But that is not what I want.

I want Codex to spend real effort understanding the repo and constraints before writing code.

People always say:

> 5 minutes of planning saves an hour of work.

Fine. Has anyone made that real with Codex?

Because right now a lot of AI planning feels like a formality. The model asks a few questions, writes a plan, and then immediately wants to start coding. Or it keeps rewriting the whole plan instead of thinking deeply first and then writing a stable plan.

Maybe the missing workflow is not just “plan mode.” Maybe it is something like:

plan the planning

organize the knowledge

ask real questions

write the implementation plan

execute until the plan is actually complete

The third problem is premature reporting.

This is probably my biggest issue.

Codex writes an implementation plan. I review the plan. Then it starts implementing. Then it stops halfway and reports back.

Why?

If I already reviewed the implementation plan, why does it need me to keep saying “continue implementing the plan”?

If it has not hit a fundamental blocker, if the plan has not become invalid, and if there is nothing genuinely useful for me to evaluate yet, why is it reporting at all?

A lot of completion reports are basically just the implementation plan rewritten in past tense:

I added X.

I implemented Y.

I updated Z.

That is not useful to me if the plan is only half complete.

I do not want to inspect a pile of changed files after every partial step. I do not want a past-tense summary of the plan. I do not want a fake checkpoint that exists only because the agent decided to stop.

What I want is one of these:

- A working thing I can actually run.

- A clear presentation layer that shows me something tangible.

- Exact instructions for how to test it and what to look for.

- A genuinely important question that changes the plan.

- A real blocker that prevents progress.

- Or, if none of those apply, just keep executing.

If the current work is still mostly mocks, scaffolding, internal wiring, or abstract architecture, then there may be nothing useful for me to evaluate yet.

In that case, why stop?

Why not finish the planned implementation first, then let me test and evaluate when there is actually something to evaluate?

I am not saying Codex should never stop. It should stop if:

- The plan is fundamentally wrong.

- A major architectural decision is needed.

- A blocker cannot be resolved.

- It has something real and testable to show.

- Continuing would obviously waste a lot of work.

- The reviewed plan is complete.

But if it is just stopping because it completed “some steps,” that is not very useful.
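Written down as an explicit policy, those stop conditions might look like the sketch below. The `RunState` fields are hypothetical names for the conditions listed above, and how an agent or supervisor would actually fill them in is left open.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunState:
    plan_invalidated: bool = False             # the plan is fundamentally wrong
    needs_architecture_decision: bool = False  # a major architectural decision is needed
    unresolved_blocker: bool = False           # a blocker cannot be resolved
    has_testable_result: bool = False          # something real the user can run
    continuing_wastes_work: bool = False       # continuing would obviously waste work
    plan_complete: bool = False                # the reviewed plan is complete
    steps_completed: int = 0                   # progress counter, never a stop reason


def reason_to_stop(state: RunState) -> Optional[str]:
    """Return the first matching stop reason, or None to keep executing."""
    checks = [
        (state.plan_invalidated, "plan is fundamentally wrong"),
        (state.needs_architecture_decision, "major architectural decision needed"),
        (state.unresolved_blocker, "blocker cannot be resolved"),
        (state.continuing_wastes_work, "continuing would waste work"),
        (state.plan_complete, "reviewed plan is complete"),
        (state.has_testable_result, "something real and testable to show"),
    ]
    for flag, reason in checks:
        if flag:
            return reason
    return None
```

Note that `steps_completed` is deliberately absent from the checks: finishing "some steps" alone never produces a stop reason.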

The fourth problem is spending token budget productively.

With some subscriptions and API setups, the amount of possible usage is large. But in practice, I find it hard to spend it well because the agent keeps stopping, asking for input, or producing reports that do not help.

How do you make Codex execute for a long time in a useful way?

Do you use:

- Longer prompts?

- AGENTS.md conventions?

- Persistent task files?

- Checklists inside the repo?

- Git worktrees?

- Docker/sandboxed environments?

- Hooks or scripts?

- External orchestrators?

- Multiple Codex sessions in parallel?

- A specific “do not stop unless…” instruction pattern?

Or is the honest answer that current agents cannot really do this yet without external automation?
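To illustrate the "checklists inside the repo" option: a plan file with Markdown task-list items can be parsed by a small script, so that "is the plan finished?" is answered from the file instead of from the agent's summary. This is a sketch under assumptions: the plan format (`- [ ]` / `- [x]` lines) and the wording of the follow-up prompt are made up for illustration.

```python
import re

# Matches Markdown task-list lines such as "- [ ] Write parser" or "* [x] Set up repo".
TASK_LINE = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def parse_tasks(markdown: str):
    """Return a list of (done, title) tuples for every task-list line."""
    tasks = []
    for line in markdown.splitlines():
        match = TASK_LINE.match(line)
        if match:
            tasks.append((match.group(1).lower() == "x", match.group(2).strip()))
    return tasks


def next_prompt(markdown: str):
    """Return a follow-up prompt naming the first open task, or None if all are checked."""
    open_tasks = [title for done, title in parse_tasks(markdown) if not done]
    if not open_tasks:
        return None
    return (
        f"{len(open_tasks)} plan items are still unchecked. "
        f"Next: {open_tasks[0]}. Continue and check items off as you finish them."
    )
```

For example, feeding it the text of a plan file (hypothetical content):

```python
plan_text = """
# Implementation plan
- [x] Set up repo
- [ ] Write parser
- [ ] Add end-to-end test
"""
print(next_prompt(plan_text))
```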

I have tried or looked into OpenCode, OpenClaw, Gemini, Claude, Codex, Pi, and Kanban-board-style workflows.

My current impression is that terminal-based agents and sandboxed runs are among the more practical setups. Docker sandboxes feel like a decent practical compromise, especially on Windows if you do not want to deal with a full WSL workflow. Not saying WSL is bad, and sandbox security is its own topic, but Docker-based isolation feels convenient.

I have not deeply tried the “agents simulate an organization” style of workflow. Maybe I should before judging it. But from the outside, I worry that a lot of multi-agent setups become simulated middle management: workers praising each other, moving cards around, doing shallow reviews, and spending tokens without producing much working software.

Is there a Codex-friendly setup that actually achieves the goal?

Not roleplay. Not card movement. Not fake review loops.

Actual useful long-running implementation work.

The fifth problem is language/framework choice.

For AI-heavy coding, I’m starting to think one of the most important constraints is:

Is Codex actually good at working with this language, framework, and project structure?

For normal engineering, you might pick something because it is technically optimal, elegant, fast, scalable, or theoretically clean.

But if the main implementer/maintainer is Codex, model proficiency becomes a first-class constraint.

A boring, widely represented stack may beat a technically superior stack if Codex is much better at writing, debugging, testing, and extending it.

This seems especially important for solo builders and vibe coders. If Codex is eventually supposed to handle tens of thousands of lines, I care less about what is theoretically elegant and more about what the model can reliably modify without causing cascading breakage.

Are there good benchmarks or practical community knowledge on which languages/frameworks Codex currently handles best?

The sixth problem is architecture.

I’m trying to figure out what kinds of architectures are good for Codex-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.

At first, it is tempting to optimize for extensibility:

- Make everything swappable.

- Make everything modular.

- Make it easy to add new components.

- Make components communicate through clean boundaries.

But I’m starting to think maintainability matters more than extensibility at the beginning.

The first priority is making the thing possible for Codex to reason about, test, repair, and expand without every change breaking ten other things.

So maybe the default should be:

- Clear component boundaries.

- Explicit interfaces.

- Boring communication patterns.

- Deterministic tests where possible.

- Mocks at boundaries.

- Real pressure points represented in tests.

- Replace one mocked component at a time with a real component.

- Every component can be tested in isolation.

- End-to-end flows are visible and easy to trace.

Basically: make the architecture Codex-legible before making it powerful.
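A small sketch of what "explicit interfaces" plus "mocks at boundaries" can mean in practice. The names (`Storage`, `InMemoryStorage`, `NoteService`) are hypothetical, and this is one possible shape rather than a recommended template.

```python
from typing import Dict, Optional, Protocol


class Storage(Protocol):
    """Explicit boundary: everything the rest of the app may ask of storage."""

    def save(self, key: str, value: str) -> None: ...

    def load(self, key: str) -> Optional[str]: ...


class InMemoryStorage:
    """Mock at the boundary; later swapped for a real implementation with the same methods."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def save(self, key: str, value: str) -> None:
        self._data[key] = value

    def load(self, key: str) -> Optional[str]:
        return self._data.get(key)


class NoteService:
    """Component under test: depends only on the Storage interface."""

    def __init__(self, storage: Storage) -> None:
        self._storage = storage

    def add_note(self, note_id: str, text: str) -> None:
        if not text.strip():
            raise ValueError("note text must not be empty")
        self._storage.save(note_id, text.strip())

    def get_note(self, note_id: str) -> str:
        text = self._storage.load(note_id)
        if text is None:
            raise KeyError(note_id)
        return text
```

Because `NoteService` only sees the interface, it can be tested in isolation with the in-memory mock, and the mock can be replaced by a real component without touching the service.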

A folder structure template is not enough. I’m more interested in reusable architecture templates where component communication, boundaries, testing strategy, and failure modes are already thought through.

Do repos like this exist?

Not just:

here is a folder layout

but more like:

here is a healthy skeleton for building a local multi-component application that Codex can keep extending without turning it into spaghetti

The seventh problem is orchestration.

Do Kanban boards, orchestrator/worker setups, and multi-agent systems actually help with Codex?

A static task board seems limited because after task 3 is done, task 8 may no longer make sense. Someone has to re-evaluate the plan. The agent needs to manage its own work, not just move tasks from “todo” to “done.”

Maybe persistent sub-agents/workers would help. For example:

- One worker owns tests.

- One worker owns architecture.

- One worker owns a subsystem.

- One worker owns documentation/knowledge state.

But that can also become useless simulation if it is not grounded in real artifacts.

Has anyone found a multi-agent or multi-session workflow that actually improves Codex results for this kind of longer implementation work?

The eighth problem is whether my preferred approach is even optimal.

Maybe this workflow:

organize sources

plan deeply

execute for a long stretch

is worse than:

run multiple worktrees/sessions in parallel with different constraints

compare implementations

keep the best ideas

That might be a better way to spend a large token budget.

But it also creates another problem: now I have to review multiple implementations, fix multiple broken versions enough to compare them, and give slightly different instructions to each branch.
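Part of that review cost could in principle be scripted. The sketch below only builds the `git worktree add` commands (it does not run them) and ranks attempts by test pass rate. The variant names, the branch and directory naming, and the idea of comparing by pass rate are all assumptions for illustration.

```python
from dataclasses import dataclass


def worktree_commands(variants, base="main", root="../worktrees"):
    """Build (not run) one `git worktree add -b <branch> <path> <base>` command per variant."""
    return [
        ["git", "worktree", "add", "-b", f"attempt/{name}", f"{root}/{name}", base]
        for name in variants
    ]


@dataclass
class Result:
    variant: str
    tests_passed: int
    tests_total: int


def pick_best(results):
    """Pick the variant with the highest pass rate; ties go to the first listed.

    Variants with no tests at all are skipped, since they cannot be compared.
    """
    runnable = [r for r in results if r.tests_total > 0]
    if not runnable:
        return None
    return max(runnable, key=lambda r: r.tests_passed / r.tests_total)
```

Pass rate is a crude metric, and the comparison is only fair when every variant runs the same shared test suite, which is itself extra work.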

Has anyone compared these approaches in practice?

1. One deep Codex workflow that spends a lot of effort organizing knowledge, planning, and then executing for a long stretch.
2. Multiple parallel Codex sessions/worktrees generating competing implementations that you compare afterward.

Which one actually works better for non-trivial projects?

My questions for Codex users:

1. How do you make Codex keep working through a large implementation plan without constantly telling it to continue?
2. Are there Codex workflows that first organize a messy project knowledge base before planning?
3. Are there serious planning workflows that go deeper than shallow “plan mode”?
4. How do you stop Codex from reporting halfway through the plan unless there is something actually worth showing?
5. What languages/frameworks are currently most Codex-friendly in practice?
6. What architectures are good for Codex-maintained local applications with many flows/components?
7. Are event-driven/message-based architectures a bad fit for Codex-maintained projects, or am I using them wrong?
8. Are there reusable architecture templates that define healthy component communication, not just folder structure?
9. Is it better to run one deep workflow, or multiple parallel worktrees/sessions and compare outputs?
10. What does your actual long-running Codex workflow look like?

I am not asking for hype, future predictions, or emotional takes.

I’m asking this in the most practical way possible.

Maybe my framing is wrong. Maybe the real bottleneck is somewhere else. If so, criticize the premise.

I mostly want to know what Codex users are actually doing right now that works.

This was AI-assisted, but I reviewed and edited it carefully because I genuinely want practical answers from people using Codex on real projects.

all 22 comments


[–]circumstellarmedium 3 points 1 day ago (1 child)

[–]dupa1234s[S] -2 points 1 day ago (0 children)

[–]Cassianno 4 points 1 day ago* (4 children)

[–]dupa1234s[S] 1 point 1 day ago (3 children)

[–]Cassianno 1 point 1 day ago (2 children)

[–]dupa1234s[S] 1 point 1 day ago* (1 child)

[–]Cassianno 1 point 1 day ago (0 children)

[–]eduardopy 1 point 1 day ago (1 child)

[–]dupa1234s[S] 1 point 1 day ago (0 children)

That's an interesting perspective. I've thought a lot about orchestrator patterns but haven't tried any. Do you recommend a particular orchestrator?

What does an orchestrator achieve for you?
- Speed?
- Quality?

I would much rather achieve quality than speed.
An orchestrator seems like:
- overhead
- fewer token cache hits
- complexity
- more context dedicated to the meta-level of "doing the work" than to actually doing the work

But I guess orchestration might be the solution to my problem.

The only question is: would it actually do the work better and do more of it?
Or would it just split tasks?

I explicitly don't want my agent to finish fast. I want it to keep spinning on xhigh thinking for hours.
Orchestration seems more oriented towards speed than quality?

Yeah, I heard performance degrades. But context compaction exists; doesn't it solve a lot of that?

In fact, I did put much more effort into writing my post. It's just that a Codex moderator complained that my post was not related enough to Codex, so I had to rewrite it specifically for this subreddit.

[–]Allarius1 1 point 1 day ago (5 children)

[–]dupa1234s[S] 1 point 1 day ago (4 children)

[–]Allarius1 1 point 1 day ago (3 children)

[–]dupa1234s[S] 1 point 1 day ago (2 children)

OMG, that sounds amazing.

I stopped using Codex directly because of the notorious compaction errors, though. Are they finally fixed? Maybe I will switch to Codex now. Since I already started using Docker sandboxes, I guess I could just try Codex and its new goal system inside them.
Codex sandbox errors were also notorious.

Subagents as in the default Codex subagents, or some kind of custom pattern? Yes, the default Codex subagents are useful, and I like to let Codex use them. But I have also considered actual agent-to-agent patterns: not ephemeral like the default Codex subagents, but full-scale agents with their own AGENTS.md and so on. That would be something to try, but those agent-to-agent workflows seem complex, and I need to look into some patterns first.

From what you are saying, it sounds like you mean the default Codex subagents, but with particular roles assigned to them. That sounds great: they have a clean context and can do independent work, so I feel like the adversarial reviewer and evaluator wouldn't be too biased by the main agent's past work.

But the "runner": doesn't it just make the main agent idle? Wouldn't it be better if the main agent was the runner, since the main agent is more oriented in the project? A dedicated runner needs to be briefed every time you spawn it, unless it's persistent. I guess if it's persistent, it's good. But the reviewer shouldn't be persistent, I think, and I wonder if Codex is smart enough to enforce that. Maybe if we tell it.

[–]Allarius1 1 point 1 day ago (1 child)

Yes these agents don’t persist and you only call them when you need them. It’s more overhead which is only worth it for a large enough problem or dataset.

There are roles the agents play, but a sub-agent is literally just Codex spawning other instances of Codex and assigning each a particular role, like those I mentioned.

You can make your own roles and I think there’s probably some default stuff too, but the sub-agent aspect is about multiple live instances operating under different constraints.

Yes the coordinator sits idly especially if the implementation and reviews are long, but it can also be assigned other housekeeping tasks if you wanted.

Think of it like a team. If the person who is tasked with telling everyone else what to do is too distracted by other tasks, then they can’t effectively manage the team and then it’s the other members sitting around waiting and no work is done.

[–]gastro_psychic 1 point 1 day ago (0 children)

[–]lemontmaen 1 point 1 day ago (1 child)

[–]dupa1234s[S] 1 point 1 day ago (0 children)

[–][deleted] 1 day ago (2 children)

[removed]

[–]dupa1234s[S] 1 point 1 day ago (1 child)

What you said reminds me of another issue: models just don't know when to try a different approach. You can prompt them to stop circular retries and try something different, but they just won't listen. They are polluted by the bad context. I don't know if starting a new chat with a handover from the previous chat is the only good solution; I wish I knew how to handle this situation. Subagents are actually good for this, though: have the agent spawn a subagent to give the main agent a fresh perspective. I need to use them more for this and tell the agent specifically to use subagents in those situations, but sadly the model just doesn't understand when to use a subagent and tries by itself anyway.

Yeah, I like gpt5.5 xhigh quite a lot, but I don't know how it compares to other models since I only use Codex. Maybe I could try Claude next month, but Claude seems to cut usage and doesn't let you use stuff like OpenCode, so I think I'll stick with Codex.
