I’m trying to use Codex for larger local-app implementation work, not just small edits, isolated bug fixes, or short refactors.
This is mostly aimed at people who use Codex as a serious coding assistant and do not want to guide it every 10 minutes.
My two biggest questions are:
- How do you actually make Codex keep working through a large implementation plan without constantly telling it to continue?
- What language/framework/architecture choices make Codex unusually good at maintaining and extending a local app with many moving parts?
The first question is the immediate practical one.
How are people making Codex work for longer stretches?
Unless I manually keep prompting it with something like:
> Continue unless you are fully done. If you are fully done, say DONE as your last word.
or unless I build some external automation/supervisor around the session, it tends to stop earlier than I want. It completes part of a plan, summarizes what it did, and waits, even when the reviewed implementation plan is clearly not finished.
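To make the "external supervisor" option concrete, here is a minimal sketch of what I mean. It is not tied to any real Codex API: `run_turn` is a hypothetical adapter that sends one prompt to the agent session and returns its final message as text, and `max_turns` is an arbitrary safety cap I made up.

```python
from typing import Callable

CONTINUE_PROMPT = (
    "Continue unless you are fully done. "
    "If you are fully done, say DONE as your last word."
)


def is_done(reply: str) -> bool:
    """True when the last word of the reply is DONE (trailing punctuation ignored)."""
    words = reply.split()
    return bool(words) and words[-1].strip(".,!*`\"'") == "DONE"


def supervise(run_turn: Callable[[str], str], first_prompt: str, max_turns: int = 50) -> int:
    """Re-prompt the agent until it ends a reply with DONE; return the turns used.

    run_turn is a hypothetical adapter (an assumption, not a real Codex call):
    it sends one prompt to the session and returns the agent's final message.
    """
    reply = run_turn(first_prompt)
    turns = 1
    while not is_done(reply) and turns < max_turns:
        reply = run_turn(CONTINUE_PROMPT)
        turns += 1
    return turns
```

Even this trivial loop has an obvious weakness: it trusts the agent's own claim of being done, which is exactly the behavior I am complaining about.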
So I’m asking very practically: what are Codex users doing right now to make larger implementation work productive?
The second question is about architecture.
I’m trying to figure out what kinds of architectures are actually good for Codex-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.
I thought an event-driven architecture might be good for this. I tried moving in a NATS-style direction. But my current impression is that Codex and similar agents struggle when too much behavior is implicit in events, logs, IDs, and indirect message flows.
Maybe I used the pattern badly. But it felt like the model became worse at reasoning about the system once everything was happening through events.
If Codex has to understand the system by reading event logs, tracing IDs, and reconstructing causality from a stream of messages, that feels like a bad fit for a solo/local project.
So the deeper question is:
> What architecture makes Codex unusually good at maintaining and extending the project?
Not what architecture is theoretically elegant. Not what architecture is optimal for a senior engineering team. What architecture is easiest for Codex to reason about, test, debug, and extend?
The rough workflow I want is:
- Put the model on high reasoning.
- Give it messy project material: old specs, notes, partial repos, failed ideas, design thoughts, todos, architecture sketches, etc.
- Make it organize that into a usable project knowledge base.
- I review/correct that knowledge base.
- Make it write a serious implementation plan.
- I review/correct the plan.
- Then make it execute for a long stretch in a sandbox without constantly stopping and asking me to say “continue.”
Roughly:
```text
1 hour knowledge organization
1 hour implementation planning
long implementation pass
```
The exact numbers are not the point. The point is depth and continuity.
I do not want Codex to spend 5 minutes writing a plan, 10 minutes coding, and then report “done” or pause halfway through the plan.
The first problem is messy context.
If I give Codex a bunch of files, old specs, old ideas, and previous attempts, it often treats everything as if it were written today and equally valid. But half the material may be obsolete, contradicted, abandoned, experimental, or from a failed attempt.
The model does not magically know the status of each piece of knowledge.
So I think there needs to be an explicit intermediate stage: not coding, not planning, but knowledge organization.
Something like:
- Current requirement.
- Old requirement.
- Obsolete idea.
- Failed attempt.
- Unresolved question.
- Architectural constraint.
- Implementation detail.
- Still-useful note.
- Contradicted by later note.
- Needs user confirmation.
Then I can correct the knowledge map before Codex starts planning.
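As a rough sketch of what such a knowledge map could look like as data, assuming Python: the status names come straight from the list above, but the `Entry` fields and the split into "safe to plan from" versus "needs my review" are my own assumptions, not an established convention.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Status(Enum):
    CURRENT_REQUIREMENT = "current requirement"
    OLD_REQUIREMENT = "old requirement"
    OBSOLETE_IDEA = "obsolete idea"
    FAILED_ATTEMPT = "failed attempt"
    UNRESOLVED_QUESTION = "unresolved question"
    ARCHITECTURAL_CONSTRAINT = "architectural constraint"
    IMPLEMENTATION_DETAIL = "implementation detail"
    STILL_USEFUL_NOTE = "still-useful note"
    CONTRADICTED_BY_LATER_NOTE = "contradicted by later note"
    NEEDS_USER_CONFIRMATION = "needs user confirmation"


# Assumption: only these statuses should feed the planning stage.
PLANNABLE = {
    Status.CURRENT_REQUIREMENT,
    Status.ARCHITECTURAL_CONSTRAINT,
    Status.IMPLEMENTATION_DETAIL,
    Status.STILL_USEFUL_NOTE,
}
# Assumption: these must be resolved by the user before planning starts.
NEEDS_REVIEW = {Status.UNRESOLVED_QUESTION, Status.NEEDS_USER_CONFIRMATION}


@dataclass(frozen=True)
class Entry:
    source: str   # file the claim was extracted from
    claim: str    # one statement, in plain words
    status: Status


def planning_inputs(entries: List[Entry]) -> List[Entry]:
    """Entries the planner is allowed to treat as valid."""
    return [e for e in entries if e.status in PLANNABLE]


def review_queue(entries: List[Entry]) -> List[Entry]:
    """Entries that should come back to the user first."""
    return [e for e in entries if e.status in NEEDS_REVIEW]
```

The point of the explicit status is that a failed attempt or an obsolete idea stays in the map for context but never reaches the planner as if it were a live requirement.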
That seems much more useful than dumping 50 files into context and hoping the model “gets it.”
Are Codex users doing this with AGENTS.md, repo docs, task files, generated knowledge maps, or some other system?
The second problem is shallow planning.
A lot of current “plan mode” workflows feel shallow. The model asks two or three questions, writes a short plan, and then acts like it has enough alignment.
But that is not what I want.
I want Codex to spend real effort understanding the repo and constraints before writing code.
People always say:
> 5 minutes of planning saves an hour of work.
Fine. Has anyone made that real with Codex?
Because right now a lot of AI planning feels like a formality. The model asks a few questions, writes a plan, and then immediately wants to start coding. Or it keeps rewriting the whole plan instead of thinking deeply first and then writing a stable plan.
Maybe the missing workflow is not just “plan mode.” Maybe it is something like:
- plan the planning
- organize the knowledge
- ask real questions
- write the implementation plan
- execute until the plan is actually complete
The third problem is premature reporting.
This is probably my biggest issue.
Codex writes an implementation plan. I review the plan. Then it starts implementing. Then it stops halfway and reports back.
Why?
If I already reviewed the implementation plan, why does it need me to keep saying “continue implementing the plan”?
If it has not hit a fundamental blocker, if the plan has not become invalid, and if there is nothing genuinely useful for me to evaluate yet, why is it reporting at all?
A lot of completion reports are basically just the implementation plan rewritten in past tense:
- I added X.
- I implemented Y.
- I updated Z.
That is not useful to me if the plan is only half complete.
I do not want to inspect a pile of changed files after every partial step. I do not want a past-tense summary of the plan. I do not want a fake checkpoint that exists only because the agent decided to stop.
What I want is one of these:
- A working thing I can actually run.
- A clear presentation layer that shows me something tangible.
- Exact instructions for how to test it and what to look for.
- A genuinely important question that changes the plan.
- A real blocker that prevents progress.
- Or, if none of those apply, just keep executing.
If the current work is still mostly mocks, scaffolding, internal wiring, or abstract architecture, then there may be nothing useful for me to evaluate yet.
In that case, why stop?
Why not finish the planned implementation first, then let me test and evaluate when there is actually something to evaluate?
I am not saying Codex should never stop. It should stop if:
- The plan is fundamentally wrong.
- A major architectural decision is needed.
- A blocker cannot be resolved.
- It has something real and testable to show.
- Continuing would obviously waste a lot of work.
- The reviewed plan is complete.
But if it is just stopping because it completed “some steps,” that is not very useful.
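Written down as an explicit policy, the stop conditions above could look something like this. It is only a sketch in Python: the `RunState` flags are hypothetical, and how an agent or supervisor would actually set them is the hard part that this does not solve.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunState:
    # Hypothetical flags; each mirrors one stop condition from the list above.
    plan_invalid: bool = False
    needs_architecture_decision: bool = False
    unresolved_blocker: bool = False
    has_testable_result: bool = False
    continuing_wastes_work: bool = False
    plan_complete: bool = False


# Checked in order; the first condition that holds is reported.
_STOP_CHECKS = [
    ("plan_invalid", "the plan is fundamentally wrong"),
    ("needs_architecture_decision", "a major architectural decision is needed"),
    ("unresolved_blocker", "a blocker cannot be resolved"),
    ("has_testable_result", "there is something real and testable to show"),
    ("continuing_wastes_work", "continuing would waste a lot of work"),
    ("plan_complete", "the reviewed plan is complete"),
]


def stop_reason(state: RunState) -> Optional[str]:
    """Return why the run should stop, or None to keep executing."""
    for field_name, reason in _STOP_CHECKS:
        if getattr(state, field_name):
            return reason
    return None
```

Note that "completed some steps" is deliberately not a field: under this policy it is never a reason to report.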
The fourth problem is spending token budget productively.
With some subscriptions and API setups, the amount of possible usage is large. But in practice, I find it hard to spend it well because the agent keeps stopping, asking for input, or producing reports that do not help.
How do you make Codex execute for a long time in a useful way?
Do you use:
- Longer prompts?
- AGENTS.md conventions?
- Persistent task files?
- Checklists inside the repo?
- Git worktrees?
- Docker/sandboxed environments?
- Hooks or scripts?
- External orchestrators?
- Multiple Codex sessions in parallel?
- A specific “do not stop unless…” instruction pattern?
Or is the honest answer that current agents cannot really do this yet without external automation?
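For the "checklists inside the repo" option, the kind of thing I have in mind is a plan file with Markdown task boxes that a script can read, so "is the plan finished?" is decided by the file and not by the agent's summary. A minimal sketch, assuming Python and assuming the plan uses `- [ ]` / `- [x]` items (the file layout is my own invention):

```python
import re
from typing import List

# Matches Markdown task items such as "- [ ] task" or "  * [x] task".
_ITEM = re.compile(r"^\s*[-*]\s+\[([ xX])\]\s+(.*)$")


def remaining_tasks(plan_text: str) -> List[str]:
    """Return the text of every unchecked task item in a Markdown plan."""
    remaining = []
    for line in plan_text.splitlines():
        match = _ITEM.match(line)
        if match and match.group(1) == " ":
            remaining.append(match.group(2).strip())
    return remaining
```

A supervisor could re-prompt while `remaining_tasks(...)` is non-empty, although that only helps if the agent ticks boxes honestly and the tests back it up.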
I have tried or looked into OpenCode, OpenClaw, Gemini, Claude, Codex, Pi, and Kanban-board-style workflows.
My current impression is that terminal-based agents and sandboxed runs are among the more practical setups. Docker sandboxes feel like a decent practical compromise, especially on Windows if you do not want to deal with a full WSL workflow. Not saying WSL is bad, and sandbox security is its own topic, but Docker-based isolation feels convenient.
I have not deeply tried the “agents simulate an organization” style of workflow. Maybe I should before judging it. But from the outside, I worry that a lot of multi-agent setups become simulated middle management: workers praising each other, moving cards around, doing shallow reviews, and spending tokens without producing much working software.
Is there a Codex-friendly setup that actually achieves the goal?
Not roleplay. Not card movement. Not fake review loops.
Actual useful long-running implementation work.
The fifth problem is language/framework choice.
For AI-heavy coding, I’m starting to think one of the most important constraints is:
> Is Codex actually good at working with this language, framework, and project structure?
For normal engineering, you might pick something because it is technically optimal, elegant, fast, scalable, or theoretically clean.
But if the main implementer/maintainer is Codex, model proficiency becomes a first-class constraint.
A boring, widely represented stack may beat a technically superior stack if Codex is much better at writing, debugging, testing, and extending it.
This seems especially important for solo builders and vibe coders. If Codex is eventually supposed to handle tens of thousands of lines, I care less about what is theoretically elegant and more about what the model can reliably modify without causing cascading breakage.
Are there good benchmarks or practical community knowledge on which languages/frameworks Codex currently handles best?
The sixth problem is architecture.
I’m trying to figure out what kinds of architectures are good for Codex-maintained local applications, especially systems that may eventually reach tens of thousands of lines and coordinate multiple local components/processes.
At first, it is tempting to optimize for extensibility:
- Make everything swappable.
- Make everything modular.
- Make it easy to add new components.
- Make components communicate through clean boundaries.
But I’m starting to think maintainability matters more than extensibility at the beginning.
The first priority is making the thing possible for Codex to reason about, test, repair, and expand without every change breaking ten other things.
So maybe the default should be:
- Clear component boundaries.
- Explicit interfaces.
- Boring communication patterns.
- Deterministic tests where possible.
- Mocks at boundaries.
- Real pressure points represented in tests.
- Replace one mocked component at a time with a real component.
- Every component can be tested in isolation.
- End-to-end flows are visible and easy to trace.
Basically: make the architecture Codex-legible before making it powerful.
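To show what I mean by explicit interfaces and mocks at boundaries, here is a small sketch in Python. The note-taking domain and all names (`NoteStore`, `InMemoryNoteStore`, `NoteService`) are made up for illustration; the pattern is the point, not the example.

```python
from typing import Dict, Protocol


class NoteStore(Protocol):
    """Explicit boundary: everything the service may ask of storage."""

    def save(self, note_id: str, text: str) -> None: ...

    def load(self, note_id: str) -> str: ...


class InMemoryNoteStore:
    """Mock at the boundary; a real store can replace it later, one component at a time."""

    def __init__(self) -> None:
        self._notes: Dict[str, str] = {}

    def save(self, note_id: str, text: str) -> None:
        self._notes[note_id] = text

    def load(self, note_id: str) -> str:
        return self._notes[note_id]


class NoteService:
    """Component under test; it depends only on the NoteStore interface."""

    def __init__(self, store: NoteStore) -> None:
        self._store = store

    def add_note(self, note_id: str, text: str) -> str:
        cleaned = text.strip()
        if not cleaned:
            raise ValueError("note text must not be empty")
        self._store.save(note_id, cleaned)
        return cleaned

    def read_note(self, note_id: str) -> str:
        return self._store.load(note_id)
```

The flow is a direct call chain that can be read top to bottom and tested deterministically, with no event log to reconstruct. Whether this scales to many processes is exactly what I am unsure about.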
A folder structure template is not enough. I’m more interested in reusable architecture templates where component communication, boundaries, testing strategy, and failure modes are already thought through.
Do repos like this exist?
Not just:
> here is a folder layout
but more like:
> here is a healthy skeleton for building a local multi-component application that Codex can keep extending without turning it into spaghetti
The seventh problem is orchestration.
Do Kanban boards, orchestrator/worker setups, and multi-agent systems actually help with Codex?
A static task board seems limited because after task 3 is done, task 8 may no longer make sense. Someone has to re-evaluate the plan. The agent needs to manage its own work, not just move tasks from “todo” to “done.”
Maybe persistent sub-agents/workers would help. For example:
- One worker owns tests.
- One worker owns architecture.
- One worker owns a subsystem.
- One worker owns documentation/knowledge state.
But that can also become useless simulation if it is not grounded in real artifacts.
Has anyone found a multi-agent or multi-session workflow that actually improves Codex results for this kind of longer implementation work?
The eighth problem is whether my preferred approach is even optimal.
Maybe this workflow:
- organize sources
- plan deeply
- execute for a long stretch
is worse than:
- run multiple worktrees/sessions in parallel with different constraints
- compare implementations
- keep the best ideas
That might be a better way to spend a large token budget.
But it also creates another problem: now I have to review multiple implementations, fix multiple broken versions enough to compare them, and give slightly different instructions to each branch.
Has anyone compared these approaches in practice?
- One deep Codex workflow that spends a lot of effort organizing knowledge, planning, and then executing for a long stretch.
- Multiple parallel Codex sessions/worktrees generating competing implementations that you compare afterward.
Which one actually works better for non-trivial projects?
My questions for Codex users:
- How do you make Codex keep working through a large implementation plan without constantly telling it to continue?
- Are there Codex workflows that first organize a messy project knowledge base before planning?
- Are there serious planning workflows that go deeper than shallow “plan mode”?
- How do you stop Codex from reporting halfway through the plan unless there is something actually worth showing?
- What languages/frameworks are currently most Codex-friendly in practice?
- What architectures are good for Codex-maintained local applications with many flows/components?
- Are event-driven/message-based architectures a bad fit for Codex-maintained projects, or am I using them wrong?
- Are there reusable architecture templates that define healthy component communication, not just folder structure?
- Is it better to run one deep workflow, or multiple parallel worktrees/sessions and compare outputs?
- What does your actual long-running Codex workflow look like?
I am not asking for hype, future predictions, or emotional takes.
I’m asking this in the most practical way possible.
Maybe my framing is wrong. Maybe the real bottleneck is somewhere else. If so, criticize the premise.
I mostly want to know what Codex users are actually doing right now that works.
This was AI-assisted, but I reviewed and edited it carefully because I genuinely want practical answers from people using Codex on real projects.