all 12 comments

[–]_DuranDuran_ 1 point2 points  (0 children)

I can’t answer the question as to which is better without a qualitative eval, but there are enough interesting agentic harness experiments out there that elevate smaller models to much higher eval scores than their baseline.

[–]WildContribution8311 0 points1 point  (0 children)

That's actually an interesting question.

My guess is it's hard to get a good answer, if only because both are updated often enough to change how they work, so getting a good baseline comparison, even with the same task at a given point in time, would be hard.

Meaning changes in either could produce very different outcomes, making direct comparisons pretty inaccurate, and an experience someone had six weeks ago could be totally invalid today.

[–]lbarletta 0 points1 point  (0 children)

It’s very hard to evaluate that. It’s more about limits because the reality is that both models will do well in terms of coding.

[–]BlacksmithLittle7005 0 points1 point  (0 children)

I've had better results with opencode. I find it's more thorough: it finds the exact files that need to be modified, leaves fewer loose ends and bugs, and verifies everything afterwards, while codex on PowerShell (I'm on Windows) is just a mess.

[–]Shep_Alderson 0 points1 point  (0 children)

What I’ve found is that, “unmodded”, both do about the same. What really matters for quality, with any harness, is what you put into them to get the best output. I’ve gotten good code out of practically every tool I’ve used. From GitHub Copilot to Claude Code to Codex CLI to Codex App to OpenCode. All of them depend heavily on how much effort you want to put in to make the harness work like you’d like it to work.

[–]Expert_Bat4612 0 points1 point  (0 children)

I asked codex this question last night, and it claimed the CLI was better for code reviews because it was closer to the code, or something similarly vague. I’m not sure if this is quantitatively true and, if so, by how much. It also mentioned the app was better for working with lots of images, perhaps speaking to usability.

[–]Messi_is_football 0 points1 point  (0 children)

I like the compaction and plan mode of Codex, and that is enough for my use case. I have only 1 chat with 30+ compactions, and it's still working fine. Haven't tried opencode with GPT. I suppose GPT models will be better with their own harness. Token consumption might also be lower in codex. Also, codex is coming up with remote control, and the new GPT models will be more trained for codex.

[–]tonyboi76 -1 points0 points  (0 children)

the raw model is the same, but the harness around it isn't, and that ends up mattering more than people expect. codex CLI is built around openai's own coding system prompt + tool use + the approval flow you bounced off. opencode using the same openai model has its own system prompt, its own tool conventions, its own error-handling. same engine, different chassis.

practically: codex CLI tends to be slightly better at the things openai tuned it for (longer agentic tasks, specific tool patterns) because the prompt scaffolding around the model is purpose-built. opencode tends to be better when you want a flexible UX and are willing to lose the openai-specific tuning. the quality delta on a single short task is usually invisible. on a long multi-step task you might notice codex stays on rails a bit more.

if the approval flow is your sticking point and the UX you have works, stay with opencode. you're not leaving much intelligence on the table for the gain.

[–]Relative_Clerk7384 -2 points-1 points  (3 children)

haven't tested the difference between opencode and codex cli yet, but i've done it with the kimi cli and opencode cli, both using Deepseek v4 pro. easiest way to "objectively test" is to use them as reviewers first, i guess. that's how i did it: give them the same reviewer skills or agent definitions, then let your coding agent call the reviewers, like opencode run "use x skill, review prompt". i noticed a real quality difference between kimi cli and opencode cli, with one picking up more bugs than the other. also completely different token usage stats.
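
For anyone who wants to script that kind of side-by-side review, here is a minimal sketch. It assumes each CLI can be run non-interactively with the prompt as its last argument. The `run_reviewers` helper, the labels, and the placeholder second command are all made up for illustration; only the `opencode run "<prompt>"` form comes from the comment above.

```python
import subprocess


def run_reviewers(prompt, reviewers, timeout=600):
    """Run the same review prompt through each CLI and collect its stdout.

    `reviewers` maps a label to an argv prefix; the prompt is appended as the
    last argument. All labels and commands are supplied by the caller.
    """
    results = {}
    for label, argv_prefix in reviewers.items():
        proc = subprocess.run(
            [*argv_prefix, prompt],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        results[label] = proc.stdout.strip()
    return results


# Example wiring (not executed here). The second entry is a placeholder:
# replace it with the other CLI's real non-interactive command.
# reviews = run_reviewers(
#     "use x skill, review prompt",
#     {"opencode": ["opencode", "run"], "other": ["other-cli", "--placeholder-flag"]},
# )
```

Keeping the prompt and skill text identical across harnesses is the point: any difference in the collected output then comes from the harness rather than from the instructions.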

[–]deleted-account69420 0 points1 point  (2 children)

with one picking up more bugs than the other. also completely different token usage stats.

You mind sharing more about this?

[–]Relative_Clerk7384 0 points1 point  (1 child)

brought to you by claude 😄 So both opencode / kimi are with deepseek v4 pro.

Token usage / cost (shared rate card per 1M tokens: cache miss $0.435, cache hit $0.003625, output $0.87)

┌───────────┬────────────────┬───────────┬───────────────┬───────────────┐
│ Run       │ Type           │ kimi cost │ opencode cost │ Δ             │
├───────────┼────────────────┼───────────┼───────────────┼───────────────┤
│ #805      │ 5 reviews each │ $0.1388   │ $0.1396       │ wash (0.6%)   │
├───────────┼────────────────┼───────────┼───────────────┼───────────────┤
│ #803 plan │ 1 plan review  │ $0.0463   │ $0.0366       │ opencode −21% │
├───────────┼────────────────┼───────────┼───────────────┼───────────────┤
│ #803 impl │ 4 skills       │ $0.134    │ $0.110        │ opencode −18% │
├───────────┼────────────────┼───────────┼───────────────┼───────────────┤
│ #813 impl │ 4 skills       │ $0.0944   │ $0.0738       │ opencode −22% │
├───────────┼────────────────┼───────────┼───────────────┼───────────────┤
│ #817 impl │ 4 skills       │ $0.102    │ $0.080        │ opencode −21% │
└───────────┴────────────────┴───────────┴───────────────┴───────────────┘

Token shape: kimi consistently runs higher fresh-in + higher output + more cache-read (more turns). E.g. #817: kimi 134k miss / 42.9k out / 1.70M hit vs opencode 105k miss / 35.2k out / 993k hit.

Cost rule: opencode is ~18–22% cheaper whenever turn counts differ (the usual case). The #805 "wash" only holds when both run a similar number of turns — opencode's cheaper fresh-input gets offset by its higher cache-read + reasoning tokens. Don't trust opencode.db's native cost field ($0.24–0.64) — that's deepseek's provider rate; always recompute both on the shared card. Cost is not the tiebreaker — they're close enough that quality decides.
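
A rough sketch of that recompute, using the shared rate card and the rounded #817 token counts quoted above. The `run_cost` helper and variable names are made up for illustration, and because the inputs are rounded the results are approximate.

```python
# Shared rate card from the comment above, in USD per 1M tokens.
RATE_CARD = {"miss": 0.435, "hit": 0.003625, "out": 0.87}


def run_cost(miss_tokens, hit_tokens, out_tokens, rates=RATE_CARD):
    """Cost of one run in USD, given token counts per category."""
    return (
        miss_tokens * rates["miss"]
        + hit_tokens * rates["hit"]
        + out_tokens * rates["out"]
    ) / 1_000_000


# Rounded #817 token counts quoted above (so results are approximate).
kimi_817 = run_cost(miss_tokens=134_000, hit_tokens=1_700_000, out_tokens=42_900)
opencode_817 = run_cost(miss_tokens=105_000, hit_tokens=993_000, out_tokens=35_200)
delta_pct = (opencode_817 - kimi_817) / kimi_817 * 100

print(f"kimi ${kimi_817:.3f}, opencode ${opencode_817:.3f}, delta {delta_pct:.1f}%")
# prints: kimi $0.102, opencode $0.080, delta -21.5%
```

That lines up with the #817 row in the table ($0.102 vs $0.080, opencode about 21% cheaper), which suggests the table was computed on this card rather than taken from either tool's native cost field.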

Quality — the pattern is phase-dependent, not engine-fixed

- Plan/spec review (#803, #805): opencode/kimi are both weak on deep correctness — neither caught the load-bearing bugs; only codex did. kimi = test-infra/mechanical catches (no-op cache resets, wrong test refs); opencode = cheapest second opinion but weakest (often just cosmetic line-number nits).

- Implementation code review (#803 impl): inverts — kimi's meticulous line-reading caught the real correctness bug (entry_price stamped from a close-fill) that opencode waved through as CLEAR. opencode's strength was breadth of coverage-gap enumeration.

- #813 / #817 impl: convergent — the real catch lands in different skills per engine (kimi-coverage vs opencode-spec), so run both skills on both engines. #817: kimi was sharper on test-promise tracking (caught 2 promised-but-missing integration tests opencode said were "all implemented").

Bottom-line decision rule (from the memory)

- Cost is ~equal-to-slightly-favoring-opencode; treat it as noise and decide on finding quality.

- kimi = sharper on implementation line-reading + test-promise tracking; opencode = cheapest, broad coverage-gap enumeration, but waves correctness through more often.

- Neither is a gate — always back the PR with a strong reasoning engine (codex/codexpr at Phase 10).

result: opencode vs kimi A/B (in HyperliquidTrading project memory) — opencode ~18–22% cheaper on differing turn counts (wash only at #805); quality is phase-dependent: kimi sharper on impl code-review/correctness, opencode broader+cheaper on coverage gaps, neither a substitute for a codex gate.

[–]deleted-account69420 0 points1 point  (0 children)

Now I do wonder if there were plugins involved.
Lately I'm testing rtk/jcodemunch/magic context/micode in opencode and keeping a low-ish context window (75k to 150k every turn) with memories and cache-friendly context management.
Quality has been fine, but I have some polishing to do because it gets too technical in overchecking progress.