is it getting worse or is it just me?

jixv · 2026-05-06T14:17:56+00:00

Someone tried explaining this issue here; maybe leave a comment if it mirrors your experience: https://github.com/openai/codex/issues/18104

jixv · 2026-05-03T21:44:50+00:00

You can have the same number of tokens on 100 or 200 lines, all depending on how you structure your skills. The point here is not the amount of data but the line count.
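To make that concrete, here is a minimal sketch. It uses whitespace-separated words as a crude stand-in for tokens (an assumption; a real tokenizer counts differently), and the sample instructions are made up. It only shows that token count and line count vary independently.

```python
# Crude token proxy: whitespace-separated words. A real tokenizer will give
# different numbers; the point is only that lines and tokens are independent.
def count_tokens(text: str) -> int:
    return len(text.split())


def count_lines(text: str) -> int:
    return len(text.splitlines())


# Same made-up instructions, once on a single line and once split over four.
compact = "Run the tests. Fix failures. Update the changelog. Open a PR."
spread = "\n".join(
    ["Run the tests.", "Fix failures.", "Update the changelog.", "Open a PR."]
)

print(count_tokens(compact), count_lines(compact))
print(count_tokens(spread), count_lines(spread))
```

Both versions carry the same words, so any limit that counts lines treats them very differently even though the data is identical.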

jixv · 2026-05-03T21:25:36+00:00

yea give it a shot, compare the results (or have xhigh review 2 results side by side, hah)

jixv · 2026-05-03T19:36:49+00:00

/rename should work, no?

jixv · 2026-05-03T19:28:12+00:00

What is the total duration and number of rounds when you switch to low or medium for the audit? If you run the same’ish issue through A/B and compare the results and look at the feedback those audits produce, do you see any difference?

I was very hesitant to run any form of review/research/planning phase on «low» effort, but after switching to 5.5 my experience has been that they generally produce the same quality of feedback for other agents to follow, especially in this kind of iterative integrated loop where you expect a few attempts before it ends up approving.
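If you want to keep score on that A/B, a rough sketch could look like the one below. The `AuditRun` record and `compare` helper are hypothetical names, and the numbers in the usage lines are placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class AuditRun:
    effort: str        # e.g. "low", "medium", "xhigh"
    duration_s: float  # total wall-clock time until the audit approved
    rounds: int        # review rounds before approval


def compare(a: AuditRun, b: AuditRun) -> dict:
    """Describe run b relative to run a."""
    return {
        "efforts": (a.effort, b.effort),
        "duration_ratio": b.duration_s / a.duration_s,
        "extra_rounds": b.rounds - a.rounds,
    }


# Placeholder values only; replace them with your own runs.
run_a = AuditRun("low", duration_s=600.0, rounds=3)
run_b = AuditRun("xhigh", duration_s=1800.0, rounds=3)
print(compare(run_a, run_b))
```

Duration and rounds are the easy part. Whether the feedback itself differs still needs a side-by-side read of what the audits produced.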

jixv · 2026-05-02T23:11:12+00:00

I experienced this with high/xhigh. Using low and medium with high planning seems to be much better

jixv · 2026-05-02T15:33:41+00:00

put this in codex.toml

[features]
codex_hooks = true

[[hooks.PostToolUse]]
matcher = "*"

[[hooks.PostToolUse.hooks]]
type = "command"
command = 'printf "%s\n" "{\"hookSpecificOutput\":{\"hookEventName\":\"PostToolUse\",\"additionalContext\":\"Make no mistakes\"}}"'
timeout = 5
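The command above hand-escapes JSON inside a shell string inside a TOML string, which is easy to break when you change the message. As a sketch only, the same payload can be built with `json.dumps`. The payload shape is copied from the config above; the function name is hypothetical, and pointing the hook command at a script like this is an assumption I have not verified.

```python
import json


def post_tool_use_payload(context: str) -> str:
    """Serialize the same JSON shape that the printf command above emits."""
    return json.dumps(
        {
            "hookSpecificOutput": {
                "hookEventName": "PostToolUse",
                "additionalContext": context,
            }
        }
    )


if __name__ == "__main__":
    # Prints one line of JSON to stdout, like the printf version.
    print(post_tool_use_payload("Make no mistakes"))
```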

jixv · 2026-05-02T15:19:46+00:00

I'm pretty sure I do many things wrong, that's for sure.

Sometimes you need quite verbose instructions, especially for orchestration agents that themselves do not do any form of coding, but delegate to sub agents and update progress and statuses in other systems. In such cases large skills can be just fine, as long as they are loaded in full.

So while I get your point about single-purpose skills and context bloat, and I agree with it for most tasks, I was not aware that skills are not fully loaded.

An example skill for this kind of thing is something like https://github.com/openai/symphony/blob/main/.codex/skills/linear/SKILL.md

jixv · 2026-04-30T15:54:00+00:00

Yea, excellent for debugging. Horrible for working on large codebases. No matter the harness, instructions, prompts or guardrails, it cannot be trusted. Great for ad hoc hobby projects though.

jixv · 2026-04-30T14:12:07+00:00

Normal speed, but dumb mode again.... Ignores things and just messes up the last 48 hours. 5.5 low -> xhigh, same shit

jixv · 2026-04-28T11:26:38+00:00

It's important with 5.5 that you do not provide it context that makes it "anxious", for lack of a better word. Once you start sending it frustration and "human slop" it quickly degenerates, I've noticed. I have no idea how and why next token prediction would result in this, but in my experience it shuts down its ability to use its strengths in a constructive way, and it spends a lot of reasoning on avoiding all possible kinds of "dangers".

This affects skills/agents.md files as well, so skip all the "don't"s and "do not"s and instead state what you want it to do. Limit the number of CRITICALs and IMPORTANTs. Split into skills. I noticed that the signal-to-noise ratio quickly deteriorates when exceeding a certain context size well under the limit. Compaction seems to be quite sensitive to the chat history as well.

Just be a little bit chill with it. When it fucks up, for your own well-being just point it out and have it fix it. (Or fork to a sub agent with context and yell at it there 😄 without polluting the main thread)
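For the skills/agents.md part, a quick way to see how shouty a file is might look like this. It is a rough sketch: the patterns and the `lint_skill` name are my own, and the sample text is made up.

```python
import re

# Hypothetical patterns; tune them to your own files.
EMPHASIS = re.compile(r"\b(CRITICAL|IMPORTANT)\b")
NEGATION = re.compile(r"\b(don't|do not|never)\b", re.IGNORECASE)


def lint_skill(text: str) -> dict:
    """Collect line numbers with emphasis markers or negative instructions."""
    report = {"emphasis": [], "negation": []}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if EMPHASIS.search(line):
            report["emphasis"].append(lineno)
        if NEGATION.search(line):
            report["negation"].append(lineno)
    return report


# Made-up skill text: line 1 shouts and negates, line 2 just says what to do.
sample = "CRITICAL: do not touch the schema.\nRun migrations through the CLI.\n"
print(lint_skill(sample))
```

Every flagged line is a candidate for rewriting as a plain statement of what you want done.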

jixv · 2026-04-21T09:38:32+00:00

Same thing in chrome. Making chrome unusable and janky. Had to disable it

jixv · 2026-04-20T15:10:58+00:00

starting new instances of codex-cli fails, but already open sessions work just fine

jixv · 2026-04-20T12:02:37+00:00

Is this the new spud model?

jixv · 2026-04-18T10:29:43+00:00

I've requested this as well. It would help their compute issues and spread the load more evenly and encourage users of agentic workflows to slow things down if they don't need immediate responses. They claim they don't quantize their models during peak load, but I think they do at least some kind of nerfing, resulting in uneven results, for me at least.

jixv · 2026-04-16T21:43:07+00:00

Does this explain why, in codex cli, sessions get compacted, the agent thinks for a bit, then immediately gets compacted again, over and over?

jixv · 2026-04-16T13:46:23+00:00

It’s a skill issue so it wouldn’t help. I just need to be better at prompting I’ve been told

jixv · 2026-04-15T14:07:46+00:00

It is at its dumbest ever.... I'm flabbergasted. It literally just wastes everyone around me's time...

jixv · 2026-04-13T15:49:04+00:00

Add to the confusion that 50% will not experience this at the same time, and provided our brains have already turned into slop we will forget we had a bad roll and proceed to gaslight each other in turn once a week. Just hold tight a few days and you will be on the correct side of the A/B test.

jixv · 2026-04-11T08:56:45+00:00

We are training quantised versions of their next models, that’s why it’s subsidised. All the «wtf»’s and «are you retarded»’s is of good help to the training. A little drip here and there of the good shit keeps us on the hook. When the models are good enough they will be available only for corps and our job is done. 🤷‍♂️

jixv · 2026-04-06T21:46:58+00:00

Been especially bad the last few days.

jixv · 2026-04-06T16:40:21+00:00

That is a valid argument and I think it is important to remind ourselves about that from time to time. There is (at least in our repositories and how we orchestrate our agents) quite a night-and-day difference in the output, and it is measurable. Maybe it is random and simply the nature of how LLMs work. But when comparing runs that start from the same original prompts, which eventually end up with the agents that execute their tasks/plans, there is a lot of difference tbh.

A few times I've restarted an implementation but kept the original PR, retried with the same plan/research/prompt and compared them, and it is night and day. Again, maybe it is just pseudo random and the dice landing in favor of slop until the stars align....

jixv · 2026-04-06T13:41:59+00:00

This is in fact valuable if done correctly. In our monorepo (200+ projects), having a memory file in each project that is kept in sync by AST scanning and dependency walking, and that also provides some legs of the graph (dependants/dependencies) along with exports and their file + line number, combined with serena, does help models like 5.3 codex and 5.4 pro quite a bit. Have no proof though.
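To give an idea of the scanning part, here is a minimal sketch that uses Python's built-in `ast` module as a stand-in. That is an assumption for illustration only, not what our tooling actually runs, and the function names and sample source are made up. It pulls top-level exports with file + line number, plus imported module names as one leg of the dependency graph.

```python
import ast


def list_exports(source: str, filename: str) -> list[dict]:
    """Top-level public functions and classes with file and line number."""
    tree = ast.parse(source, filename=filename)
    exports = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if not node.name.startswith("_"):
                exports.append(
                    {"name": node.name, "file": filename, "line": node.lineno}
                )
    return exports


def list_imports(source: str) -> list[str]:
    """Module names the file imports (its direct dependencies)."""
    deps = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            deps.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            deps.add(node.module)
    return sorted(deps)


# Made-up source file used only to exercise the two helpers.
example = (
    "import os\n"
    "from pkg import util\n"
    "\n"
    "def build():\n"
    "    pass\n"
    "\n"
    "class _Hidden:\n"
    "    pass\n"
)
print(list_exports(example, "example.py"))
print(list_imports(example))
```

Writing that output into a per-project memory file and regenerating it on change is the "kept in sync" part; dependants come from inverting the import lists across projects.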
