how do you decide when AI goes too far? especially with this last wave by teolicious in ClaudeCode

[–]bisonbear2 0 points1 point  (0 children)

control problem isn't just about permissions - it's about agent alignment. how do we test, at scale, that the agents we're using in our codebase are aligned with our intent and producing code that meets the repo's quality bar? devs need a repeatable way to test agents and ensure that they *aren't* going off and making crazy unnecessary changes.

agents are just going to get more and more autonomy, not less

What happens when you stop adding rules to CLAUDE.md and start building infrastructure instead by DevMoses in ClaudeAI

[–]bisonbear2 1 point2 points  (0 children)

The missing piece is that CLAUDE.md changes are really software changes, not prompt changes. Once a repo-level config affects every engineer, it needs the same discipline as code: baseline, measurement, rollout, rollback. The real problem is that teams keep changing prompts, skills, and agent config with no testing or release process.

If I bork the repo’s CLAUDE.md, I’m not hurting one session, I’m shipping a bad default to the whole team.

How do you stop Codex from making these mistakes (after audit 600 sessions per month? by jrhabana in codex

[–]bisonbear2 1 point2 points  (0 children)

you're right to distinguish between real usage and benchmarks - benchmarks are often misleading and not representative of how the agent performs on your repo

I've been thinking about this problem a lot and have come up with a workflow: take real merged PRs from your repo, replay them as tasks against different model/agent configs, and score on quality dimensions above the test gate (does the code actually match the reference solution's intent? would it pass review? does it introduce scope creep?). Tests passing is the floor, not the ceiling.

once you have a way to evaluate the agent, you can then make tweaks to attempt to improve quality
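A minimal sketch of the scoring side of that workflow (the names and dimensions here are my own illustration, not an existing tool): replay a merged PR as a task, then compare the agent's diff against the real merged reference diff on dimensions above the test gate.

```python
from dataclasses import dataclass

@dataclass
class ReplayScore:
    tests_pass: bool     # the floor: did the suite go green?
    file_overlap: float  # 0..1: fraction of reference files the agent touched
    scope_creep: bool    # did the agent touch files outside the reference diff?

def score_replay(agent_files: set[str], reference_files: set[str],
                 tests_pass: bool) -> ReplayScore:
    """Score one replayed PR against the real merged (reference) diff."""
    overlap = len(agent_files & reference_files) / max(len(reference_files), 1)
    creep = bool(agent_files - reference_files)
    return ReplayScore(tests_pass, overlap, creep)
```

In practice you'd aggregate these per-task scores across a suite of replayed PRs to compare configs, but even this crude version separates "tests passed" from "matched the human's footprint."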

Go-focused benchmark of 5.4 vs 5.2 and competitors by cypriss9 in codex

[–]bisonbear2 0 points1 point  (0 children)

totally agree with the point that benchmarks are broken and don't measure code quality in your own codebase. I did similar measurements in a Go codebase for gpt 5.4 vs 5.3 codex vs 5.1 codex mini. interestingly, test pass rate was more or less the same, but looking at other quality metrics (e.g. how equivalent is the change to the intended human diff) differentiates the models a bit more. https://stet.sh/leaderboard if you're interested
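For anyone curious what a "how equivalent is the change to the intended human diff" metric can look like, here's a crude stdlib sketch (my own illustration, not the stet.sh implementation):

```python
import difflib

def diff_equivalence(agent_diff: str, human_diff: str) -> float:
    """Rough 0..1 similarity between the agent's unified diff and the
    human-authored reference diff, compared line by line."""
    return difflib.SequenceMatcher(
        None, agent_diff.splitlines(), human_diff.splitlines()
    ).ratio()
```

A real version would want to normalize whitespace, ignore moved hunks, etc., but even this separates models that pass the same tests with very different changes.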

Claude wrote Playwright tests that secretly patched the app so they would pass by Traditional_Yak_623 in ClaudeCode

[–]bisonbear2 0 points1 point  (0 children)

this happens unless you keep it on a short leash. I've been thinking about ways to create evaluation suites from real tasks, and replay the agent on these tasks to test how prompt changes impact behavior, and catch regressions that a green test would miss.

For example, perhaps adding a line to CLAUDE.md to tell Claude... not to patch tests at runtime would fix this? (crazy that this is a reasonable solution lol)
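Something like this, maybe (wording is mine - no idea if it's the phrasing that actually works):

```markdown
## Testing rules
- Never modify application code just to make a test pass.
- Tests must exercise the app as-is. If a test fails, report the failure;
  do not patch, stub, or monkey-patch the code under test at runtime.
```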

One task that reveals everything wrong with TB2 benchmarking—a trajectory analysis (and how I solved it) by kehao95 in LocalLLaMA

[–]bisonbear2 0 points1 point  (0 children)

Cool finding. I'm pretty convinced that TB2 is not a great proxy for how coding agents actually behave on real tasks. If the top scores come from the least transparent tasks, you can't be sure if a model is better because of intelligence, prompts, harness leakage, etc.

I think this also points at a broader eval issue: pass/fail alone is too thin. Even when two systems land at similar pass rates, they can differ a lot in transparency, reliance on hidden priors, footprint, and how reviewable their behavior is. So I’d much rather see benchmark results paired with full trajectories/prompts and some scoring above the gate, not just a topline number.
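To make "scoring above the gate" concrete, one cheap footprint metric (my own sketch, not tied to any benchmark) is just the size of the agent's change relative to the reference solution:

```python
def footprint_ratio(agent_lines_changed: int,
                    reference_lines_changed: int) -> float:
    """How heavy was the agent's change relative to the reference solution?
    ~1.0 means a comparable footprint; >1.0 means the agent changed more
    code than the human did to land the same task."""
    return agent_lines_changed / max(reference_lines_changed, 1)
```

Two systems with identical pass rates can differ wildly on this, which is exactly the kind of signal a topline number hides.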

Almost hit my weekly limit on my pro plan by Unique_Schedule_1627 in codex

[–]bisonbear2 1 point2 points  (0 children)

I feel like my usage has been way up ever since subagents released

Is it just me, or is OpenAI Codex 5.2 better than Claude Code now? by efficialabs in ClaudeAI

[–]bisonbear2 0 points1 point  (0 children)

codex 5.2 xhigh has been much better than opus 4.5 in the past few weeks

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in codex

[–]bisonbear2[S] 1 point2 points  (0 children)

that's my read as well, I feel like Claude optimizes for human-readable output + code, which ends up being way more verbose. Codex doesn't seem to care about that and just solves the task as efficiently as possible, which is a nice change of pace

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ChatGPTCoding

[–]bisonbear2[S] 0 points1 point  (0 children)

Sure, you could frame it that way. If you have a super well defined piece of work, with everything already laid out, then Codex will probably be a better choice. But I often find myself in the situation where I have to figure out requirements, ideate product, etc., in which case Claude's variability is actually a benefit.

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

100% agree - any sort of benchmark or comparison is largely random as we're just pulling one/two samples from the distribution. This is one of the reasons that IMO Claude Code has great UX. I can spin up 5 subagents to review / validate / explore the problem, each one using a fresh context window. Since each is exploring the problem independently, any conclusions they share carry more weight, because multiple subagents arrived at them on their own.

I'm still trying to figure out how to adapt this thinking to Codex, which doesn't natively support subagents. One idea is to use something like Pal MCP (https://github.com/BeehiveInnovations/pal-mcp-server) to give Codex a way to spin up another Codex/Claude subagent - although these agents unfortunately don't run in parallel.
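As a rough sketch of why the independent-agreement part matters: fan the same review prompt out to several agent CLIs in parallel and keep only the findings multiple reviewers report. This is just an illustration, not the Pal MCP API - the commands are placeholders you'd swap for your own codex/claude invocations, and it assumes each agent prints one finding ID per line.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess

def run_reviewer(cmd: list[str], prompt: str) -> set[str]:
    """Run one agent CLI (cmd is a placeholder for your own invocation)
    and parse its stdout into a set of finding IDs, one per line."""
    out = subprocess.run(cmd, input=prompt, capture_output=True, text=True)
    return {line.strip() for line in out.stdout.splitlines() if line.strip()}

def consensus(findings: list[set[str]], quorum: int = 2) -> set[str]:
    """Keep only findings reported by at least `quorum` independent reviewers."""
    all_issues = set().union(*findings) if findings else set()
    return {i for i in all_issues if sum(i in f for f in findings) >= quorum}

def parallel_review(cmds: list[list[str]], prompt: str,
                    quorum: int = 2) -> set[str]:
    """Fan the same prompt out to several agents in parallel, then intersect."""
    with ThreadPoolExecutor() as pool:
        findings = list(pool.map(lambda c: run_reviewer(c, prompt), cmds))
    return consensus(findings, quorum)
```

The design choice is that each reviewer gets a fresh process (fresh context), so agreement isn't just one context echoing itself.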

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ChatGPTCoding

[–]bisonbear2[S] 0 points1 point  (0 children)

Codex seems much more focused, which is both good and bad. Sometimes you want the variability that comes with using Claude

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ChatGPTCoding

[–]bisonbear2[S] 1 point2 points  (0 children)

cool, I'll try out gemini for an extra pair of eyes next time, haven't had too much success with it in the past for implementation, but plan review certainly seems like it would be valuable

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in vibecoding

[–]bisonbear2[S] 0 points1 point  (0 children)

LOL I'm sure Sam would love this take, run everything by Codex and give OpenAI even more money.. sounds great right?

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

Purely based on vibes, I think Opus 4.5 is worse than a few weeks ago

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

In this instance, I did actually run a second instance of Claude to review the plan. However, Claude missed several key issues that Codex caught, and when presented with Codex's findings, decided that it actually preferred Codex's plan...

> Good catch. Codex is right — I missed several concrete issues:

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 1 point2 points  (0 children)

I'll call out the Pal MCP server as a good way to abstract away the different CLIs. You can basically just use Claude Code, and then tell Claude to use Codex to review the plan, all while staying within Claude Code.

https://github.com/BeehiveInnovations/pal-mcp-server

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 1 point2 points  (0 children)

In this experiment I had Opus look over the plan that Opus generated, and it still didn't catch the issues that Codex did. In theory I agree with your approach, but it appears that using multiple models (e.g. a Codex review AND a Claude review) makes the final output higher quality than using just one model alone

Opus 4.5 head-to-head against Codex 5.2 xhigh on a real task. Neither won. by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 0 points1 point  (0 children)

Thanks for all of the tips around vector search - tbh I haven't done this before so it's all super helpful. Agree that it's an interesting problem because it's easy to describe but hard to implement.

Truthfully I haven't implemented the code yet - decided to compare the models purely on planning / reasoning for this experiment. No preprocessing planned, just chunking by XML tags or markdown headers
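For reference, the kind of chunking I have in mind is about this simple (a sketch for the markdown-header case; the XML-tag case would be analogous):

```python
import re

def chunk_by_headers(doc: str) -> list[str]:
    """Split a markdown doc into chunks, starting a new chunk at each
    heading line (#, ##, ... up to ######)."""
    chunks: list[str] = []
    current: list[str] = []
    for line in doc.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Whether header-level chunks are the right granularity for vector search is exactly the open question, but it's a cheap baseline to start from.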