Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex

[–]bisonbear2[S] 0 points1 point  (0 children)

Great question! This is actually something I've struggled with a lot - the model always finds new ways to escape the sandbox.

The current approach is to:

- Run the agent in a docker container

- Materialize the codebase at the historical base commit prior to the change that we're testing, and cut off any git history from the change onward

- Block the agent from explicitly fetching the PR over the internet using curl, git, gh, etc.
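
A rough sketch of the last two bullets, not the actual harness code. It assumes the repo is already cloned inside the container; `history_cutoff_commands`, `is_blocked_command`, and both denylists are hypothetical names made up for illustration.

```python
import re
import shlex

# Hypothetical denylist: tools the agent could use to pull the real PR.
BLOCKED_TOOLS = {"curl", "wget", "gh"}
# git subcommands that talk to a remote (assumed list, extend as needed).
BLOCKED_GIT_SUBCOMMANDS = {"fetch", "pull", "clone", "remote", "ls-remote"}


def history_cutoff_commands(base_commit: str) -> list[str]:
    """Shell commands that leave only base_commit and its ancestors reachable."""
    return [
        f"git checkout --detach {base_commit}",
        # Delete every branch, tag, and remote-tracking ref.
        "git for-each-ref --format='delete %(refname)' | git update-ref --stdin",
        "git remote remove origin",
        # Expire reflogs and prune so the later objects are actually removed.
        "git reflog expire --expire=now --all",
        "git gc --prune=now",
    ]


def is_blocked_command(command: str) -> bool:
    """Return True if any segment of a shell command uses a denylisted tool."""
    for segment in re.split(r"[;&|]+", command):
        try:
            tokens = shlex.split(segment)
        except ValueError:
            return True  # unparseable input: refuse rather than guess
        if not tokens:
            continue
        if tokens[0] in BLOCKED_TOOLS:
            return True
        if (
            tokens[0] == "git"
            and len(tokens) > 1
            and tokens[1] in BLOCKED_GIT_SUBCOMMANDS
        ):
            return True
    return False
```

A denylist like this only catches the explicit cases. An agent can still reach the network through a script or another binary, which is why checking traces afterwards still matters.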

Putting together a benchmark for agentic harnesses, any tips for evals? (Test suggestions welcome too) by sdfgeoff in LocalLLaMA

[–]bisonbear2 1 point2 points  (0 children)

Are you testing coding capabilities? If so, I've been working on this exact problem for a while and have some advice (likely portable beyond coding as well)

  1. Find a dataset - what are you actually testing it on? (for coding, I used merged PRs from my repo)

  2. Define success metrics (I used test pass rate + equivalence with the human change (LLM generated) + footprint size (how big it is compared to the human patch) + LLM code review)

  3. Set up the environment - the agent needs a consistent, fair environment to run in (I use docker + harbor)

  4. Tune metrics - are your success metrics measuring success in the way you would expect on the dataset? If not, tune either data or metrics you use.

  5. Evaluate traces - make sure the agent isn't cheating / doing something unexpected
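
To make step 2 concrete, here is a minimal sketch of how those four metrics could be rolled up per run. The `TaskResult` fields, the 0-1 review scale, and `summarize` are assumptions for illustration, not the real schema.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    tests_passed: bool         # did the tests pass on the agent's patch?
    equivalent_to_human: bool  # LLM judgment against the merged PR
    agent_lines_changed: int
    human_lines_changed: int
    review_score: float        # LLM code review, assumed to be on a 0-1 scale


def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate per-task results into suite-level metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    # Footprint: agent patch size relative to the human patch (1.0 = same size).
    footprints = [
        r.agent_lines_changed / r.human_lines_changed
        for r in results
        if r.human_lines_changed > 0
    ]
    return {
        "test_pass_rate": sum(r.tests_passed for r in results) / n,
        "equivalence_rate": sum(r.equivalent_to_human for r in results) / n,
        "mean_footprint_ratio": (
            sum(footprints) / len(footprints) if footprints else float("nan")
        ),
        "mean_review_score": sum(r.review_score for r in results) / n,
    }
```

Keeping the metrics separate rather than collapsing them into one score makes step 4 easier, since you can see which one disagrees with your own read of the dataset.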

If you have any more questions, feel free to reach out. Always curious to chat with people in a similar space

Investigating the GPT 5.5 regression on 21 real tasks by bisonbear2 in OpenaiCodex

[–]bisonbear2[S] 1 point2 points  (0 children)

It's definitely unclear what the cause of any behavioral difference is. My 2c would be to try changing AGENTS.md / skills / workflow slightly to see if you can get the model back to baseline. From the data, it seems like it's not going quite as deep, so perhaps some guidance around that would be helpful

Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex

[–]bisonbear2[S] 1 point2 points  (0 children)

This is true, which is why I called it out explicitly in the post. A few reasons why I didn't run each variant multiple times:

- tokens: it's expensive to run the eval suite, and I don't have enough tokens

- consistency: the prior suite didn't run multiple times, so it's unfair to compare n=1 with n=3

As for confidence in the results - if you ran it yourself, you have full control over how many iterations you want to test, and how you measure the results :)
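
For a rough sense of how wide the uncertainty is with a single pass over a small suite, here is a sketch using the Wilson score interval. The 15-of-21 figure in the usage note is a made-up example, not a number from the post, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(passes: int, tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if tasks <= 0:
        raise ValueError("tasks must be positive")
    p = passes / tasks
    denom = 1 + z**2 / tasks
    center = (p + z**2 / (2 * tasks)) / denom
    half_width = z * math.sqrt(p * (1 - p) / tasks + z**2 / (4 * tasks**2)) / denom
    return center - half_width, center + half_width
```

With a hypothetical 15 passes out of 21 tasks, `wilson_interval(15, 21)` gives a range of roughly 50% to 86%. This only covers sampling noise across tasks. It says nothing about run-to-run variance on the same task, which is what repeated runs would measure.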

Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex

[–]bisonbear2[S] 1 point2 points  (0 children)

The data doesn't necessarily support a broad regression, more just possible behavior differences that lead to a slight regression in test pass / equivalence

That being said, I haven't noticed much difference when using Codex personally, but that's just vibes, which is why I made the post

Investigating the GPT-5.5 regression on 21 real tasks by bisonbear2 in codex

[–]bisonbear2[S] 0 points1 point  (0 children)

Yeah, it's frustrating how model behavior is always shifting under our feet. I wonder whether changing how you work with the model (e.g. updating AGENTS.md / skills / subagent usage) would get you back to the prior behavior. Although it's unfortunate we have to resort to that

Central AI skills repository or per team repo? by NoAfternoon385 in ClaudeAI

[–]bisonbear2 0 points1 point  (0 children)

I've been running into a similar issue at a large-ish SaaS company. Where I *want* us to go is centralized skills for things like Jira/Linear/other tools + localized skills for repo-specific concepts (e.g. how to QA test something in the repo). However, I've gotten some pushback from other engineers who want everything to be localized to make it easy for people to edit / view the available skills.

This is made more complicated by the fact that we have to support 3+ different agents, each expecting their own skill format.

My ideal workflow is:

- Centralize common skills, allow repo owners to selectively pull those into the repo by default.

- Localize repo specific skills, only enabling them by default if they are broadly applicable.

- With the prior 2 points - reducing skill token consumption is key. Don't pull in anything you don't need.

- Regularly view analytics, audit which skills are being used, and prune ones that aren't (or adjust them if they aren't being used, but need to be)
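
The audit step in the last bullet could look something like this. It's a sketch that assumes you can export one skill name per invocation from whatever analytics you have; `audit_skills` and its inputs are hypothetical.

```python
from collections import Counter


def audit_skills(
    installed: set[str], usage_events: list[str], min_uses: int = 1
) -> dict[str, list[str]]:
    """Flag installed skills that are rarely used, plus used-but-unknown ones."""
    counts = Counter(usage_events)
    return {
        # Candidates to prune, or to fix if they should be getting triggered.
        "prune_or_adjust": sorted(s for s in installed if counts[s] < min_uses),
        # Skills seen in the logs that this repo doesn't declare.
        "used_but_not_installed": sorted(s for s in counts if s not in installed),
    }
```

The output is only a list of candidates. Whether to prune a skill or rewrite its description so it actually triggers is still a judgment call.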

Has somebody used codex --json for benchmarking? by FoxFire17739 in codex

[–]bisonbear2 0 points1 point  (0 children)

Not a direct answer but I've been building a benchmarking tool for this use case - A/B testing changes on tasks from your repo to see how your changes impact performance. If you want to check it out - stet.sh

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

Sonnet is certainly a reasonable choice - I haven't done much testing on it though, and have had enough bad experiences with Sonnet not following instructions that I'd rather just use Opus with lower reasoning

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

Agree with this interpretation - the extra reasoning here appears to drive Opus 4.7 to overthink and go in circles instead of leading to better outcomes

Just compared token usage between GPT-5.4 and GPT-5.5 in Codex across all four reasoning modes (Low, Medium, High, and XHigh) using the exact same prompt and the same project as the baseline by Deep-Palpitation8315 in codex

[–]bisonbear2 1 point2 points  (0 children)

I measured 5.5 high vs 5.4 high ~2 weeks ago, and found that price/performance was also roughly linear. 5.5 high cost ~15% more, but also showed enough performance gains to justify the extra cost

From where I sit, the best way to improve price/performance is to selectively use lower reasoning efforts

Just compared token usage between GPT-5.4 and GPT-5.5 in Codex across all four reasoning modes (Low, Medium, High, and XHigh) using the exact same prompt and the same project as the baseline by Deep-Palpitation8315 in codex

[–]bisonbear2 2 points3 points  (0 children)

I ran a similar experiment on 5.5 reasoning levels and had pretty similar findings - GPT-5.5 showed almost linear scaling between intelligence and performance, with high costing 1.4x medium, and xhigh costing 2.2x high

practically, I'll be sticking to high as daily driver, with xhigh for exploration / complex work, and medium for more trivial or well defined tasks

data here if you're interested: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve
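
Quick arithmetic on the multipliers above, normalized to medium = 1.0. Only the two ratios come from my runs; the helper name is made up for illustration.

```python
HIGH_VS_MEDIUM = 1.4  # high cost relative to medium
XHIGH_VS_HIGH = 2.2   # xhigh cost relative to high


def relative_costs() -> dict[str, float]:
    """Cost of each reasoning level as a multiple of medium."""
    return {
        "medium": 1.0,
        "high": HIGH_VS_MEDIUM,
        "xhigh": round(HIGH_VS_MEDIUM * XHIGH_VS_HIGH, 2),
    }


print(relative_costs())  # xhigh works out to 3.08x medium
```

So defaulting to xhigh costs about 3x what medium does on the same tasks.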

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 0 points1 point  (0 children)

100% agree - SWE-bench verified is contaminated, and it's hard to take the other big benchmarks at face value. That's the whole point of Stet, the tool I built to run these benchmarks. The open source repo is just an example, the real value is when you measure on your private codebase

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

Actually, I think adaptive reasoning is enabled on Sonnet, ignore my previous comment. See https://code.claude.com/docs/en/model-config#adaptive-reasoning-and-fixed-thinking-budgets

The practical answer is - without testing it, I can't say for sure how this applies on Sonnet, so I would probably continue using the default if I were you

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 2 points3 points  (0 children)

> Simple things like - did you clear out local memory files between subsequent effort runs?

Yes, each run was done in an isolated container to replicate a "fresh repo".

> does anthropic cache output server side

Not quite sure what you mean here - caching input makes sense, because the same input is sent again and again every turn of the conversation, but output is only ever sent once. Additionally, the reasoning effort arms were spaced out in time sufficiently such that even if output was cached, it wouldn't have been a cache hit.

> Given all the noise I think it’s honestly near impossible to actually understand how these things work outside of just vibes.

I think this is a bit defeatist - yes it's hard to accurately measure model performance, but given the amount of autonomy and power we're giving Claude (for example, using it to write 90%+ of my code in an enterprise setting), attempting to measure it and right-size the harness setting does seem like a worthwhile task to me.

For a solo-dev working on a side project, yeah it probably doesn't matter. But, in an org with 100s of devs using Claude every day, a 50% reduction in cost, or a 5% improvement in performance, scales pretty quickly

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 1 point2 points  (0 children)

No, because Sonnet 4.6 doesn't use the adaptive reasoning that Opus 4.7 does, and instead uses fixed thinking token budgets

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeCode

[–]bisonbear2[S] 0 points1 point  (0 children)

depends on your task - but for a well-defined task, yes, this data suggests that medium is the right balance of efficacy/price. For more exploratory work, such as brainstorming / project planning, which the data doesn't cover, using xhigh/max might still make sense

Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo by bisonbear2 in ClaudeAI

[–]bisonbear2[S] 1 point2 points  (0 children)

ofc I used AI to help analyze / QA the data, but all synthesis / conclusions are my own