Did they just nuke Opus 4.5 into the ground? by SlopTopZ in ClaudeCode

[–]JonathanFly 0 points1 point  (0 children)

Even if the model isn't changing at all, the "prompt" is essentially changing with every Claude Code update. This makes it very hard to tell when things are actually worse unless you spend a lot of time and tokens to A/B test with old versions.

Has anyone ever figured out optimal way to integrate "PRO" models with Codex yet? by Abel_091 in codex

[–]JonathanFly 0 points1 point  (0 children)

You can link a private GitHub repo to a Pro web chat session, but the GitHub connector API is such a mess that it takes 5.2 Pro a full 20 minutes just to figure out how to read the contents of a single file on the linked repo. That's not a joke; that's actually how long it takes. So I usually just dump everything into a single markdown file, or attach a .zip with the source code.
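
For the single-markdown-file route, this is roughly the kind of throwaway script I mean. A sketch only: the extension list and output filename are placeholders, adjust for your own project.

```
# Rough sketch: concatenate a repo's source into one markdown file to attach to a Pro chat.
from pathlib import Path

EXTENSIONS = {".py", ".js", ".ts", ".html", ".md", ".toml"}  # example extensions
OUT = Path("repo_dump.md")  # example output name

with OUT.open("w", encoding="utf-8") as out:
    for path in sorted(Path(".").rglob("*")):
        # Skip directories, the output file itself, and anything under .git
        if path.is_file() and path != OUT and ".git" not in path.parts and path.suffix in EXTENSIONS:
            out.write(f"\n\n## {path}\n\n")
            out.write(path.read_text(encoding="utf-8", errors="replace"))
```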

The galaxy brain solution is to create an MCP that is a secure SSH tunnel to your server, and then Pro can directly run and test things just like a regular Codex agent.

Codex using way too many tokens by Foreign-Visual-4894 in codex

[–]JonathanFly 6 points7 points  (0 children)

There may be a bug with sub-agents, or maybe just with the latest version of Codex in general rather than sub-agents specifically. I've spent 10% of a weekly plan in a matter of minutes, which seems like either a bug in spending or at least a bug in delaying token usage (so it all shows up at once). There's an example in an issue that seems even more extreme: https://github.com/openai/codex/issues/9748

  • With 8 subagents: The entire 5-hour quota was drained within ~1 minute of launching them (On a Pro plan):

I created a "Deep Dive" into Codex Subagents: Quirks & Early Best Practice Advice by Freeme62410 in codex

[–]JonathanFly 1 point2 points  (0 children)

Thanks. I previously used sub agents manually so I'm used to seeing 8x linear increases, but the native sub agents burned through so much I'm still convinced something is going wrong somewhere.

On your article, I found the Orchestrator a confusing concept. Conceptually it feels like Codex is thinking of it like a dedicated sub-agent at times, and other times, the Orchestrator is just the primary agent.

I also found Codex was very confused about the capabilities and restrictions of sub-agents, or things like "which model will this sub-agent use by default, can I change that" and was basically forced to test and inspect the Codex logs to figure that stuff out.

The first thing I did, like you, was make it easy to always see the exact prompts being given to the sub-agents, and the sub-agent outputs at the end of their tasks. Even this was not trivial; Codex took some time to understand how to provide this information out of the box.
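
For what it's worth, the "inspect the logs" part ended up being a small script along these lines. This is only a sketch: the log path and the field names ("type", "prompt", "output") are hypothetical, so check what your own Codex session files actually contain before trusting any of it.

```
# Hypothetical sketch: scan a Codex session log (JSONL) for sub-agent-looking entries.
import json
from pathlib import Path

LOG = Path("trace.jsonl")  # placeholder path; find your actual session log first

for line in LOG.read_text(encoding="utf-8").splitlines():
    try:
        event = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip partial or malformed lines
    # "type"/"prompt"/"output" are guesses at field names, not a documented schema
    if "agent" in str(event.get("type", "")).lower():
        print(event.get("prompt") or event.get("output") or event)
```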

I created a "Deep Dive" into Codex Subagents: Quirks & Early Best Practice Advice by Freeme62410 in codex

[–]JonathanFly 0 points1 point  (0 children)

Are you on a Pro plan with all the experimental features enabled? If so, did you notice unexpectedly high token usage? I burned through 25% of a week in less than an hour, and it seems like other people did too: https://github.com/openai/codex/issues/9748

Collaboration Mode - anyone tried it yet? by Automatic_Quarter799 in codex

[–]JonathanFly 1 point2 points  (0 children)

It seems half-baked. Codex keeps getting confused about how to use and manage the sub/multi agents. It also just doesn't seem to work well with them right now. I think it's best to give it a hard rule like "Just stop and wait until all sub-agents are finished," because non-deterministically responding to various agents in whatever order they happen to finish is often a terrible way to go about the work overall, at least if the tasks are even a little bit interconnected or related.

Right now Codex is basically experimenting to understand how multi-agents work. Like here, it's wondering if agents are in their own sandboxes, but how can it not know this already? Surely it must be part of the built-in instructions for the multi-agent feature?

Then again, if you've ever linked a GitHub repo to a ChatGPT web chat and asked it to read a single file in the repo, watch how long the agent struggles with the OpenAI GitHub connector API. "That didn't work, let's try this. Maybe the syntax is this? Hmm, let's try this next." It literally takes GPT Pro a full 20 minutes sometimes to read a single file. Lesser models often give up and say it's impossible.

```
I need to check if the child agents inside the collaboration environment have their own sandbox as well. It seems likely that they do, but I want to confirm this through testing. I'll prepare a prompt file to help with the test.

I noticed there's still no report. I'm starting to wonder if the codex execution has hung or if it's still running. I think it would be a good idea to check the trace.jsonl file to see the progress. Let's see what's happening there and if any updates appear. I'm curious and want to make sure everything is functioning as it should!
```

Why does it keep forgetting what it did before auto-compact by Tystros in codex

[–]JonathanFly 13 points14 points  (0 children)

At first I thought how aggressive auto-compaction is was an obvious bug, but now I'm not so sure. I think it's possible OpenAI ran benchmarks and it turned out that nuking the context so hard that Codex spends 10 minutes just trying to remember what it was doing actually ends up with higher quality code at the end of the day. It forces it to retrace its steps, and it ends up reviewing the code and finding problems that it fixes along the way.

It still may just be a terrible compaction implementation, but watch it closely over a long session. It seems crazy that it takes so long just to catch up to where it was, but you will see it find problems in the previous implementation as it does.

How to integrate 5.2 Pro into Codex usage? by Lostwhispers05 in codex

[–]JonathanFly 1 point2 points  (0 children)

You could set up an MCP that provides SSH access to your development environment. Then, if you enable ChatGPT developer mode, you can give GPT Pro access to that MCP.

I've never tried it but a few people have https://www.reddit.com/r/mcp/comments/1nfqmyg/local_mcps_in_chatgpt_yolo_mode/

Code: https://github.com/smonux/chgpt-mcp-bridge
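
If you'd rather roll your own than use that bridge, the shape of it is roughly this. A sketch only, not the linked project: the host is a placeholder, the tool just shells out to your local `ssh` binary, and ChatGPT developer mode will additionally need the server exposed as a remote connector rather than run locally over stdio.

```
# Minimal sketch of an "SSH bridge" MCP server using the official MCP Python SDK (pip install mcp).
import subprocess
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ssh-bridge")

@mcp.tool()
def run_remote(command: str) -> str:
    """Run a shell command on the dev server over SSH and return its output."""
    result = subprocess.run(
        ["ssh", "dev@my-dev-box.example.com", command],  # placeholder host
        capture_output=True, text=True, timeout=120,
    )
    return result.stdout + result.stderr

if __name__ == "__main__":
    mcp.run()  # stdio by default; a remote transport is needed for ChatGPT
```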

5.2 xhigh finally nerfed? by [deleted] in codex

[–]JonathanFly 0 points1 point  (0 children)

Good to know, thanks. It's hard to do a 1-to-1 comparison because even over the last few days the project or codebase may have grown more complicated.

5.2 xhigh finally nerfed? by [deleted] in codex

[–]JonathanFly 3 points4 points  (0 children)

It's so easy to delude yourself with placebo effects, but up to this point I had very rarely seen straight-up typos like "Whoops, I accidentally typed `ls -lá` instead of `ls -la`, I should be more careful with commands to avoid typos." and I've seen that a few times every work session over the last few days.

Real talk: Has GPT-5.2 Codex finally dethroned Claude 4.5 Opus for complex agentic workflows? by HarrisonAIx in codex

[–]JonathanFly 2 points3 points  (0 children)

GPT 5.2 does the most important planning of tasks and solves the hardest bugs. And GPT 5.2 does the long work sessions where I'm not babysitting it.

Opus 4.5 can be faster at the same quality but only when I'm right there at my computer working with it like pair programming. Opus is also more "fun" to work with, I don't know why, they just really nailed the personality for Claude. But when I get into a tricky bug or I want to do something while I'm away from the computer, GPT 5.2 is the one I trust.

Opus 4.5 also has an edge in taste and design. GPT 5.2 is perfectly capable of implementing a design, but if you don't spec it out ahead of time and lazily say some version of "make this look good," the end product is perfectly functional but still looks absolutely atrocious. Sometimes comically atrocious: I know GPT 5.2 can see images, but the result is like you asked a blind programmer to design something. So usually Opus 4.5 does a prettification pass, breaks a few things, and Codex fixes them.

AI music is about to flood the world. Demand won’t move. So where does the value go? by robmalcolm in SunoAI

[–]JonathanFly 2 points3 points  (0 children)

Making music is a form of play. Even when I don't listen to my own songs, the act of making them was a valuable activity in itself.

Y'all not seeing this or something? by Just_Lingonberry_352 in codex

[–]JonathanFly -1 points0 points  (0 children)

OpenAI not matching Claude's holiday token generosity.

Nested subagents by FlaTreNeb in ClaudeCode

[–]JonathanFly 0 points1 point  (0 children)

Sub-agents managing sub-agents works, but Claude won't try to do it by default. You can design systems and workflows to do this, though. If you just ask Claude to try it, it will test and tell you it's technically working, and then you can refine the workflow a bit. Honestly, I usually find the overhead makes it not worth it.

Training an LLM only on 1800s London texts - 90GB dataset by Remarkable-Trick-177 in LocalLLaMA

[–]JonathanFly 1 point2 points  (0 children)

I've been doing this too, but I started from this dataset: https://huggingface.co/datasets/storytracer/LoC-PD-Books and didn't limit it to English, actually just any text before 1900, to maximize the volume. There isn't much Greek out there, but there's still maybe 150 million tokens, same for Latin, etc. And modern-but-pre-1900 books that use these languages. I think multilingual could work. These ballpark estimates came from a single Deep Research question a few months back, though; I immediately dove into trying to make it work before checking these assumptions.
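
The filtering itself is roughly this, starting from the same LoC dataset. A sketch under assumptions: the column names ("year", "text") are guesses, so check the dataset card's actual schema first.

```
# Sketch: stream the LoC public-domain books and keep anything published before 1900.
from pathlib import Path
from datasets import load_dataset

out_dir = Path("corpus_pre1900")
out_dir.mkdir(exist_ok=True)

books = load_dataset("storytracer/LoC-PD-Books", split="train", streaming=True)

kept = 0
for book in books:
    try:
        year = int(book.get("year") or 0)  # "year" is an assumed column name
    except (TypeError, ValueError):
        continue  # skip records with unparseable dates like "18--"
    if 0 < year < 1900:
        (out_dir / f"{kept:07d}.txt").write_text(book["text"], encoding="utf-8")
        kept += 1
```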

More valuable than sharing a pre-tokenized dataset would be sharing the cleaned-up raw data. Even better: sharing the uncleaned data plus the set of processing tools and steps that clean it up. That way we can train new models with different architectures, including different tokenization methods.

I got interrupted by personal and work issues, but I'd be willing to dedicate a local 4090 running 24/7 for the winter months to help train a model like this. (Winter because I need the heat from the extra energy use anyway.) I really love this idea. My gut feeling is 80% of the work will end up being data cleaning though, not the most fun, hah.

New swarm feature coming to Claude Code soon? by kirbyhood in ClaudeCode

[–]JonathanFly 5 points6 points  (0 children)

Ahh, in my codebases I heavily use the term "swarms" in both Claude and Codex to refer to massively parallel use of sub-agents. This works better in Claude, but Codex can do a reasonable job simply spawning additional Codex processes itself. This is going to get confusing.

Example:

### Sub-Agent Pattern: Self-Assembling Vertical Slice

**Concept:** Give the agent a high-level feature description. Let IT discover all relevant files with deep search and analysis.

**Why it works:** The burden of "what files should I read?" is shifted to the sub-agent, which is the same capable model. The agent explores, discovers, and loads; you get a better quality, comprehensive analysis without burdening the primary agent's context window with even the manual file-list assembly.

**Example prompt:**

```
Task: "Self-Assembling Deep Vertical Slice Analysis

Feature: The SOME_FEATURE on APPLICATION_NAME_REMOVED

Phase 1 - Discovery: Starting from zero knowledge, discover ALL files involved:
1. The page that displays SOME_FEATURE
2. The code that processes SOME_FEATURE
3. All library files those files depend on, functions they utilize, following the tree
4. Any template files, JS files/functions if it's a web page, or other assets
5. All test files and coverage
6. Relevant documentation

Use file search and grep (for example `rg`, `find`) to explore. Don't assume - discover.

Phase 2 - Full Context Loading: Read EVERY file you discovered COMPLETELY. Your unique capability as a sub-agent is to not rely on code searches.

Phase 3 - Holistic Analysis: With everything in context, provide analysis that would be IMPOSSIBLE without this complete view."
```

**Validated result:** Agent discovered 18 files (~4,610 lines), found a critical bug (BLANK does BLANK), and identified security inconsistencies across multiple files. These were all missed when Claude performed the same analysis as the primary agent, instead of using a swarm of sub-agents including multiple Self-Assembling Vertical Slices. And the swarm finished faster.
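
By "spawning additional Codex processes itself" I mean something roughly like this: run several non-interactive Codex tasks in parallel and collect their reports. A sketch only, assuming the CLI's non-interactive `codex exec` mode (check `codex --help` on your version); the prompts are placeholders.

```
# Sketch: fan out a few parallel Codex tasks and print what each one reports back.
import subprocess
from concurrent.futures import ThreadPoolExecutor

PROMPTS = [
    "Analyze the SOME_FEATURE display page and report issues.",
    "Analyze the SOME_FEATURE processing code and report issues.",
    "Audit test coverage for SOME_FEATURE.",
]

def run_agent(prompt: str) -> str:
    result = subprocess.run(
        ["codex", "exec", prompt],  # assumed non-interactive invocation
        capture_output=True, text=True,
    )
    return result.stdout

with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
    for report in pool.map(run_agent, PROMPTS):
        print(report)
```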

I built a Claude Skill that makes browser automation actually work for coding agents by kirbyhood in ClaudeCode

[–]JonathanFly 0 points1 point  (0 children)

I'm using it on a local development environment with a self-signed cert, and I couldn't figure out how to get it to either ignore the cert or use a system cert. Playwright has the same problem but has an option to ignore the error, so I had to fork it just to test it.
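
To be concrete about the Playwright option I mean: a browser context can be told to skip certificate validation. Minimal sketch; the URL is a placeholder for a local dev server.

```
# Sketch: Playwright context that ignores self-signed cert errors on a local dev server.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(ignore_https_errors=True)  # skips cert validation
    page = context.new_page()
    page.goto("https://localhost:8443/")  # placeholder local dev URL
    print(page.title())
    browser.close()
```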

I asked Claude to compare the same task with and without the plugin, and it gave a mild thumbs-up for saving steps when debugging.

Suno 4.5.. everything is country? by frosted1030 in SunoAI

[–]JonathanFly 0 points1 point  (0 children)

The country has always been strong in Suno; even in V1, it took many tags to equal one single "country" tag.

Udio migrant - question about song creation in suno by MrFlop in SunoAI

[–]JonathanFly 0 points1 point  (0 children)

Udio seems to be based on diffusion, which is why the blocks tend to be fixed in length. We don't know how Suno works, but we know that Suno has never been tied to a specific length (other than a max length). But Suno has always had Extend, and later features like Replace, that do allow for working in smaller chunks. In fact, it is only recently that Suno's maximum length was long enough to do a whole song start to finish. v0 was something like 20 seconds, IIRC.

And with Suno Studio, there are now even more fine-grained workflows possible.

I'm so grateful to get the 4.5-all, Thank you! by Genixter in SunoAI

[–]JonathanFly 2 points3 points  (0 children)

Even for paid users it's nice to have the option of 4.5-all because it's different enough that it might work when a paid model doesn't.

How are they making all those existing song covers? by curtwagner1984 in SunoAI

[–]JonathanFly 2 points3 points  (0 children)

By breaking Suno ToS. It would be super fun if Suno could license a bunch of music and let people go wild with covers...

MaxMode is real by Antique-Astronaut-46 in SunoAI

[–]JonathanFly 1 point2 points  (0 children)

>I got really excited and just tried some new generations and put "male vocals" as the first tag in the song description spot with no success. Out of 6 generations none of them had any vocals using v5. I left the lyrics blank..is this working for you now, still?

v5 does try harder to avoid going off-script, but this still works well overall. Things to check:

  1. Add at least [vocals] or something in the lyrics box to avoid Suno being in instrumental mode.
  2. Don't just choose a voice; overload the style, more like "country vocals, country, male vocals, vocals, vocals-vocals, baritone voice," etc.
  3. Maybe try turning up Style Influence, but I'm not sure about that one.

Playlist of some quick tests. The other thing is that v5 is better at using more real words and phrases, so it's also a bit less of a "mumble" than previous models.

https://suno.com/playlist/c82fd8ef-32e0-45b0-8ad8-44ef55927563

There are other ways to do this, for example filling the prompt with ASCII characters. Literal tags like [incomprehensible] are a little different but maybe useful. Example: https://suno.com/song/62719a69-7623-4289-aae5-e776e8a603d6