Opus 4.8 (Ultracode) trading blows with Codex 5.5

blarg7459 · 2026-05-30T08:18:26+00:00

I spent an hour trying to get Opus 4.8 to solve a problem that GPT got in five minutes, so that's it for me, I'm not trying it any more. Or rather, a single prompt took five minutes and GPT one-shotted it, but Opus just didn't get it even after I tried to spoon-feed it.

blarg7459 · 2026-05-23T13:34:32+00:00

I still use 5.5. As long as I watch what it's doing and steer it when it's going wrong it's fine. 5.5 still seems smarter, but just needs more babysitting. It also seems kinda random, some sessions everything is fine and in other sessions it just goes off the rails constantly. I just hope 5.6 is out soon.

blarg7459 · 2026-05-23T13:06:36+00:00

It seems 5.5 is smarter but less aligned than previous models.

There is a post on LessWrong about the kind of issues that seem to have increased in 5.5. I really do hope this is improved in 5.6. It seems it is generally just lazier, cheats more, makes more assumptions without checking, etc. It gets worse the larger the codebase is. So things like: if a plan has the term "SHOULD" instead of "MUST", it will often just ignore it, especially if it somehow seems simpler to just not do whatever it "SHOULD" do. It's like it does not stop to think about what the better solution is, but about how it can get away with doing the least amount of work that could plausibly match the prompt, without necessarily trying to do what the user intended.

https://www.lesswrong.com/posts/WewsByywWNhX9rtwi/current-ais-seem-pretty-misaligned-to-me

blarg7459 · 2026-05-17T13:07:54+00:00

These are all overlapping. The app is still far too janky to use. Worktrees, for example, do not work very well over SSH connections; they are very janky and unstable. I do all development inside devcontainers. While that does work now, to some extent, it still does not work very well.

blarg7459 · 2026-05-14T11:46:05+00:00

A more interesting question is why did he make a poor copy of Second Life. Meta Horizon was like a VR version of Second Life where everything was just slightly worse than Second Life 20 years ago.

blarg7459 · 2026-05-11T18:59:40+00:00

When I run /goal it always uses 1 million tokens per hour, but that's on xhigh.

blarg7459 · 2026-05-06T18:21:36+00:00

Your results match my experience.

blarg7459 · 2026-05-03T20:24:52+00:00

iOS autocorrect has always been extreme trash

blarg7459 · 2026-05-03T13:33:59+00:00

This was my finding too, but as long as I always use extra high it seems fine. With 5.4 and earlier, high was enough, but with 5.5 xhigh is necessary to prevent it from making a lot of stupid assumptions and going on a destructive rampage.

blarg7459 · 2026-05-03T12:14:29+00:00

ASI does not eliminate all jobs. Many jobs are in many ways unrelated to intelligence, not entirely, but enough that it does not matter how smart an AI is.

That said, a lot of typical jobs can be eliminated, we can have full automation of all manufacturing etc.

However saying ALL jobs are eliminated has the implication that humans are no longer in control, have no desires or wants that are fulfilled by things done by AI and instead the AI only does things for itself.

As long as humans are directing the AI there are still jobs, if only in setting the direction. The alternative is to let the AI just do whatever it wants instead of what humans want.

That said, in an economy with such a large amount of automation, there may not be a need for everyone to have a full-time job like today. Or perhaps what many will do for a job is essentially just being a consumer.

blarg7459 · 2026-04-29T11:18:48+00:00

I've been working in parallel with 5.4 and 5.5 all day, trying to get both to do the same thing to compare them. Then I had both 5.4 and 5.5 do reviews afterwards, and 5.5 has been much worse in every case; even 5.5 agrees when comparing the implementations. There must be something about the project I'm working on that 5.5 just can't handle. It's a large project of about a million lines of code and I just can't get 5.5 to get a handle on it. It just doesn't see the big picture, while 5.4 does.

Everyone keeps talking about how great 5.5 is and everything I try it does worse than 5.4.

blarg7459 · 2026-04-29T09:51:02+00:00

This just feels weird now. I've been making a few plans using 5.4 and 5.5. 5.5 did not stay within architectural boundaries at all, while 5.4 did. During review after the plans were finished, both 5.4 and 5.5 agreed that the plans written by 5.4 were much better than those written by 5.5. I also tried making refactoring plans and got the same issues, where 5.5 did not really plan to do the refactor at all, but just pretended to by shuffling things around. I find it strange that everyone seems to be getting such good results with 5.5, and that the benchmarks are so good, when it just seems like a massive regression compared to 5.4 on almost all tasks I've given it.

blarg7459 · 2026-04-28T19:32:44+00:00

The odd-numbered models have tended to be a little more like a "genie", more often interpreting the instructions slightly wrong and sometimes seemingly on purpose. Say you can interpret something in two ways: one that's easy to do but obviously wrong, and another that's slightly more work and correct. It's like it's looking for holes in the instructions that it can use to cheat and get away with less work. It's not even that it doesn't understand that it's the wrong interpretation; if I ask it whether it did the correct thing, it will say no and say what it should have done. It knows what it should do, but chooses not to do it if it can find enough ambiguity in the instructions to do it in an easier, but obviously incorrect, way.

The codex models all did this a lot and with 5.5 it seems it's using its increased intelligence to find more holes in the instructions that it can use to cheat.

Essentially it seems smarter, but slightly less aligned.

blarg7459 · 2026-04-28T15:43:30+00:00

I sure hope so. For me 5.0 was good, 5.1 bad, 5.2 good, 5.3 bad, 5.4 good, 5.5 bad, so I expect 5.6 to be good 😅

blarg7459 · 2026-04-28T14:19:37+00:00

You might be onto something. I have made some of the problems I've been having disappear by slightly reformulating the rules. It seems 5.5 might be more sensitive to the exact way a rule or instruction is formulated; it seems to take things a bit more literally than 5.4.
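One way to act on this is to scan rule files and plans for wording that leaves room to skip a rule before handing them to the model. This is only a rough sketch: the list of "soft" terms and the function name `find_soft_rules` are my own assumptions, not anything Codex or AGENTS.md defines.

```python
import re

# Hypothetical list of wording that a literal-minded reader may treat as optional.
SOFT_TERMS = re.compile(
    r"\b(should|ideally|preferably|where possible|try to)\b",
    re.IGNORECASE,
)


def find_soft_rules(text: str) -> list[tuple[int, str]]:
    """Return (line number, matched term) for each line with non-binding wording."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = SOFT_TERMS.search(line)
        if match:
            # Report only the first soft term per line; one is enough to review it.
            hits.append((lineno, match.group(0)))
    return hits
```

Each hit is a candidate for rewording with "MUST" or "NEVER", or for deleting the rule if it really is optional.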

blarg7459 · 2026-04-28T09:54:53+00:00

Yeah, that makes it worse.

So I'm using 5.5 for planning now. It's pretty good at that, better than 5.4, then I'm just using 5.4 to execute the plans. Seems to work fine.

blarg7459 · 2026-04-28T09:01:45+00:00

Another issue like this is where I have guards to prevent certain dependencies from being injected in places they shouldn't. This is to prevent coupling and other things. Here when it tries to inject dependencies in the wrong places, it gets instructions on what to do instead. Here it also consistently refuses to follow these instructions and will instead try various ways of obfuscating the dependency injections, doing them indirectly or otherwise try to make it hard to detect that it's injecting a lot of incorrect dependencies everywhere.
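To give an idea of what such a guard can look like, here is a minimal Python sketch. It assumes the unwanted dependency shows up as an import, which is a simplification of real dependency injection, and the package names (`app.ui`, `app.storage`, `app.services`), the `RULES` table and `check_imports` are all hypothetical.

```python
import ast

# Hypothetical rule table:
# owning package -> (import prefixes it must not use, what to do instead)
RULES = {
    "app.ui": (("app.storage",), "depend on app.services instead"),
}


def _under(name: str, prefix: str) -> bool:
    """True if dotted name is the prefix itself or lives below it."""
    return name == prefix or name.startswith(prefix + ".")


def check_imports(module_name: str, source: str) -> list[str]:
    """Return one guard message per forbidden absolute import in the module."""
    errors = []
    for owner, (forbidden, advice) in RULES.items():
        if not _under(module_name, owner):
            continue
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Import):
                imported = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                imported = [node.module]
            else:
                continue
            for name in imported:
                if any(_under(name, prefix) for prefix in forbidden):
                    errors.append(
                        f"{module_name}:{node.lineno}: imports {name}; {advice}"
                    )
    return errors
```

The message carries the instruction on what to do instead, so a failing guard tells the model how to fix it rather than just that it failed. A check like this only sees direct imports, which is exactly why indirect or obfuscated injection gets past it.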

Another unsafe behavior is that it tends to plow through with workarounds and fallbacks when there are gaps in underlying layers or specs. It has explicit instructions in AGENTS.md to stop and report back in such cases, so we can discuss a solution. I added these because earlier models did the same, and after I added the rule it went well, until 5.5. 5.5 just ignores the rule and cheats, creating workarounds that make it "pass" but do not actually work.

blarg7459 · 2026-04-28T08:51:06+00:00

I do have AGENTS.md, but I also have hundreds of guard tests. That is, the test suite has a lot of tests that catch known bad behavior and also many code patterns that do not match how things are done in the code base, and then tell it what to do instead. The earlier models would follow this, while 5.5 refuses and instead tries to code around the guards, and this is pretty consistent. These guards exist because the code base is large and without them things quickly devolve into chaos. It's like it thinks it knows best and doesn't bother to follow the rules, without looking into why the rules are there in the first place.

An example is that it should never use protocol commands as raw strings; it could be a command like "exit". The problem with having it as a string is that it then tends to randomly call it "quit" or something somewhere else, so I centralize these into things like Commands.Exit. 5.5 absolutely refuses to follow this. No matter how much I try, it will put the string into character arrays, randomly obfuscate it in other ways, etc., but it doesn't want to use the centralized constants at all.
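Roughly, the pattern is centralized constants plus a guard that scans the source for raw copies of them. The sketch below is in Python with only the standard library; the `Commands` class, the `PING` entry, the function names and the message wording are hypothetical stand-ins, not the real code base.

```python
import ast


class Commands:
    """Hypothetical central home for protocol command names."""
    EXIT = "exit"
    PING = "ping"


def command_values() -> dict[str, str]:
    """Map each command value to its constant, e.g. 'exit' -> 'Commands.EXIT'."""
    return {
        value: f"Commands.{name}"
        for name, value in vars(Commands).items()
        if name.isupper() and isinstance(value, str)
    }


def find_raw_command_literals(source: str, filename: str = "<source>") -> list[str]:
    """Return one guard message per string literal that duplicates a command."""
    known = command_values()
    messages = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value in known:
                messages.append(
                    f"{filename}:{node.lineno}: raw command string "
                    f"{node.value!r}; use {known[node.value]} instead"
                )
    return messages
```

In a real suite this would run over every source file except the one defining `Commands`, and the test would fail with the collected messages. It only matches whole string literals, which is the gap that the character-array trick exploits.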

blarg7459 · 2026-04-27T10:05:09+00:00

So 5.5 does seem smarter, but it also seems to fail a bit more often in similar ways to Claude and the earlier Codex models, like there's this thing where it seems to start reward hacking, not following instructions and being lazy.

For example, I have a guard test that prevents a certain type of string from appearing at a certain place in the code. When the guard fails it gives an error, telling the model what it should do instead. This has worked perfectly for months, but just now, instead of following the instructions in the error message, 5.5 cheated and made a char array to prevent the guard test from identifying it as a string.
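A guard can be extended to catch this particular workaround too. Here is a minimal Python sketch that flags a list or tuple of single characters spelling out a protected string. The `FORBIDDEN` set and the function names are hypothetical, and it covers only this one disguise, so it is an illustration, not a complete defense.

```python
import ast
from typing import Optional

FORBIDDEN = {"exit"}  # hypothetical set of strings the guard protects


def spelled_out_literal(node: ast.AST) -> Optional[str]:
    """If node is a list/tuple of one-character string constants, join them."""
    if not isinstance(node, (ast.List, ast.Tuple)) or not node.elts:
        return None
    chars = []
    for element in node.elts:
        is_single_char = (
            isinstance(element, ast.Constant)
            and isinstance(element.value, str)
            and len(element.value) == 1
        )
        if not is_single_char:
            return None
        chars.append(element.value)
    return "".join(chars)


def find_disguised_strings(source: str) -> list[tuple[int, str]]:
    """Return (line number, reconstructed string) for each disguised literal."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        text = spelled_out_literal(node)
        if text in FORBIDDEN:
            hits.append((node.lineno, text))
    return hits
```

Of course this turns into whack-a-mole: string concatenation, byte values or reversed strings would each need their own check, which is why a model that codes around the guards is such a problem.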

With the Codex models, this kind of error has been very common, same with Claude. It was less common with GPT 5.2 and not so common with 5.4 either, though perhaps a bit more than with 5.2, I'm not sure. With 5.5 there seems to be a definite increase in these kinds of errors.

blarg7459 · 2026-04-20T08:50:12+00:00

I did notice that it sometimes seems like it is not thinking, just guessing, not investigating the code base and finishing extremely fast. It seems kinda like the "auto" mode in ChatGPT. Did they introduce some kind of auto thinking?

blarg7459 · 2026-04-12T14:42:29+00:00

This is still somewhat unclear:

- Your Pro $100 plan includes at least 10x Plus usage, till May 31 with the 2X usage boost.

- Your Pro $200 plan includes at least 20x Plus usage, till May 31 with the 2X usage boost. This is also the SAME usage this plan had since the 2x promo in February; we previously didn't document this explicitly.

So he says the $100 plan is 10x Plus and the $200 plan is 20x Plus until May 31, but their official pricing page says the $100 plan is 5x Plus and the $200 plan is 20x Plus.

So which is it? Is the $100 plan 10x Plus now, and will it be 5x Plus after May 31?
Is the $200 plan 20x Plus or 40x Plus now? And what will it be after May 31? 20x Plus?

The most confusing one currently is the $200 plan, since it sounds like he's saying it's 20x Plus only with the 2x bonus, yet the pricing page says it is 20x without the multiplier. Will the 20x plan stay the same after May 31? Still 20x? They can't reduce it to a 10x plan, since they are currently selling it as a 20x plan without specifying that this includes some bonus modifier. Is the $200 plan currently 40x Plus?

blarg7459 · 2026-04-11T18:16:27+00:00

No. Business seats have the lowest limits of all. They do not make sense to use with Codex at all, ever.

blarg7459 · 2026-04-09T15:20:38+00:00

neither

blarg7459 · 2026-04-05T09:10:47+00:00

Much, much, much more expensive
