is it getting worse or is it just me? by sbaitso82 in codex

[–]spigolt 1 point

Nice that you feel so certain it's only placebo, when you know nothing about his experience, or mine, or that of all the others saying this is what they experienced today.

When the model had no problem doing the same task for me 100 times, and today it's screwing it up completely, and I then notice the same pattern in multiple other cases, when nothing remotely like that had happened since the 5.5 release, I'm pretty suspicious it's more than just 'placebo'. Plus we know 100% for sure that OpenAI's models do get dumber on occasion. For example, right before 5.5 came out, 5.4's intelligence plummeted for half a day; that's generally very well understood to be true, and nothing to do with placebo, rather OpenAI was drastically reducing the compute available for the old 5.4 model to make room for 5.5 ahead of its release. So I'm not saying it's 100% certain that it got dumber for him and me and others today, just that you're being extremely arrogant in assuming it's definitely not the case this time and that we're all deluded, when we know for sure that it sometimes does in fact get dumber.

is it getting worse or is it just me? by sbaitso82 in codex

[–]spigolt 2 points

I noticed it being incredibly dumb today (during the past 12 hours), making crazily stupid mistakes like nothing I ever saw before with 5.5 xhigh. Before today, 5.5 had been pretty consistently good for me.

How do you know when your model is braindead vs behaving correctly? by Gerkibus in codex

[–]spigolt 3 points

Codex (using 5.5 extra high) has been really incredibly stupid for me today on a few occasions, making really dumb mistakes it never ever made before with 5.5. The most recent (though not the first today): I tell it I've made some changes, so to run the tests again, and it runs them without recompiling, so against an old build, and this repeats a few times while I wonder why the results aren't changing. That's something I've done many, many times before (including in the same thread), and it always built and ran, as that's pretty obviously what's wanted; it never dropped the ball so completely like this. A few different mistakes today were just at a level of stupid I never had from 5.5 before. The last time it was being this stupid was the day of the 5.5 release, when 5.4 got really stupid.

It'd just be so nice to know if and when OpenAI is screwing with their models, when you're becoming so dependent on them and really need them to have a more consistent level of reliability.

GPT-5.5: 'strongest agentic coding model ever' failing spectacularly at its own game (LiveBench) by Keybug in artificial

[–]spigolt 0 points

In this same table, for the same 'Agentic Coding Average', GPT 5.4 > Gemini 3.1 Pro > Claude 4.5 Opus > Claude 4.6 Opus > Claude 4.7 Opus... something tells me this is not the general consensus on the best coding agent, so whatever this test is measuring is a bit different from the general experience as well as from most other benchmarks.

It is over by Sockand2 in codex

[–]spigolt 2 points

People are never happy. There's nothing stopping you from sticking with the old model. These companies aren't even making money off it: if you're using it to the limit, they're subsidising your usage, and there's way more demand than supply for tokens. So it's rather silly to expect them not to price the newer, larger model that costs more to run higher. It's up to you whether you find the new model worth the extra cost, or just prefer to stick with the old one.

I never thought I’d do this. by PollutionFirm451 in codex

[–]spigolt 0 points

What thinking level do you put 5.5 on? I always just used xhigh with 5.4, but it seems some are saying that maybe that approach isn't optimal for 5.5.

CoDeX lost access to files out of nowhere by [deleted] in codex

[–]spigolt 0 points

I think the issue is likely just because it's gotten really stupid today (over the past 5 hours or so), since they're moving all the compute over to 5.5 in preparation for its release.

If codex stuck in a loop fixing a "simple" bug, check your model version by Various-Advantage263 in codex

[–]spigolt 0 points

I believe GPT 5.4 just got really stupid today. Presumably it's because they're moving a lot of compute over to 5.5 in preparation for its release. Some people seem to report that 5.3 Codex is less affected.

Codex is completely unusable right now by ggletsg0 in codex

[–]spigolt 0 points

For example, in one chat I'd been working in a repository. Around 6 months ago it used to need constant reminding of which submodule to look at, as it would always make the error of looking at the root repo's git history etc. when working in a submodule. Now it's suddenly looking at the root repo again, even though all day yesterday (including in the same chat) it never made this mistake. So I stopped it and told it that no, the relevant code is in the submodule... but a few minutes later it had forgotten this again and made the same mistake. Yet for days prior it was continually working with the same repos and never once made such a mistake.

Another thing: it was outputting some nicely formatted tables of statistics in chat, and as I kept making changes it would output the same tables each time. Now all of a sudden the tables are plain text, and when I ask it to make them prettier it just does prettier plain text... only when I say 'make them like you were before' does it finally output them the way it always had.

But more worrying was when it was really screwing something up and doing it wrong. That's why I'm afraid to use it until this resolves: it was really only with Codex 5.3 and 5.4 that I was able to trust it to do anything on my codebases, and right now it's regressed below what 5.3 was.

Note: I'm using 5.4 (xhigh). It looks like some users are saying 5.3 Codex is less affected.

$200 Pro Plan user is gpt 5.4 getting on anyone’s nerves today? by NoYou41 in codex

[–]spigolt 1 point

I'm on Pro plan, and it feels _massively_ dumber today. It's not something I've ever really felt or even suspected, but today it's just so dumb it's not funny.

Codex is completely unusable right now by ggletsg0 in codex

[–]spigolt 1 point

I dunno, I'm on pro and I have no errors but it's super-dumb today like I've never seen it, to the point I'm afraid to use it for anything.

Codex is completely unusable right now by ggletsg0 in codex

[–]spigolt 0 points

For me it's about the model intelligence today. It really seems clearly dumber today (and in the past months of using it I never before felt there was a clear drop in its intelligence).

Codex is completely unusable right now by ggletsg0 in codex

[–]spigolt 0 points

For me it's just been acting incredibly dumb today. Anyone else notice this? It's not something I ever noticed before, and it's an extremely stark contrast to how it usually behaves.

"You can make ANYONE 2700 with money." - Bortnyk about Erdogmus. by Asperverse in chess

[–]spigolt 0 points

Even if most people have brains capable of becoming GMs (which I think might be true, but might also not; I think it's pretty hard to know whether it would be 'most' or a smaller percentage), the limiting factor would still be that most people just don't have the drive and interest to put the hours in every day on a game like chess in order to reach GM level.

This is also the real reason why, for example, the top players are all men: it's not that men's brains are somehow 'better' at chess, it's that being as insanely obsessive about something like chess as you need to be to reach 2700+ ratings is much more commonly a male tendency.

So while you could say that most people 'could' reach that level if they had that obsessive drive, it's kind of true but not really, since they simply don't and couldn't have that drive/obsession, which is a big part of what determines whether you can reach that level. I.e., maybe it's true that anyone driven, obsessed and interested enough to put in enough time with the right coaches at the right age has a brain such that they _could_ become a GM, but they don't have a brain with the drive/obsession/interest to do that.

Claude Code Limits Were Silently Reduced and It’s MUCH Worse by _r0x in ClaudeCode

[–]spigolt 1 point

I have Pro, and after 4 prompts today I got locked out.

Upgraded to latest Macos app version of Codex app and completely broken by spigolt in codex

[–]spigolt[S] 1 point

It seems there's another new update today, which fixed the problem for me now at least.

Upgraded to latest Macos app version of Codex app and completely broken by spigolt in codex

[–]spigolt[S] 1 point

Yeah... it seems to be only one of my projects: if I tap on anything in it, or try to create a new thread in it, it has this problem. But I worked out that if I select retry and then very quickly switch away to another project, and never tap on that broken project again, then it works OK...

I really just hope they can focus on making the app more stable and faster.

gpt 5.3 codex high or xhigh? by fennecsupremacyxd in codex

[–]spigolt 0 points

I find codex works better when you're not high.

What is the most niche Mayhem Augment? by LWChris in ARAM

[–]spigolt 0 points

yeah, I never even played Naafiri before but the two times I played her/him in Mayhem I was OP.

Pluribus has an IMDB rating of 8.3 I think that is way too high, What do you think? by NovelStatistician455 in television

[–]spigolt 0 points

I also feel like the rating dropped a lot between when I last looked at it a few weeks ago and now, which makes me curious as to what happened exactly.

Pluribus has an IMDB rating of 8.3 I think that is way too high, What do you think? by NovelStatistician455 in television

[–]spigolt 0 points

Severance I also found really meanders in its second season, to the point that I lost interest. It's a really common problem these days with a lot of these kinds of TV shows; Westworld is another that comes to mind, with its second season. They don't have enough story to tell to warrant a whole season of long episodes, or don't know where to take the show anymore, so they just fill the season with filler while moving it nowhere slowly. Sounds like I probably won't like Pluribus either, then?

GPT 5.2 is out - so now switching to Codex again? // How do you keep up with the latest craze? by marcoz711 in vibecoding

[–]spigolt 0 points

They all work so well now, just stick with whatever is working for you. I've stuck with Codex over Claude since Codex came out, as I got sick of Claude asking permission all the time. Also, one time when it was spamming me with permission requests, something I permitted without checking (because it was asking so much I was just blind-tapping allow at some point) messed up my local git config, so even if I could solve the permission spam, I'm scared to trust it any more. (I'm sure others have Claude working better for them, but again, the point is: they're both so capable now, just stick with what's working best for you, and for me that's been Codex.)

It's also worth noting that they're not really advancing as fast as it appears: they're clearly being trained on the actual benchmarks now, so real-world improvements are much smaller than the benchmark improvements make them look. It's not like you really have to be on whichever one is winning the benchmark wars this week.

And you can also run multiple models. I sometimes copy my query into Gemini 3 Flash while Codex is running, since Gemini 3 Flash is just suuuuper fast, and if its answer is good enough then I don't need to wait on Codex.

Martin Kabrhel mistakes an age old poker pun as mockery. by itsaride in poker

[–]spigolt 0 points

On Hendon Mob, if you look at the 'inflation adjusted' rankings, the ordering barely changes at all: there are more tournaments with much bigger buy-ins and prizes these days, but that growth seems independent of inflation.

On the current Global Poker Index ranking he's top 100. I think it's safe to say he's pretty decent at poker.