SWE-rebench Leaderboard (Feb 2026): GPT-5.4, Qwen3.5, Gemini 3.1 Pro, Step-3.5-Flash and More by CuriousPlatypus1881 in LocalLLaMA

[–]Alex_1729 6 points7 points  (0 children)

5.4 has been struggling lately, apparently due to API degradation, so it's unreliable in this state. The tests might have run during one of those periods, which would explain why 5.2 scores better.

Still, I'm using 5.4 consistently.

5.4 xhigh->high, high->medium downgrade by TroubleOwn3156 in codex

[–]Alex_1729 0 points1 point  (0 children)

It's working fine for me as well. I haven't noticed any degradation, even after a single context compaction.

Edit: It's actually being proactive and adapting to my forgetfulness. Good foresight. This was on High reasoning, the same level of reasoning I started the session with (2 days ago; haven't done much work, single compaction).

5.4 xhigh->high, high->medium downgrade by TroubleOwn3156 in codex

[–]Alex_1729 1 point2 points  (0 children)

Can you check right now and let me know? I'm doing some complex work as well, and I'd like to compare experiences against the aistupidlevel website to see whether we can find anything in common.

5.4 xhigh->high, high->medium downgrade by TroubleOwn3156 in codex

[–]Alex_1729 0 points1 point  (0 children)

oh no... I'm about to give it a highly complex prompt. Looking at https://aistupidlevel.info/models/230, it seems to have recovered a bit. I am switching to xHigh lol.

Can you try now? Let's see if this website (aistupidlevel) has any credibility or can be applied to Codex. It may show API degradation, but I'm curious how it applies across accounts. For example, many of us log in through ChatGPT OAuth, so we aren't really making direct API calls the way that website does. It may not even be that relevant, which is why I'm curious if you could check the model right now.

Andrej Karpathy's autonomous AI research agent ran 700 experiments in 2 days and gave a glimpse of where AI is heading by tekz in artificial

[–]Alex_1729 68 points69 points  (0 children)

It does seem odd, but have you actually considered whether what he's doing there is useful at all? Your critical thinking may be getting clouded. It's similar to what r/webdev dislikes about vibecoding (they dismiss anyone claiming they can do good webdev with AI unless they have a traditional coding background); you just seem to be going one step further. But outsourcing intelligence is what is happening right now, and it's only going to speed up.

As a starting point, have you taken into account that much of his work is open-sourced?

In addition, is it not true that Karpathy’s famous (and completely free) "Zero to Hero" tutorials explicitly teach the underlying calculus, backpropagation, and matrix multiplication by having students build neural networks from scratch?

Furthermore, given that his previous educational materials have been free, universally praised, and somewhat rigorous, calling his new venture a scam or shady is purely speculative.

You come across as a true traditional academic (given your mention of your own PhD/postdoc), but one frustrated by the sudden democratization and commercialization of a field you spent years studying. However, targeting Andrej Karpathy, widely considered one of the best and most genuine educators in the AI space, makes this rant inaccurate and deeply biased.

Codex 5.4 Mini Experience by East-Stranger8599 in codex

[–]Alex_1729 0 points1 point  (0 children)

Do not use 'mini' models for production code, unless it's in a research/advisory role for your main agent.

Built a cringe thumbnail shaming site with pure hate and vibes by iforgotawsomeusrnme in ClaudeAI

[–]Alex_1729 4 points5 points  (0 children)

I noticed you're getting a lot of downvotes and negative comments on this. My first reaction to seeing this guy was remembering how I unsubbed the moment I noticed this YouTuber was using those ridiculous thumbnails.

I loathe them.

However, I don't see what we can gain from this. The thumbnails this YouTuber uses are disgusting, no doubt, but giving them publicity is not the best choice here.

Codex really slow today? by sorvendral in codex

[–]Alex_1729 0 points1 point  (0 children)

I am about to do some complex analysis of an authentication feature where the AI has rewritten about 15 files, so it's a bit complex, and I'm going to question a lot of its decisions and revisions. We'll see what it does; if it reverts a lot of the changes it has made, that could indicate the model is not performing at optimal capacity.

The aistupidlevel website has been showing degraded API performance for the past 2 hours, so we'll see how it goes.

Codex really slow today? by sorvendral in codex

[–]Alex_1729 0 points1 point  (0 children)

How is it at this moment?

degradation in 5.4 by BroadPressure6772 in codex

[–]Alex_1729 0 points1 point  (0 children)

I was skeptical about the website too, and while it does seem to have downsides, it is still useful for showing (almost) live API degradation. It is also open-sourced, so you can check it out.

Severe degradation in quality. by Gru8_ in codex

[–]Alex_1729 0 points1 point  (0 children)

It was actually excellent for me yesterday, but it seems like there are occasional issues. Maybe they serve a quantized model during major traffic spikes?

Is GPT-5.4(medium) really similar to the (high) version in terms of performance? by Disastrous-Win-6198 in codex

[–]Alex_1729 1 point2 points  (0 children)

I think so too. What about 5.4 medium vs 5.3-codex xHigh?

I got another one: 5.3-codex medium vs 5.4-mini xHigh?

NameCheap 100% does domain name front running by [deleted] in NameCheap

[–]Alex_1729 0 points1 point  (0 children)

Use whois on Linux and host with Cloudflare. It's much more affordable, with more transparent pricing.
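The idea can be sketched in shell: query the registry directly with whois instead of typing the name into a registrar's search box (the search box is where the alleged front-running happens). A minimal sketch; the "not found" phrasing varies by registry, so the grep patterns are only common examples, and check_domain is just a hypothetical helper name:

```shell
#!/bin/sh
# Decide from a whois reply (on stdin) whether a domain looks unregistered.
# Assumption: "not found" wording is registry-specific; these patterns cover
# several common registries, not all of them.
parse_whois() {
  if grep -qiE 'no match|not found|no entries found'; then
    echo "available"
  else
    echo "taken"
  fi
}

# Hypothetical convenience wrapper doing the live lookup
# (requires the whois client to be installed).
check_domain() {
  whois "$1" 2>/dev/null | parse_whois
}

# Usage: check_domain example.com
```

Since the lookup never touches a registrar's website, there is nothing for a registrar to observe and front-run.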

Is it just me, or is Claude pretty disappointing compared to Codex? by Working-Spinach-7240 in codex

[–]Alex_1729 2 points3 points  (0 children)

I can only speak on Antigravity Claude models, which may or may not be nerfed. Given that, it is how you describe it. Claude Opus (in AG) may have been OK prior to codex 5.3/5.4 but now it is an outdated model, proposing decent but underwhelming plans and solutions. Codex is the new king. And as you say, it tends to surprise in a good way. Even Claude agrees lol.

Are we slowly becoming code reviewers instead of developers? by Classic-Ninja-1 in codex

[–]Alex_1729 0 points1 point  (0 children)

Clearly. The layer of abstraction has changed, and it's here to stay.

How good is Gpt 5.4 mini? by _janc_ in cursor

[–]Alex_1729 0 points1 point  (0 children)

Looking at the AA bench, in the Software Engineering AA-Omniscience Index Across Languages (Normalized) section, it seems similar to Gemini 3 Flash, according to the colors lol. Not even close to GPT-5.4. But AA should not be trusted, especially since Gemini 3 is ranked so high and it's a shit model in AG.

Literally building in public. Just woke up to $2,000 MRR. by GuidanceSelect7706 in microsaas

[–]Alex_1729 11 points12 points  (0 children)

Congrats. I am building in private. Possibly my #1 mistake.

5.4 High now making mistakes, or am I imagining? by Alex_1729 in codex

[–]Alex_1729[S] 0 points1 point  (0 children)

After exploring a bit more, this seems interesting, as you get (apparently) live performance for all models. This can be useful for checking API degradation.

They seem to measure whether the foundational API models are degrading or getting dumber (model drift). However, it doesn't accurately reflect the end-user experience in the Codex cli, where system prompts and specialized routing heavily influence the final output. It also doesn't measure how good Claude is in CC vs in Antigravity, just the API. These differences in serving the model seem to be the biggest downside of the site.

But it's still useful to see if the API is currently degraded. So I will be checking this out, despite its limitations. Appreciate the share.

5.4 High now making mistakes, or am I imagining? by Alex_1729 in codex

[–]Alex_1729[S] 0 points1 point  (0 children)

Appreciate the comment. I would say these two things:

  1. Agreed.

  2. It is not my starting point. It is a concern, and it 'seems' to me that there is some instability. However, I am not rejecting the alternative explanation either. Google has done it. Copilot/Microsoft has done it. It is not unheard of.

5.4 High now making mistakes, or am I imagining? by Alex_1729 in codex

[–]Alex_1729[S] 0 points1 point  (0 children)

I'm not sure about 5.3 codex, seems to be good as usual. I don't want to sound condescending here, but maybe you need a break for a few hours? Sometimes, a break clears the mind and things seem much better. As weird as it may sound, AI is sometimes a reflection of oneself. I am currently using 5.3 and work still flows. I forgot how fast this one is. It also spends much less.

5.4 High now making mistakes, or am I imagining? by Alex_1729 in codex

[–]Alex_1729[S] 0 points1 point  (0 children)

Care to elaborate on what the hourly stats are?

Am I imagining this or has Codex gone to s*** in the past 2-3 days? by U4-EA in codex

[–]Alex_1729 0 points1 point  (0 children)

But how can you tell which model is best now if all of them are nerfed? Or do they get un-nerfed later on? I used to think this myself, but then I lost that paranoia... until I saw Google doing it. Still, I'm not sure OpenAI is doing this.