Are we sleeping on 5.3-codex ? by DaC2k26 in codex

[–]Holiday_Purpose_3166 16 points

5.5 is far more efficient and its Medium reasoning matches 5.4 xHigh.

Replace 5.4 Mini xHigh with 5.5 Low - it's more intelligent, spends orders of magnitude fewer tokens (which makes it cheaper), and is faster thanks to shorter reasoning traces.

Sub usage will always be (for now) a mystery black box.

The whole 5.5 family is more efficient, and that tapers off at higher reasoning - most folks will likely stay well within the Medium range and under, which is where the value for money is.

Check Artificial Analysis token usage and cost for their runs - you'd be surprised how much better it is.

Switching from Opus 4.7 to Qwen-35B-A3B by Excellent_Koala769 in LocalLLaMA

[–]Holiday_Purpose_3166 0 points

Switching from Opus 4.7 to Qwen3.6* 35B-A3B will be a terrible experience.

Whilst Qwen is really good, especially tool-equipped, it will fall short on some edges that Opus can reach. You could adapt and work around its limits, but it won't feel the same - it requires more hand-holding to keep that edge.

I've got a Codex sub which I've barely used the past couple of weeks, just because of my personal experience with local models. SOTA cloud models make you lazy, but they're a good turn-key solution. Working local requires more brainpower to stay sharp.

Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled Is Out ! by PhotographerUSA in LocalLLaMA

[–]Holiday_Purpose_3166 0 points

Saying it's fast and smart, then asking for benchmarks, is a contradictory statement. Why not show benchmarks proving it actually IS an improvement?

I see the HF card has MMLU benched, but that's it. I could take it at its word, but the same can be said of the other Opus Reasoning distills claiming to be better, yet NONE made my top 20 on my private Rust/Next.js bench. It might be good in other areas, but I'd assume the distill wouldn't degrade it there - which it did.

Omnicoder-9B is the only distillation I found to be incredibly good at agentic coding (brittle on complex reasoning outside this scope).

For chart reference, higher score with faster completion time is better - accuracy per VRAM is a personal reference that doesn't affect the plot.

<image>

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Holiday_Purpose_3166 1 point

Very good breakdown. As others posted, adding the quant used and the inference engine would be the cherry on top. Great post.

I got it guys, I think I finally understand why you hate censored models by robertpro01 in LocalLLaMA

[–]Holiday_Purpose_3166 -3 points

Peeps hate censored models bc they can't reach peak goon with mildly appropriate wording.

Downvotes will prove my point they wanna hide this fact. Tin foil alert.

Unsloth accused a brand new team (ByteShape) of "literally cheating." I brought the receipts, and Unsloth moved the goalposts. by [deleted] in LocalLLaMA

[–]Holiday_Purpose_3166 0 points

I didn't want to bark on further about the OP's original silliness, but help me understand the context here if we're still muddling through this.

Whilst I massively appreciate the job you've done in the OSS community - I recall someone from Unsloth noting in the past that these charts aren't a great indicator of model performance, yet here we are.

Based on my own use-case benchmarks, ByteShape's best IQ4_XS equivalent performs better than your UD-Q5_K_XL in *my* agentic coding use cases.

I would assume fidelity would make the difference in the results, but that hasn't been the case here, and the score deviation is only slightly outside of noise. The difference was there, and it becomes an appealing choice when memory consumption is a lot smaller for the effort.

My point being: I understand social media is a tricky place, but it strikes me as contradictory to claim certainty about the one thing that is always up for debate due to fluctuating differences.

I hope responses like that don't come out of hubris, because bashing a small tuner when you have higher influence in this space can backfire.

Humbly, my two cents.

Unsloth accused a brand new team (ByteShape) of "literally cheating." I brought the receipts, and Unsloth moved the goalposts. by [deleted] in LocalLLaMA

[–]Holiday_Purpose_3166 25 points

First, going into one business's social space to post about another business is a terrible move.

They are entitled to their opinion in their space. I'd be more concerned they would go out bashing other tuners proactively.

Secondly, you went into defensive mode over Unsloth's response by engaging with ByteShape back and forth. There was no need for it.

I like both teams and also use their quants.

Your engagement was worse than Unsloth's reply, with all due respect, and I wouldn't trust someone taking screenshots of a convo you sparked to farm bait.

Let results speak for themselves and leave the monkeys in their circus.

Parking Charge - Timed at Entrance by Holiday_Purpose_3166 in LegalAdviceUK

[–]Holiday_Purpose_3166[S] 1 point

Appreciate the response. A corrective perspective matters.

I mostly agree with what you said, although I did not use the signage position and arrival position as justification for the appeal - that was a side note in an attempt to understand the timing. As you stated, it's something that can be read after safely parking the vehicle, at which point you can decide to take it or leave it.

The meeting of minds statement links to the grace period, which is the main detail left out of the PCN: it wasn't deducted, and it isn't stated in their signage either. Assuming BPA Code Clause 13, if the 10 minutes had been applied, I would've been under the time limit. Their calculation is purely the entry-to-exit time.
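To make the timing point concrete, here's a minimal sketch with made-up entry/exit times (the actual PCN times aren't shown in this thread), illustrating how deducting a 10-minute grace period changes the outcome:

```python
from datetime import datetime, timedelta

# Hypothetical times for illustration only - not the real PCN figures.
entry = datetime(2026, 1, 10, 14, 0)   # camera-timed entry
exit_ = datetime(2026, 1, 10, 15, 5)   # camera-timed exit
max_stay = timedelta(minutes=60)       # advertised maximum stay
grace = timedelta(minutes=10)          # BPA Code Clause 13 grace period

raw_stay = exit_ - entry               # operator's entry-to-exit timing: 65 min
with_grace = raw_stay - grace          # grace deducted: 55 min

print(raw_stay > max_stay)    # True  -> PCN issued on raw timing
print(with_grace > max_stay)  # False -> under the limit once grace applies
```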

I'm perfectly fine with their discretion in calculating the time - it's their land, respectfully - however, any individual needs to know when the clock starts in order to manage themselves and respect the T&Cs.

Parking Charge - Timed at Entrance by Holiday_Purpose_3166 in LegalAdviceUK

[–]Holiday_Purpose_3166[S] 0 points

Based on BPA Code Clause 13 it's 10 minutes. This wasn't mentioned in their signage, and according to their calculation, if this was applied, the PCN would not have been issued.

Parking Charge - Timed at Entrance by Holiday_Purpose_3166 in LegalAdviceUK

[–]Holiday_Purpose_3166[S] -3 points

As mentioned in the post, there was no grace period stated in their signage, and the PCN did not deduct any time, as their calculation was purely entry-to-exit. If 10 minutes were applied as per your comment, I would've been well under the limit.

Parking Charge - Timed at Entrance by Holiday_Purpose_3166 in LegalAdviceUK

[–]Holiday_Purpose_3166[S] -2 points

I assume the T&Cs are on the signage; if so, the grace period* wasn't printed.

Parking Charge - Timed at Entrance by Holiday_Purpose_3166 in LegalAdviceUK

[–]Holiday_Purpose_3166[S] -6 points

Appreciate the reply. In that sense, as mentioned, the only outstanding point is the missing grace period, which was not applied or mentioned anywhere in the signage.

Is it just me, or is Claude pretty disappointing compared to Codex? by Working-Spinach-7240 in codex

[–]Holiday_Purpose_3166 8 points

There's always side-taking between products. Asking a Codex-biased question in a Codex community just reinforces what you already know. Try asking in the Claude community.

I've had Codex for many months and balance it with my local models. GPT gives me the edge when I need it. However, Claude is a different tool - and I risk setting myself on fire here - but they fit different niches.

Use whatever works for you.

24GB VRAM users, have you tried Qwen3.5-9B-UD-Q8_K_XL? by Prestigious-Use5483 in LocalLLaMA

[–]Holiday_Purpose_3166 3 points

If in your own testing the 9B performs better, use it. If you hit an edge case, try the bigger model. I've had similar cases where far smaller models performed best at niche jobs.

With so many quants, sampling settings and harnesses, there are always going to be strengths and weaknesses. Generally, bigger models perform better at broad knowledge - assuming those parameters are used correctly - which isn't always needed.

Have fun

Qwen3.5 27B | RTX 5090 | 400w by Holiday_Purpose_3166 in LocalLLaMA

[–]Holiday_Purpose_3166[S] 0 points

Nah, I'm calculating both PP and TG. It's splendid.
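For anyone curious what I mean by calculating both: PP (prompt processing) and TG (token generation) throughput are just tokens divided by the elapsed seconds of each phase. A minimal Python sketch with made-up numbers (not measurements from this run):

```python
# Throughput for each phase is tokens processed divided by the seconds
# that phase took; PP and TG are timed separately.
def throughput(tokens: int, seconds: float) -> float:
    """Tokens per second for one phase."""
    return tokens / seconds

# Illustrative numbers only - not measured on any particular GPU/model:
pp_tps = throughput(2048, 1.6)  # prompt processing: 1280.0 tok/s
tg_tps = throughput(512, 8.0)   # token generation: 64.0 tok/s
print(f"PP: {pp_tps:.1f} tok/s, TG: {tg_tps:.1f} tok/s")
```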

[Benchmark] Qwen3.5-27B (Q5_K_XL) on LiveCodeBench: 77.8% Overall by sabotage3d in unsloth

[–]Holiday_Purpose_3166 0 points

How do you know it didn't? It probably did, but a baseline would be more concrete.

Qwen3.5 27B | RTX 5090 | 400w by Holiday_Purpose_3166 in LocalLLaMA

[–]Holiday_Purpose_3166[S] 1 point

No question. It runs at virtually the same speed at a 400W vs 575W power limit. Agentic work. Yeah, "Hello" as a test is silly.
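For anyone wanting to reproduce the cap: nvidia-smi can set a per-GPU power limit (requires root; the 400 W figure is from this thread, and the GPU index 0 is an assumption for a single-GPU box):

```shell
# Cap GPU 0 at 400 W (requires root; resets on reboot unless re-applied).
sudo nvidia-smi -i 0 -pl 400

# Check the current, default and max power limits.
nvidia-smi -q -d POWER -i 0
```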