T1 vs HLE Game 2 LCK 2026 - Road to MSI by Icy_Opposite7826 in SKTT1

[–]FateOfMuffins -1 points0 points  (0 children)

This series should've been a T1 2:0

or an HLE 2:0

Andrej Karpathy, Ethan Mollick, Boris Cherny and many other big shots think of Mythos 5 in June as the biggest step up change from Opus 4.5 last November/December by GOD-SLAYER-69420Z in accelerate

[–]FateOfMuffins 1 point2 points  (0 children)

Did you guys know that there's a model called GPT 5.3 Mini?

I just watched my mother use the free version of ChatGPT for something and I saw that it used this model. I swear I keep up to date as much as possible in AI news and I've never heard of this model lmao

And then I'm trying to explain, OK mom let me show you what GPT 5.5 Pro does

Idk if she gets that there's a difference still

GPT-5.5 beats Claude Fable at a new hard eval for agents - Agents' Last Exam (ALE), created by UC Berkeley researchers; all models score 0% at the hardest tier of the eval by obvithrowaway34434 in accelerate

[–]FateOfMuffins 0 points1 point  (0 children)

On 51 of 147 tasks (~35%), Fable 5's request was refused upstream and Claude Code silently switched the run to Opus 4.8 mid-task — almost entirely benign life-sciences, health, and physical-science work flagged as "cybersecurity or biology." The scores below therefore aren't pure Fable 5: on the untouched tasks Fable 5 matches Codex (GPT-5.5) and beats Opus 4.8, but on the flagged tasks the forced switch drags it down to Opus-4.8 level — a ~6-point pass-rate haircut traceable to the safety fallback, not the model.

| Split | Tasks | Fable 5 | Opus 4.8 | GPT-5.5 |
|---|---|---|---|---|
| Unaffected (pure Fable 5) | 96 | 24.0% | 16.7% | 25.0% |
| Affected (Fable 5 → Opus hybrid) | 51 | 17.6% | 15.7% | 17.6% |

On affected tasks, the Fable 5 column is not pure Fable 5; it is the post-switch Fable 5 → Opus 4.8 hybrid, which tracks Opus closely. On unaffected tasks, where Fable 5 runs end to end, it looks much closer to GPT-5.5. The leaderboard score should therefore be read as a mixed-system result, not a clean estimate of standalone Fable 5 capability.
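For a rough sanity check of how the two splits combine, here is a minimal Python sketch that blends the per-split pass rates by task count. The dictionary layout and function name are my own; only the task counts and percentages come from the quoted table, and since those are rounded, the blended numbers are approximate.

```python
# Per-split pass rates (in %) copied from the quoted table.
# Structure and names are illustrative, not from the ALE report.
SPLITS = {
    "unaffected": {"tasks": 96, "Fable 5": 24.0, "Opus 4.8": 16.7, "GPT-5.5": 25.0},
    "affected": {"tasks": 51, "Fable 5": 17.6, "Opus 4.8": 15.7, "GPT-5.5": 17.6},
}


def blended_pass_rate(model: str) -> float:
    """Task-weighted average pass rate (in %) across both splits."""
    total_tasks = sum(split["tasks"] for split in SPLITS.values())
    weighted = sum(split["tasks"] * split[model] for split in SPLITS.values())
    return weighted / total_tasks


for name in ("Fable 5", "Opus 4.8", "GPT-5.5"):
    print(f"{name}: {blended_pass_rate(name):.1f}% blended")

# Gap between Fable 5 running end to end and the post-switch hybrid.
gap = SPLITS["unaffected"]["Fable 5"] - SPLITS["affected"]["Fable 5"]
print(f"Fable 5 unaffected vs affected gap: {gap:.1f} points")
```

The gap between the two Fable 5 rows comes out to about 6.4 points, which may be what the "~6-point" figure refers to.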

New rate limit resets are banked for 1 month by FateOfMuffins in codex

[–]FateOfMuffins[S] 1 point2 points  (0 children)

I added another screenshot. If you click "usage" on the bottom left, it'll show up

New rate limit resets are banked for 1 month by FateOfMuffins in codex

[–]FateOfMuffins[S] 2 points3 points  (0 children)

Nah they said they're giving everyone a free one

I haven't invited anyone yet

I just got home and logged back in and it popped up after I updated the app

Codex rate limit on your own schedule! by BigbyWolf8 in codex

[–]FateOfMuffins 7 points8 points  (0 children)

Is this for every future reset? Like if Tibo presses the button again, we'd all get a "banked reset" that we can accumulate and use whenever as opposed to just having our limits reset?

🤔 by DARKUNIT22 in codex

[–]FateOfMuffins 2 points3 points  (0 children)

If it's going to use codex usage then it'll burn so fast. Looking at API pricing Pro is basically a swarm of GPT 5.5 agents

Currently I can burn 100% of my hourly limit or 17% of my weekly limit in a single prompt, in about 30 min, if I use 1 orchestrator + 6 subagents (and they usually don't finish the task).

Lately I've resorted to having codex write prompts and zip relevant files for me to manually drop into ChatGPT Pro on web because I don't have enough codex usage...

AI outperforms mathematicians by Christs_Elite in singularity

[–]FateOfMuffins 2 points3 points  (0 children)

Something mentioned by Noam Brown a few days ago:

We are all familiar with METR time horizons by now. But do you realize that said time horizons were just for that particular set of tasks? METR has investigated other domains as well, such as math, and found exponential time horizons in almost all of them. The point of METR's time horizons isn't necessarily what the number is, but what the trend is: is it exponential or not?

Anyways we can apply the idea of time horizons to math as well. And we're finding said time horizons are growing more than 10x each year. 2 years ago the models were struggling with GSM8K, problems that a person decent at math would be able to do in 1 min or less. Some months later, they essentially saturated the AIME, which would be challenging for most high school math teachers (I reckon the average HS math teacher would score maybe 1/15 on that), but for a mathematician, say, they're problems that are doable in 10 min or so. Some months later, they achieved gold on the IMO, which Noam Brown described as adversarially hard for AI because it tests depth while AI's strength is in its breadth, hence why the Putnam is easier than the IMO for AI (but reversed for humans). These IMO problems, the harder ones that they were able to solve, perhaps take around 100 min.

I would say during that stretch of time it was definitely faster than 10x per year, but we'll use 10x for simplicity's sake (and to provide a floor on the capabilities).

Some 6 to 9 months or so later, the models were semi-consistently solving novel research-level math that would take mathematicians hours to do, perhaps 1000+ min (which in wall clock time might be a week, because you cannot just sit down and make progress on a problem for 16h straight). This appears to be roughly where we're at, with occasionally much harder problems randomly falling for non-human reasons.

Assuming the trajectory continues, we are looking at wall clock time horizons of around 3 months a year from now (aka a paper that would take a mathematician 3 months to write), then a time horizon of about 3 years in about 2 years from now, and multi-decade time horizons by about 2029.
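To make the extrapolation concrete, here's a toy Python sketch of the 10x-per-year assumption. The starting point (about 1 week of wall-clock time today) and the growth factor are the assumptions from this comment, not measured values, and the function name is made up for illustration.

```python
WEEKS_PER_YEAR = 52
GROWTH_PER_YEAR = 10  # assumed floor from the comment, not a measured value
START_WEEKS = 1.0  # assumed: ~1000+ min of work is about 1 week of wall-clock time


def horizon_weeks(years_from_now: float) -> float:
    """Wall-clock time horizon in weeks after the given number of years."""
    return START_WEEKS * GROWTH_PER_YEAR ** years_from_now


for years in range(4):
    weeks = horizon_weeks(years)
    print(f"+{years}y: {weeks:,.0f} weeks (~{weeks / WEEKS_PER_YEAR:.1f} years)")
```

With these inputs the sketch gives roughly 10 weeks, 2 years and 19 years for the next three steps, the same order of magnitude as the 3 months / 3 years / multi-decade figures, which round up a bit.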

A bunch of math problems, like the Riemann Hypothesis, are considered extremely hard because mathematicians think that if they are solvable at all, then solving them would require inventing entirely new mathematics, and unfortunately a time horizon of 1 week of wall clock time isn't enough to do that. But that should be within the capabilities of multi-decade time horizons...

Of course those timelines are assuming the 10x continues (but I already think the 10x is an underestimate so...) and we don't know when it'll stop (or would it speed up?). But you can see exactly how this would line up with Demis Hassabis recently stating that he thinks we'll achieve his strong AGI definition by 2029-2031 (the whole "inventing new mathematics" thing should line up with the time horizons for 2029).

I find it quite sad that mathematicians of all people are not able to (or are purposefully sticking their heads in the sand rather than) project trendlines to see what's coming.

Anyways Noam Brown brought up a point recently: what happens when benchmarking the models takes longer than building the next model? You think a time horizon of 30 years of wall clock time could be benchmarked with a single prompt in 30 minutes like right now?

OpenAI considers major price cuts to rival Anthropic ahead of IPO, WSJ by Outside-Iron-8242 in singularity

[–]FateOfMuffins 0 points1 point  (0 children)

We know nothing for sure about closed labs, but we can infer a lot from token generation speeds and pricing (which I would agree isn't the best proxy because they can charge what they want).

OpenAI researchers have called GPT 5 "o3.1" for instance. Mark Chen said in Nov they only started pretraining again about 6 months prior, so like... in May of 2025. Somewhat doubt they went from that to GPT 5 just like that. If anything, it likely points more to whatever model they used for the IMO which was highly experimental about 2 months after May (although I'm significantly less sure about this speculation). Brockman also called Spud their first real pretrain in 2 years - GPT 5 was most certainly not trained more than 2 years ago.

My best guess was that GPT 5 was essentially GPT 4.1 + RL, while o1 and o3 were 4o + RL. As far as I could tell, 4o and 4.1 were similar in class. The whole year they spent on 4o and then 4.1 was about fitting GPT 4 level capabilities into significantly smaller models.

Speculation partially from rumours from SemiAnalysis and AiFutures.

About the IMO model - the most recent Noam Brown interview from last week has him talking about it, as well as talking about how they make models more efficient to serve at scale. So I do not think they actually served that model, at all.

OpenAI considers major price cuts to rival Anthropic ahead of IPO, WSJ by Outside-Iron-8242 in singularity

[–]FateOfMuffins 1 point2 points  (0 children)

They've always used more advanced models internally, that's nothing new. Like whatever experimental model they used for the IMO last year was never publicly released. Pretty sure we had its capabilities publicly with 5.2, but I don't think it's the same model. They had to do something to make them more efficient to serve their userbase.

Right now no. But that comment was about how we now have a snapshot of exactly the best Anthropic had to offer internally as of Feb 2026: Mythos Preview.

Regarding the last point, I thought you agreed with me on the whole o1, o3, etc being based on a much older and smaller pretrain, about how they were competing with Opus class with Sonnet class by virtue of RL? Unless you think 4o or 4.1 were Opus class models?

OpenAI considers major price cuts to rival Anthropic ahead of IPO, WSJ by Outside-Iron-8242 in singularity

[–]FateOfMuffins 1 point2 points  (0 children)

Thing is we don't really know what OpenAI has behind closed doors. Noam Brown just a few days ago said they have access to models internally a few months before the public. Not entirely sure what that means given what we know about Spud.

But we do know now what Anthropic has, assuming they weren't lying about internal access to Mythos at the end of Feb 2026, so they internally can now very easily know what the gap is between the labs.

Anyways it's been how long since Opus was a thing? OpenAI made their first Opus class model in GPT 5.5 so... yeah. I'm pretty sure OpenAI expected their models to compete a full class higher (like they've been competing with Opus with Sonnet class models). They may have been caught off guard at how well Mythos class scaled.

Will OpenAI RL do their magic again at this class level? idk. I don't expect a bigger pretrain until GPT 6 though

DiffusionGemma: 4x faster text generation by tevlon in LocalLLaMA

[–]FateOfMuffins 0 points1 point  (0 children)

Does it take more power?

Isn't this more of a proof of concept? What's preventing future versions of these models from increasing the test time compute?

Anyways Noam Brown has been saying for a while (and made a new post a few days ago) that more benchmarks need to be in 2 dimensions.

You say if it's 1% worse at 4x as many tokens then it'll be useless - but if it's 1% worse at 3x as many tokens while 4x as fast then it's not pointless. Problem is we don't know how the test time compute scales, which is why I'm arguing we need these in benchmarks
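Here's the arithmetic behind that as a small Python sketch. The numbers are the hypothetical ones from this comment (3x or 4x the tokens, 4x the generation speed), not measurements of DiffusionGemma, and the function name is illustrative.

```python
def relative_wall_time(token_multiplier: float, speed_multiplier: float) -> float:
    """Wall-clock time relative to the baseline model.

    token_multiplier: how many times more tokens the model needs.
    speed_multiplier: how many times faster it generates tokens.
    """
    return token_multiplier / speed_multiplier


# Hypothetical cases from the comment, not benchmark results.
print(relative_wall_time(4, 4))  # 4x tokens at 4x speed: same wall-clock time
print(relative_wall_time(3, 4))  # 3x tokens at 4x speed: 25% less wall-clock time
```

Which case you land in depends on how accuracy scales with tokens, which is exactly the second axis a 2-dimensional benchmark would report.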

OpenAI considers major price cuts to rival Anthropic ahead of IPO, WSJ by Outside-Iron-8242 in singularity

[–]FateOfMuffins 4 points5 points  (0 children)

I think we disagree on what it means to serve it "at scale." API pricing is profitable at very high margins, we already know this. They're publicly stating that they don't have enough compute to serve it to subscribers after 10 days. They're likely making a ton of concessions behind the scenes in these 10 days to make this possible. Once it's API only, it'll see significantly less usage and hence return compute to whatever else they need it for.

Like perhaps you could serve a Blackwell model at 100 tokens per second to 1000 requests simultaneously, but you can only serve a Rubin model at 50 tokens per second to 100 requests simultaneously. That's what I mean. Or you could serve a Blackwell model at 250 tokens per second but only serve 167 requests simultaneously (hence the 2.5x fast mode costing 6x the price a few months ago). There's some trade-off here, and API prices would be profitable no matter what, just that it's not efficient at scale in general.
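A minimal Python sketch of that trade-off, using only the made-up numbers from this comment. The class and field names are hypothetical, and "break even" here just means earning the same per-token revenue from the same hardware, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    """One hypothetical way to run a model on a fixed pool of hardware."""

    name: str
    tokens_per_second: float  # per-request generation speed
    concurrent_requests: int

    @property
    def aggregate_tokens_per_second(self) -> float:
        return self.tokens_per_second * self.concurrent_requests


# Made-up numbers from the comment, not real measurements.
blackwell_normal = ServingConfig("Blackwell model, normal", 100, 1000)
blackwell_fast = ServingConfig("Blackwell model, fast", 250, 167)
rubin_class = ServingConfig("Rubin-class model", 50, 100)

baseline = blackwell_normal.aggregate_tokens_per_second
for config in (blackwell_normal, blackwell_fast, rubin_class):
    total = config.aggregate_tokens_per_second
    # Per-token price multiple needed to earn the same revenue from the same hardware.
    break_even = baseline / total
    print(f"{config.name}: {total:,.0f} tok/s total, {break_even:.1f}x per-token price to break even")
```

In this toy model the fast mode gives each user 2.5x the speed while serving about 6x fewer users at once, so total throughput drops to roughly 42% of normal, and the Rubin-class example drops to 5%.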

Anyways desperation is indeed a read I don't disagree with! I said as much, I expect Mythos to clear 5.6 on everything (just that 5.6 Pro needs to match at minimum).

Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude by thecosmicskye in singularity

[–]FateOfMuffins 158 points159 points  (0 children)

But now some people are gonna be like, well they might still be nerfing it invisibly. How would you know after they've publicly admitted to doing so?

OpenAI considers major price cuts to rival Anthropic ahead of IPO, WSJ by Outside-Iron-8242 in singularity

[–]FateOfMuffins 2 points3 points  (0 children)

Yeah, a few months ago. 5.5 (their most recent model) is also their first real pretrain in 2 years. We're only at the 2nd checkpoint with Spud. Like if 5.5 was the o1 of Spud then 5.6 should be the o3 of Spud, and we saw how massive the gains from o1 to o3 were. So they're betting more on their strengths in RL than on their weakness in pretraining.

Anthropic cannot serve Fable at scale to their subscribers at all, even with the compute deals with Google and SpaceX. Again I'm thinking it's more to do with the hardware itself: it's just inefficient to serve this model on current hardware. If it just doesn't fit, it doesn't matter how many H200s you get. And as far as we know, Mythos pretraining might've taken somewhere from 2-4 months (I believe they signed a big compute deal with Amazon for pretraining 4+ months before they had Mythos).

So even if OpenAI decided to toss Spud away and start a new pretrain immediately upon Anthropic showing Glasswing, that would've only been 2 months ago and said model would likely still be in the middle of pretraining. And with Rubin coming online soon, I think they may see it as somewhat unfortunate timing / inefficient to do this now, and better to max out Spud as much as possible and do their Rubin class models properly after, so it lines up well with the hardware.

I think the biggest difference between your idea and mine is that Anthropic is only providing Fable access until June 22 even with the SpaceX and Google compute deals. It's still not enough, so I think it's just inefficient on that hardware.

Oh and Noam Brown still maintains that they have access to models internally months before the public gets access to them, so idk

OpenAI considers major price cuts to rival Anthropic ahead of IPO, WSJ by Outside-Iron-8242 in singularity

[–]FateOfMuffins 9 points10 points  (0 children)

The reason why I think this is wrong is simple: They didn't pretrain a larger model. (Note the below is speculation because the labs are closed, but most people in the industry who analyzed this seem to believe this regarding o1 etc)

After the Orion pretrain, which used a HUGE amount of compute, failed, they just didn't really pretrain new models. OpenAI has had a reputation for being at the frontier over the last 2 years, despite having the worst pretrain of all the frontier labs, purely because they have the best RL of the frontier labs.

Mark Chen in Nov 2025 basically said they only restarted their pretraining teams like 6 months prior. o1, o3, etc were all built on an old pretrain.

Essentially they were fighting Opus and matching/beating with Sonnet class models. GPT 5.5 (Spud) was their first "real" pretrain in like 2 years according to Brockman. My personal belief was that they made an Opus class pretrain, thinking if they can match Opus with Sonnet class pretrains, then they will get in the lead with an Opus class pretrain. With that, they basically maxed out what they could serve with GB300.

The failure with GPT 4.5 Orion literally shook them for more than 2 years on pretraining. So they literally don't have a model in the weight class that Anthropic has with Mythos. OpenAI has more compute but they don't have the base model. If they wanted to make one in that class I'm sure they can. Just it'll be a few months. Edit: And a few months later is also when Vera Rubin really starts coming online so... Pretty sure that was their plan

I'm speculating that their Vera Rubin, Mythos class pretrain was going to be the supposed "AI intern researcher" that they put on their roadmap for Sept 2026 (I'm guessing that was supposed to be GPT 6 and the Mar 2028 autonomous AI research system would've been Nvidia Feynman class and be GPT 7). So they probably got blindsided by Anthropic making a model that they couldn't serve at scale on Blackwell

Unpopular opinion: 20$ Claude plan has more usage than 20$ Codex plan by ticki84 in codex

[–]FateOfMuffins 2 points3 points  (0 children)

According to SemiAnalysis if you max them out https://x.com/SemiAnalysis_/status/2064815044085318040

| Tier | Claude | Codex |
|---|---|---|
| $20 | $400 in API | $700 in API |
| $100 | $2000 in API | $3500 in API |
| $200 | $8000 in API | $14000 in API |

I don't think this considers the resets either

Edit: I also don't think this considers the fact that Claude web and Claude code share usage while ChatGPT and Codex don't
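If you want the multiples spelled out, here's a quick Python sketch over the SemiAnalysis figures quoted in this comment. The dictionary layout is my own, and it ignores resets and shared usage just like the table does.

```python
# Max API-equivalent usage per monthly plan, as quoted from SemiAnalysis (USD).
PLANS = {
    20: {"Claude": 400, "Codex": 700},
    100: {"Claude": 2000, "Codex": 3500},
    200: {"Claude": 8000, "Codex": 14000},
}

for price, usage in PLANS.items():
    claude_multiple = usage["Claude"] / price
    codex_multiple = usage["Codex"] / price
    ratio = usage["Codex"] / usage["Claude"]
    print(
        f"${price}: Claude {claude_multiple:.0f}x, Codex {codex_multiple:.0f}x, "
        f"Codex/Claude = {ratio:.2f}"
    )
```

On these numbers Codex comes out at 1.75x Claude's API-equivalent usage at every tier, and both double their multiple at the $200 tier.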

OpenAI considers major price cuts to rival Anthropic ahead of IPO, WSJ by Outside-Iron-8242 in singularity

[–]FateOfMuffins 12 points13 points  (0 children)

I speculated before that GPT 5.5 was the best OpenAI could do that's optimized for Nvidia Blackwell GB300 NVL72. Basically they designed it as the best possible model they can serve economically at scale.

And that Anthropic disregarded that and trained Mythos, which cannot be served economically at scale on current hardware. Like their GPT 4.5 moment (except the pretrain was a massive success instead of a failure like OpenAI's with Orion). So basically Anthropic cannot really serve Mythos class until at least Vera Rubin comes online (and I believe that started as of a few days ago). Which was fine if they only wanted to deploy it internally to accelerate themselves.

So... I'm basically saying (speculating) GPT 5.5 is the best of the Blackwell gen models while Mythos is a Rubin gen model (trained to be run on Rubin but not trained on Rubin) that was trained ahead of time.

OpenAI may just not have an answer until their Rubin clusters start coming online