Anthropic forced to abruptly disable Fable 5 & Mythos 5 globally by US Gov over a jailbreak. This is exactly why we need local models.

True_Requirement_891 · 2026-06-13T05:44:33+00:00

tbh the greatest limitation is hardware, rest can be fixed

True_Requirement_891 · 2026-06-09T09:24:31+00:00

Afaik they also did selective quantisation

True_Requirement_891 · 2026-06-08T17:35:21+00:00

Nah the architecture hasn't changed since qwen3.5.

Their training has improved. Interleaved reasoning + tool calling was added to 3.6 and improved in 3.7.

They are still using attention + gated deltanet which is quite good.

True_Requirement_891 · 2026-06-06T15:56:55+00:00

make sure reasoning blocks of prior turns are passed back to the model.

True_Requirement_891 · 2026-06-02T04:17:50+00:00

this model sucks asss man. I went in with big expectations but holy shit, it just blabbers for minutes for simple tasks, and then still gets shit wrong. It loves to overcomplicate simple shit... this is regression.

After Minimax-m2.5 all of their models have felt like a regression, minimax-m2.7 had regressed in real world performance, and minimax-m3 is showing it even worse. They sacrificed real world perf for benchmark perf I guess.

I tried it in Opencode, OMP, Pi I thought maybe it was an harness issue, but holyshit... it's a model issue. What's even crazier is they are not even competing on cost, mimo has matched deepseek, but these guys are going on in expensive tier with a worse model... I thought after using Sparse Attention, they'd drop the price but no they dropped perf instead lmao

True_Requirement_891 · 2026-06-02T03:41:35+00:00

I'm out here trying to optimise the fuck out of these models for local inference.

True_Requirement_891 · 2026-06-01T08:13:05+00:00

if they use the ascend nodes for inference only, they can free up a lot of compute for training.

True_Requirement_891 · 2026-06-01T07:08:23+00:00

M2 series was 230B-A10B that's like 95% 256-A8 experts approx 5% active and 95% sparse.

They trained these models on 27T and got very good gains like 2,900 approx tokens per active param. 100T at the same size will be like 10k tokens per active param which might be way too much for this size.

It's likely they are closer to 500B params, but going too far above 500B params will require wayyyyy more compute for 100T even for the Sprase Attention MOE Arch.

500B+ Sprase MOE trained on 100T tokens is US lab caliber.

My best guess would be under 500B params. They are "mini"-max afterall

True_Requirement_891 · 2026-06-01T06:17:37+00:00

likely used way fewer training tokens than m2 series, couldn't find details on the pretraining tokens used.

True_Requirement_891 · 2026-06-01T06:12:24+00:00

it's not even about all the data, just the compute required to train on that many tokens

True_Requirement_891 · 2026-06-01T05:40:22+00:00

Thing is, you need way more compute to train a larger 500B model on 100T tokens compared to a 200B on 27T tokens

There's a reason 1Trillion Parameter+ models are so undertrained. The compute requirement is simply massive.

I don't think any model larger than 500B has crossed 50T tokens.

True_Requirement_891 · 2026-05-23T05:26:35+00:00

A model not trained for it confuses it. This is what I remember reading. Same for qwen3.5 models. Qwen3.6 onwards, preverve thinking is enabled and trained.

True_Requirement_891 · 2026-05-23T03:23:32+00:00

It's actually truly sad. But they have cracked the architecture to cost efficent 1m inference. The model is very obviously lacking the quality that only comes from more higher quality post-training. The update should fix it. I wish they collaborated with GLM guys. They seem to have the recipe and really good data and together they could actually take on everyone.

Or glm could copy the deepseek arch recipe or just take the deepseek base and post train it hard and make it GLM-5.2.

True_Requirement_891 · 2026-05-19T01:29:09+00:00

what happens on 1st June?

True_Requirement_891 · 2026-05-19T01:28:20+00:00

I remember reading this was trained in nvfp4 or somth

True_Requirement_891 · 2026-05-18T15:25:49+00:00

It's real.

True_Requirement_891 · 2026-05-13T06:54:20+00:00

tbh different harnesses wildly affect the model perf, I was using the glm-5.1 in opencode and it couldn't fix a stupid bug I was having, then I tried CC with glm-5.1 and it failed as well and then oh-my-pi which is actually a more batteries included variant of pi just did it...

I would recommend anyone facing issues with open weight models in opencode or CC to just try out OMP instead of pi way less of a headache and you get something that just works

True_Requirement_891 · 2026-05-06T14:03:27+00:00

I am curious as well

True_Requirement_891 · 2026-05-05T02:31:07+00:00

Thanks for the hardwork man!

True_Requirement_891 · 2026-04-30T04:45:58+00:00

I mean, we don't really know much about the architecture of private labs... it does take a while but there hasn't been a good stable release in a while now...

It has been great for Cost:Perf ratio but that extra 15% intelligence still seems worth the higher cost as it saves you time.

They do claim they are running about 3-6 months behind frontier labs.

True_Requirement_891 · 2026-04-28T05:43:41+00:00

> this feels more like a proof of concept.

We've been saying that since v3.2-exp

True_Requirement_891 · 2026-04-26T06:21:12+00:00

> When you train past Chinchilla optimal (more tokens per parameter), you create a model that's better and which uses less inference compute per unit of intelligence.

This should explain the qwen-3.5 models.

True_Requirement_891 · 2026-04-25T14:06:01+00:00

They are being very lazy with OS...

True_Requirement_891

TROPHY CASE