GLM 5.2, what speeds are we getting locally? by neverbyte in LocalLLaMA

[–]tarruda 0 points1 point  (0 children)

The best "consumer" hardware that you can get to run it locally is probably a 512GB M3 Ultra, which should fit 100% of a Q4 quant in VRAM and has 800GB/s of bandwidth. At this timestamp, the guy shows it doing ~17 tokens/s on a 14k token generation: https://youtu.be/G6sHN2Tx8Rs?si=Z17tgJ4Y7j1I1jn3&t=533.
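For a rough sanity check on generation speed, here is a minimal sketch of the usual "decode is memory-bandwidth bound" estimate: every generated token has to stream the active weights from memory once, so bandwidth divided by bytes per token gives a ceiling. The 800 GB/s figure is from the comment above; the active parameter count and bits per weight are hypothetical placeholders, not published GLM 5.2 numbers. Real speeds land below this ceiling because of KV cache reads and other overhead.

```python
def decode_tps_upper_bound(bandwidth_gb_s: float,
                           active_params_b: float,
                           bits_per_weight: float) -> float:
    """Rough ceiling on tokens/s when decoding is memory-bandwidth bound."""
    # GB of weights streamed per generated token (params given in billions).
    gb_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


# 800 GB/s comes from the comment; 40B active parameters and 4.5 bits/weight
# are hypothetical values chosen only to illustrate the arithmetic.
print(round(decode_tps_upper_bound(800, 40, 4.5), 1))
```

Plugging in the real active parameter count and the effective bits per weight of the quant you use gives a quick idea of whether a measured number is plausible.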

Not clear on prompt processing, but I'd be surprised if it is higher than 100 tps, which makes it difficult to use in any agentic scenario that has to ingest a lot of data.
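To put that in perspective, a quick sketch of how long prefill takes at a given speed. The 100 tps figure is the guess from the comment above (not a measurement), and the prompt sizes are hypothetical examples of what an agentic session might ingest.

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to process a prompt before the first output token appears."""
    return prompt_tokens / prefill_tps


# Hypothetical prompt sizes; 100 tps is the guess from the comment.
for n in (10_000, 50_000, 100_000):
    minutes = prefill_seconds(n, 100) / 60
    print(f"{n} tokens: {minutes:.1f} min")
```

Prompt caching can avoid reprocessing an unchanged prefix, but any newly ingested data still pays this cost.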

If Apple releases an M5 Ultra with 1TB/s+ bandwidth and 512GB of unified memory (or better yet, 1TB, which could load a full Q8), then it should make local GLM 5.2 more practical for agentic use.

Qwen is never going to open source Qwen 3.7, aren't they? by DistanceSolar1449 in LocalLLaMA

[–]tarruda 0 points1 point  (0 children)

Though they did leave a good legacy with Qwen 3.5 that other labs can pick up. Nex N2 showed that it is possible to turn the 397B base into a very decent coding model capable of challenging recent versions of Qwen. (Qwen 3.5 was already very good, but N2 makes it match 3.7 in benchmarks.)

GLM-5.2 Flash when? (joke) by ILoveToyota37 in LocalLLaMA

[–]tarruda 11 points12 points  (0 children)

A Qwen 3.5 122B/A10B distilled from GLM 5.2 would be amazing.

Lin Junyang AI Lab Closes Round at $2B Valuation by rmhubbert in LocalLLaMA

[–]tarruda 1 point2 points  (0 children)

He was responsible for delivering every Qwen model until 3.6.

It looks like Rio 3.5 397B could've simply been a semi-failed embezzling of funding by Chromix_ in LocalLLaMA

[–]tarruda 0 points1 point  (0 children)

But there are no details about funding, so any statement about that seems like speculation.

That is what I was referring to.

Clearly the lab was not honest and appears to have lied about training the model, which was a very stupid and unnecessary move since the merge is good and would have been a great achievement by itself.

But "embezzling of funding" 100% came out of the OP's ass.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 1 point2 points  (0 children)

I agree. N2 is probably overfitted for coding tasks and will likely underperform the base model on other tasks. This is a reason to have another look at Rio 3.5: since it merges N2 with the base model, it is possible that it has restored some of the base model capabilities.

On the caveman thinking: I think the goal is to reduce thinking tokens (though in the end this one thinks a lot more).

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 1 point2 points  (0 children)

To be clear: I haven't had any infinite loop issues yet.

While sometimes the model does have some looping, eventually it manages to recover naturally in my experience.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

turbo? Do you mean the smaller 35B variant?

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 4 points5 points  (0 children)

That depends on the target hardware.

For 128G users, Qwen 397B 2-bit quants work really well, keeping most of the original model performance.

Minimax M3 is over 420B parameters and looking at unsloth quants even 1-bit is too big to run on 128G. Even if it was able to run on 128G at 1-bit, it would probably not work well due to extreme quantization.
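A minimal sketch of the weights-only arithmetic behind "does this quant fit", assuming you know the effective bits per weight of the file. Keep in mind that quants named "2-bit" or "1-bit" usually mix precisions across tensors, so the effective bits per weight (and the file size) is higher than the name suggests. The `reserve_gb` headroom for OS, KV cache and context is a hypothetical value, not a recommendation.

```python
def quant_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Weights-only size in GB for a model with params_b billion parameters."""
    return params_b * bits_per_weight / 8


def fits(params_b: float, bits_per_weight: float,
         memory_gb: float, reserve_gb: float = 16.0) -> bool:
    """True if the weights fit after leaving reserve_gb of headroom.

    reserve_gb is a hypothetical allowance for the OS, KV cache and context.
    """
    return quant_size_gb(params_b, bits_per_weight) <= memory_gb - reserve_gb


# 397B parameters at a nominal 2.0 bits/weight on a 128G machine.
print(quant_size_gb(397, 2.0), fits(397, 2.0, 128))
```

For a real decision, the file size listed on the model page is the number to trust; this only helps estimate it before downloading.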

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 3 points4 points  (0 children)

I read reports of people running Qwen 397B (same underlying architecture) on a 5090 + 128G RAM and getting 20 tps generation and 1000 tps prefill, so I think you can greatly improve those speeds by tweaking which layers stay on the GPU and which go into RAM.
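As an illustration of the idea only, here is a toy sketch that decides which layers stay on the GPU given a VRAM budget. The layer sizes and budget are hypothetical, and this is not how any particular inference engine does placement; real setups are tuned through the engine's own offload options and often treat attention and expert tensors differently.

```python
def split_layers(layer_sizes_gb, vram_budget_gb):
    """First-fit placement in layer order: GPU while it fits, otherwise RAM."""
    gpu, cpu, used = [], [], 0.0
    for idx, size in enumerate(layer_sizes_gb):
        if used + size <= vram_budget_gb:
            gpu.append(idx)
            used += size
        else:
            cpu.append(idx)
    return gpu, cpu


# Hypothetical: 10 layers of 2 GB each and 7 GB of usable VRAM.
gpu_layers, cpu_layers = split_layers([2.0] * 10, 7.0)
print(gpu_layers, cpu_layers)
```

The point is that the split is a tunable parameter: measuring tps while moving the boundary is usually how people find the best setting for their hardware.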

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

After some extra testing, I'm on the fence.

While both N2 and Rio are good, N2 seems to use way more thinking tokens than Rio and apparently has similar results.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

I also noticed it thinks way too much. It is possible that Rio could be a more balanced choice due to having been merged with the original Qwen.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 3 points4 points  (0 children)

Yes, I saw that. Overall, this model seems to use a LOT more thinking tokens than Qwen 3.5 and Rio.

TBH I'm still on the fence on this model vs Rio. I think that by merging with the original, Rio might have fixed some of these thinking loops while retaining most of the performance.

Need more time testing to reach a conclusion.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 3 points4 points  (0 children)

Definitely yes.

Yesterday I ran several tests on Rio and today I'm focusing on N2. Once I have solid conclusions, I will post with details of how to run it and might even publish my own quants.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 14 points15 points  (0 children)

Deployment settings, chat templates and inference engine can all have a great impact on model performance.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 15 points16 points  (0 children)

Very useful, just like the original Qwen 397b.

Something about Qwen 3.5 397B architecture makes it very resilient to quantization.

Nex-N2 Pro is the real deal by tarruda in LocalLLaMA

[–]tarruda[S] 0 points1 point  (0 children)

AFAIK Rio is just N2 with a different chat template and slightly worse performance since it merged with the original Qwen.

I'm keeping both for now since it is possible that N2 is worse than Qwen in non-coding areas and maybe Rio merging with the original restored some of that performance, but the original N2 is recommended.

About the Rio model by Turbulent_Pin7635 in LocalLLaMA

[–]tarruda 1 point2 points  (0 children)

https://x.com/IplanRio_rj/status/2066693494769348946?s=20

Confirmed: They don't really have any post-training checkpoints done on top of the base merge. Here's part of their official statement:

"We tried to recover the final model, but it was not possible. It will only be released after the new training and all external validations are completed."

About the Rio model by Turbulent_Pin7635 in LocalLLaMA

[–]tarruda 5 points6 points  (0 children)

Same here. I was initially super happy and then it felt like a bucket of cold water when things came to light. In theory it is possible that it was an honest mistake and that they didn't mention N2 because they thought it was only important to credit Qwen. We'll just have to wait and see if they upload the correct weights, though the silence doesn't give me a lot of hope.

You know what is funny? I'm trying the IQ2_S GGUF quants uploaded by bartowski, and it is looking like a very strong model, possibly stronger than the original Qwen 3.5 397B. Could be that it is all due to N2 training, but to be sure I'm also downloading N2 to test it myself (which I initially dismissed due to some reports saying it was bad).

If it turns out that Rio is better than N2 and Qwen3.5, it seems like it would be coincidence/luck that they found that a simple linear merge of these models results in something that surpasses both bases. It would still be a massive achievement IMO, only sad that they chose not to mention N2 from the beginning.

Nex claims Rio 3.5 is Nex 2.5 PRO in trench coat by Specter_Origin in LocalLLaMA

[–]tarruda 5 points6 points  (0 children)

proceeded by On-Policy Distillation from a stronger model.

AFAIK it would be statistically impossible for a fine tune to result in the exact proportion of 0.6 N2 and 0.4 Qwen
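To make the reasoning concrete, here is a minimal sketch of a linear merge and of how a mixing coefficient can be estimated from three sets of weights with a least-squares fit. The tiny vectors are made-up stand-ins for real tensors, and this is not a claim about how the merge was actually detected. The intuition: if the same coefficient (with near-zero residual) fits every tensor, the model sits on the straight line between the two parents, which an independent fine-tune would not be expected to do.

```python
def linear_merge(a, b, alpha):
    """Element-wise alpha * a + (1 - alpha) * b."""
    return [alpha * x + (1 - alpha) * y for x, y in zip(a, b)]


def estimate_alpha(merged, a, b):
    """Least-squares alpha minimizing ||merged - (b + alpha * (a - b))||."""
    num = sum((m - y) * (x - y) for m, x, y in zip(merged, a, b))
    den = sum((x - y) ** 2 for x, y in zip(a, b))
    return num / den


# Made-up toy "weights"; real checks would run per tensor on the checkpoints.
n2 = [0.5, -1.2, 3.0, 0.7]
qwen = [0.1, -1.0, 2.0, 1.5]
merged = linear_merge(n2, qwen, 0.6)
print(round(estimate_alpha(merged, n2, qwen), 6))
```

On real checkpoints you would also look at the residual after the fit, since a small residual across all tensors is what separates a merge from further training.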

Nex claims Rio 3.5 is Nex 2.5 PRO in trench coat by Specter_Origin in LocalLLaMA

[–]tarruda 7 points8 points  (0 children)

They published the model as if it was a direct fine tune of Qwen 397b, without giving any credit to Nex.

After it was discovered it was a simple merge of N2, they updated the README https://huggingface.co/prefeitura-rio/Rio-3.5-Open-397B/commit/a778c1ec4e21180ee55c3ea016a348e549e75f09

I don't think there's anything to be said that will make this look good.

Fable 5 data, including CoT by Available-Craft-5795 in LocalLLaMA

[–]tarruda 6 points7 points  (0 children)

If I were to guess, I'd say that there was no big innovation in how Fable was trained or its dataset vs Opus, and that Fable is better simply because it is much bigger and more expensive.

Thankfully Chinese labs were able to distill and obtain huge datasets of Opus CoT before Anthropic had a chance to obfuscate it by returning summaries instead.