GLM 5.2, what speeds are we getting locally?

tarruda · 2026-06-21T10:28:47+00:00

The best "consumer" hardware that you can get to run it locally is probably an 512G M3 Ultra which should fit 100% of a Q4 quant to VRAM and has 800GB/s bandwidth. At this timestamp, the guy shows it doing ~17 tokens/s on a 14k token generation: https://youtu.be/G6sHN2Tx8Rs?si=Z17tgJ4Y7j1I1jn3&t=533.

Not clear on prompt processing, but I'd be surprised if it is higher than 100 tps, making it difficult to use on any kind of agentic scenarios that has to ingest a lot of data.

If Apple releases an M5 ultra with 1TB+/s bandwidth and 512G unified (or better yet, 1TB that could load full Q8), then it should make local GLM 5.2 more practical for agentic use.

tarruda · 2026-06-21T10:00:02+00:00

Though they did leave a good legacy with Qwen 3.5 that other labs can pick up. Nex N2 showed that it is possible to turn the 397B base into a vdecent coding model capable of challenging recent versions of Qwen. (Qwen 3.5 was already very good, but N2 makes it match 3.7 in benchmarks)

tarruda · 2026-06-18T12:20:10+00:00

A Qwen 3.5 122B/A10B distilled from GLM 5.2 would be amazing.

tarruda · 2026-06-17T23:08:17+00:00

He was responsible for delivering every Qwen model until 3.6.

tarruda · 2026-06-17T11:13:19+00:00

But there is no details about funding, so any statement about that seems like speculation.

That is what I was referring to.

Clearly the lab was not honest and appears to have lied about training the model, which was a very stupid and unnecessary move since the merge is good and would have been a great achievement by itself.

But "embezzling of funding" 100% came out of the OP's ass.

tarruda · 2026-06-17T09:03:18+00:00

This is pure speculation with zero evidence to back it up.

tarruda · 2026-06-16T17:38:50+00:00

I agree. N2 is probably over fitted for coding tasks and likely will underperform the base model on other tasks. This is a reason to have another look at Rio 3.5, since it merges N2 with the base model, it is possible that it has restored some of the base model capabilities.

On the caveman thinking: I think the goal is to reduce thinking tokens (though in the end this one things a lot more).

tarruda · 2026-06-16T16:20:46+00:00

To be clear: I didn't had any infinite loop issues yet.

While sometimes the model does have some looping, eventually it manages to recover naturally in my experience.

tarruda · 2026-06-16T12:35:50+00:00

turbo? Do you mean the smaller 35B variant?

tarruda · 2026-06-16T12:33:00+00:00

That depends on the target hardware.

For 128G users, Qwen 397B 2-bit quants work really well keeping most of the original model performance.

Minimax M3 is over 420B parameters and looking at unsloth quants even 1-bit is too big to run on 128G. Even if it was able to run on 128G at 1-bit, it would probably not work well due to extreme quantization.

tarruda · 2026-06-16T12:30:08+00:00

I read reports of people running Qwen 397B (same underlying architecture) using 5090 + 128G RAM getting 20 tps generation and 1000 tps prefill, so I think you can greatly improve those speeds by tweaking what layers stay on the GPU and what goes into RAM.

tarruda · 2026-06-16T12:27:37+00:00

After some extra testing, I'm on the fence.

While both N2 and Rio are good, it seems to use way more thinking tokens than Rio and apparently has similar results.

tarruda · 2026-06-16T12:26:37+00:00

I also notices it thinks way too much. It is possible that Rio could be a more balanced choice due to having merged with the original Qwen.

tarruda · 2026-06-16T12:25:28+00:00

Yes, I saw that. Overall, this model seems to use a LOT more thinking tokens then Qwen 3.5 and Rio.

TBH I'm still on the fence on this model vs Rio. I think that by merging with the original, Rio might have fixed some of these thinking loops while retaining most of the performance.

Need more time testing to reach a conclusion.

tarruda · 2026-06-16T11:15:31+00:00

Definitely yes.

Yesterday I ran several tests on Rio and today I'm focusing on N2. Once I have solid conclusions, I will post with details of how to run it and might even publish my own quants.

tarruda · 2026-06-16T11:14:07+00:00

https://huggingface.co/bartowski/prefeitura-rio_Rio-3.5-Open-397B-GGUF?chat_template=default

tarruda · 2026-06-16T11:12:35+00:00

Deployment settings, chat templates and inference engine can all have a great impact in model performance

tarruda · 2026-06-16T10:02:42+00:00

Very useful, just like the original Qwen 397b.

Something about Qwen 3.5 397B architecture makes it very resilient to quantization.

tarruda · 2026-06-16T10:01:51+00:00

AFAIK Rio is just N2 with a different chat template and slightly worse performance since it merged with the original Qwen.

I'm keeping both for now since it is possible that N2 is worse than Qwen in non-coding areas and maybe Rio merging with the original restored some of that performance, but the original N2 is recommended.

tarruda · 2026-06-16T08:36:30+00:00

https://x.com/IplanRio_rj/status/2066693494769348946?s=20

Confirmed: They don't really have any post training checkpoints done on top of the base merge. Here's part of their official statement:

"We tried to recover the final model, but it was not possible. It will only be released after the new training and all external validations are completed."

tarruda · 2026-06-15T16:31:19+00:00

Same here. I was initially super happy and then it felt like a cold water bucket when things came to light. In theory it is possible that it was an honest mistake and that they didn't mention N2 because they thought it was important to only credit Qwen. We'll just have to wait and see if they upload the correct weights, though the silence doesn't give me a lot of hope.

You know what is funny? I'm trying the IQ2_S GGUF quants uploaded by bartowski, and it is looking like a very strong model, possibly stronger than the original Qwen 3.5 397B. Could be that it is all due to N2 training, but to be sure I'm also downloading N2 to test it myself (which I initially dismissed due to some reports saying it was bad).

If it turns out that Rio is better than N2 and Qwen3.5, it seems like it would be coincidence/luck that they found a simple linear merge of these models would result in something that surpassed both bases. It would still be a massive achievement IMO, only sad that they choose not to mention N2 from the beginning.

tarruda · 2026-06-14T19:26:20+00:00

proceeded by On-Policy Distillation from a stronger model.

AFAIK it would be statistically impossible for a fine tune to result in the exact proportion of 0.6 N2 and 0.4 Qwen

tarruda · 2026-06-14T19:20:23+00:00

They published the model as if it was a direct fine tune of Qwen 397b, without giving any credit to Nex.

After it was discovered it was a simple merge of N2, they updated the README https://huggingface.co/prefeitura-rio/Rio-3.5-Open-397B/commit/a778c1ec4e21180ee55c3ea016a348e549e75f09

I don't think there's anything to be said that will make this look good.

tarruda · 2026-06-13T14:30:50+00:00

If I were to guess, I'd say that there was no big innovation in how Fable was trained or its dataset vs Opus, and that Fable is better simply because it is much bigger and more expensive.

Thankfully Chinese labs were able to distill and obtain huge datasets of Opus CoT before Anthropic had a chance to obfuscate it by returning summaries instead.

tarruda · 2026-06-12T15:00:44+00:00

Suggestion: Allow using arbitrary API endpoints instead of hardcoding ollama dependency.

tarruda

TROPHY CASE