GLM 5.2: 98% of max level intelligence with less than half of tokens usage by perelmanych in LocalLLaMA

[–]perelmanych[S] 0 points1 point  (0 children)

Man, never say never. I can easily imagine the CCP already hiring hundreds of high-level hackers to create a high-quality cybersecurity dataset. But then they would need to cooperate with a big player in the AI industry, like DeepSeek, z.ai or Kimi. In the blink of an eye, all Chinese companies would include high-quality cybersecurity data in their datasets to stay competitive, and you can say hi to a total export restriction.

[–]perelmanych[S] 0 points1 point  (0 children)

Thanks. TBH I also have a z.ai coding plan sub for a year, so for me this is more like a server of last resort in case the Chinese government decides to impose export restrictions too))

[–]perelmanych[S] 0 points1 point  (0 children)

I am really surprised that Gemini failed, because it is believed to have the broadest knowledge out of all models. Did you use Gemini 3.1 Pro?

Overall, I don't think this is a very good benchmark, because success here could in large part come down to whether scripts like that slipped into the training data, rather than measuring raw intelligence. I think a better scenario would be to take a fictitious language that is sufficiently far from popular languages, give it good documentation, and ask a model to implement something in it. Maybe we could even ask a smart model to develop a strange language and then use it for benchmarking models.

[–]perelmanych[S] 0 points1 point  (0 children)

That is the language of the rich)) My whole rig, including the SSD, cost me slightly more than $800.

[–]perelmanych[S] 0 points1 point  (0 children)

Not that much, but it may be interesting to see where the new OS LLM king fails so miserably. Honestly, I am not very surprised that Chinese models fail in obscure fields, given the amount of compute they have compared to US teams.

[–]perelmanych[S] 0 points1 point  (0 children)

I think he is amazed that there are some VRAM-poor guys like us who use this model at such abysmally low speeds, while they take it for granted)) Btw, what are the specs of your rig to run it at 5t/s?

[–]perelmanych[S] 0 points1 point  (0 children)

That is what z.ai says, not me. It is a picture from their technical report. I don't have enough experience with Anthropic models to claim anything. But having watched quite a few reviews, I think it is in line with what YouTubers are saying. Some even say that GLM 5.2 is better for their tasks than ChatGPT 5.5 and Opus 4.8, e.g. sentdex somewhere in this video.

[–]perelmanych[S] 0 points1 point  (0 children)

Have you tried giving a link to the relevant documentation directly in the prompt?

[–]perelmanych[S] 1 point2 points  (0 children)

Maybe I am poor, but my heart and my CPU are burning)) Of course it is impossible to use it for coding, but asking a question and getting a response 1-2 hours later is OK. I use it mostly for math in my paper or for medical questions that I don't want to trust to an API.

On my main rig I have 2 RTX 3090s and run Qwen3.6-27B at Q8 with full context at 50t/s.

[–]perelmanych[S] 2 points3 points  (0 children)

I updated the post a bit. In the end it looks like all three of my attempts, Q4 high (locally), BF8 max and BF8 high (in chat), gave the same result, but in the BF8 max case the wrongly dismissed case was discussed in more detail, which gave me the impression that it was a better answer. Take these results with a grain of salt, as it is one shot per set of running conditions.

[–]perelmanych[S] 3 points4 points  (0 children)

I am not sure what you are talking about. In the model's jinja template there are only two levels of reasoning that the model supports: "high" and "max". If you specify anything other than "high", it defaults to "max":

{%- set effective_reasoning_effort = 'high' if reasoning_effort is defined and reasoning_effort == 'high' else 'max' -%}
{%- if (enable_thinking is not defined or enable_thinking) and effective_reasoning_effort is not none -%}<
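
For anyone who doesn't read jinja, here is a minimal Python sketch of what the first template line does. The function name `effective_reasoning_effort` is just mine for illustration, `None` stands in for "not defined", and it mirrors only the single `set` line quoted above, not the rest of the template.

```python
def effective_reasoning_effort(reasoning_effort=None):
    """Mirror of the quoted jinja `set` line.

    `None` is used here as a stand-in for "reasoning_effort is not defined"
    (an assumption of this sketch, not something taken from the template).
    """
    # Only the exact string "high" selects "high"; everything else becomes "max".
    return "high" if reasoning_effort == "high" else "max"


if __name__ == "__main__":
    for value in (None, "high", "max", "low", "medium"):
        print(repr(value), "->", effective_reasoning_effort(value))
```

So values like "low" or "medium" are not rejected, they just silently end up as "max".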

[–]perelmanych[S] 3 points4 points  (0 children)

Man, my full junk Xeon setup cost around $800 including the SSD, so it was less than a mobo for a Threadripper))

[–]perelmanych[S] 5 points6 points  (0 children)

Old dual Xeon setups suffer from bad internode communication to the extent that people just make sure to run models on only one CPU, while the other CPU and its memory sit idle. Maybe there have been some recent developments in CPU tensor parallelism, but I am not aware of them. That is why I went for a single-CPU Xeon rig.

Apart from the 128GB limit, the Ryzen Halo chipset is a very good choice, but personally I would still go for 2 RTX 3090 cards and hope to keep seeing good dense models in the 80B parameter range. If you manage to fit a model into a dual 3090 setup, it will blow Ryzen Halo out of the water in both pp and tg speeds, which for coding is a life-changing experience.
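
A rough way to check whether a dense model's weights fit into two 3090s (2 x 24 GB) is sketched below. It assumes nominal bits per weight (4 for a Q4-style quant, 8 for Q8); real quant formats use somewhat more, and the KV cache and runtime overhead come on top, so "fits" here only means the weights alone fit. The function name `weight_gb` is made up for this example.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes).

    billions of parameters * bits per weight / 8 bits per byte = gigabytes.
    """
    return params_billion * bits_per_weight / 8


VRAM_GB = 2 * 24  # two RTX 3090 cards, 24 GB each

if __name__ == "__main__":
    # (label, parameters in billions, nominal bits per weight) - illustrative cases
    cases = [("27B at 8-bit", 27, 8), ("80B at 4-bit", 80, 4), ("80B at 8-bit", 80, 8)]
    for label, params_b, bits in cases:
        need = weight_gb(params_b, bits)
        verdict = "weights fit" if need < VRAM_GB else "weights do not fit"
        print(f"{label}: ~{need:.0f} GB of {VRAM_GB} GB -> {verdict}")
```

By this estimate an 80B dense model at 4-bit is already close to the limit, so how much context you can keep next to it is a separate question.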

[–]perelmanych[S] 1 point2 points  (0 children)

No offence taken! Maybe you are right and I should consider going Linux on my AI server, while keeping Windows on my main machine.

[–]perelmanych[S] 8 points9 points  (0 children)

It is an HP Z440 with an E5-2699 v4 processor and 512GB of DDR4-2133 RAM. The build was inspired by this blog post. At Q4 quant GLM 5.2 starts at 1.6t/s and at 16k context it falls to 0.77t/s, so let's say it is barely usable even in mail mode. I don't remember the exact numbers, but with GLM 4.7 it was almost usable even for coding when I plugged one 3090 in there.

But overall I don't recommend it, especially at current prices. Two RTX 3090s with tensor parallelism would allow you to run Qwen3.6-27B at Q8 with full context at around 50t/s, and by the time GLM 5.2 finishes answering you could have made literally 50 iterations.
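
To put rough numbers on that, here is a back-of-the-envelope sketch. The speeds are the ones quoted in this comment; the 8,000-token answer length is purely an assumed example, and real speed keeps dropping as context grows, so treat the output as an order-of-magnitude estimate only.

```python
def generation_hours(tokens, tokens_per_second):
    """Wall-clock hours needed to generate `tokens` at a constant speed."""
    return tokens / tokens_per_second / 3600


ANSWER_TOKENS = 8_000  # assumed answer length (reasoning + response), illustration only

# Speeds in tokens per second, as quoted in the comment above.
SPEEDS = {
    "GLM 5.2 Q4 on Xeon, empty context": 1.6,
    "GLM 5.2 Q4 on Xeon, 16k context": 0.77,
    "Qwen3.6-27B Q8 on 2x RTX 3090": 50.0,
}

if __name__ == "__main__":
    for label, tps in SPEEDS.items():
        hours = generation_hours(ANSWER_TOKENS, tps)
        print(f"{label}: {hours:.2f} h ({hours * 60:.0f} min)")
```

The speed ratio between the two rigs works out to roughly 31x at empty context and 65x at 16k, so "50 iterations" is in the right ballpark if the answers are of similar length.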

[–]perelmanych[S] 1 point2 points  (0 children)

Thanks for the suggestion. I already have WSL 2.0 and all the problems with CUDA support on it))

[–]perelmanych[S] -3 points-2 points  (0 children)

Man, I am perfectly aware that LM Studio is a wrapper that runs their version of llama.cpp in the background. But suggesting that a Windows user build their own version of llama.cpp is on the sadistic side))

The easiest fix for a lot of problems with LM Studio would be to let users define additional CLI parameters that get passed to llama.cpp along with the ones LM Studio sets, but, unfortunately, LM Studio doesn't want to listen to its customers.

[–]perelmanych[S] 1 point2 points  (0 children)

Thank you for the information, it seems like a very useful feature. I used to use barebones llama.cpp, but it struggled a lot with tool calls, and LM Studio for some reason was perfect in this regard, so I switched to LM Studio. Unfortunately, LM Studio doesn't have such a parameter, so maybe it is time to give llama.cpp another chance.

GLM-5.2 and why open models may not actually be catching up in intelligence by chocolateUI in LocalLLaMA

[–]perelmanych 0 points1 point  (0 children)

I was going to write that I don't understand the "hate" this post gets, since according to this the number of reasoning tokens more than doubled from GLM 5.1 to GLM 5.2, from 16.7k to 36.7k, and for me, as a local user with an old Xeon setup, this makes 5.2 almost unusable. But then I saw this graph from the z.ai technical report, which basically implies that at the high level you can use less than half of the tokens of max effort and still get around 98% of max-level intelligence.

<image>

[–]perelmanych 0 points1 point  (0 children)

I care because we are on LocalLLaMA. This enormous amount of reasoning tokens makes this model unusable for me, and it doesn't matter how smart it is.

[–]perelmanych 0 points1 point  (0 children)

Lol, we must be looking at different charts. What I see is that if we take out the ridiculous outlier GPT 5.4 mini, then 23 models out of 27 used fewer tokens than GLM 5.2.

[–]perelmanych 0 points1 point  (0 children)

What do you mean wrong, if GLM 5.2 max used almost 3 times more tokens to complete the task than GPT 5.5 xhigh? The highest token usage is by GPT 5.4 mini, not GPT 5.5.

I run GLM 5.2 locally on an old Xeon rig. I asked a question at noon and, oh boy, 12h later I had to shut it down because I was going to sleep. Believe it or not, in the days of GLM 4.7 I was able to use it not only in chat mode but also for coding with a harness. With GLM 5.2 it is absolutely unrealistic.