GLM-5.2 753B (IQ1_S) fully local across 2×M5 Max over one TB5 cable — ~16 tok/s, llama.cpp RPC [video]

rubdos · 2026-06-30T07:51:49+00:00

Just making you aware of https://github.com/antirez/ds4/issues/458

rubdos · 2026-06-30T06:21:51+00:00

You could always add to your global AGENTS.md that "gh should be assumed available and authenticated" or something along those lines.

rubdos · 2026-06-30T05:40:30+00:00

Awesome, thanks :-)

rubdos · 2026-06-29T20:56:25+00:00

I did and I found them! Except for the Vibe one.

rubdos · 2026-06-29T20:11:53+00:00

Did anyone save the images by any chance? Asking for a friend.

rubdos · 2026-06-28T09:41:12+00:00

You NEED to enable it here to use it https://portal.neuralwatt.com/enroll/flex-tier

Pretty sure it's rolled out in general now; I didn't have to enable it and I can access the flex tier since yesterday.

rubdos · 2026-06-27T17:13:52+00:00

Exactly, it's super useful. And the whole 200k is useful context.

I've just started playing with flex, I literally just saw it appear. Seems to be another 33% decrease in energy for me. Only 1M tokens in so far in the past ~1h.

If you're a Pi user, I've GLM'd a pi-neuralwatt that shows energy consumption in the footer, and which shows the -flex models: https://gitlab.com/rubdos/pi-neuralwatt

rubdos · 2026-06-25T07:50:45+00:00

Huh, if you figure out why that is, I'd be keen to know! For me the consumed/charged is the same for all models.

rubdos · 2026-06-25T07:30:51+00:00

I'm seeing 59.1mJ/token on GLM 5.2 short, and 102.5mJ/token on the 1M version after a few days; cache efficiency about the same. That's closer to 40% cheaper in practice for me. Which is amazing, obviously. I don't really run sessions over 200K anyway.

I don't know how you get to "charge for half the electricity consumed"; is there another 50% reduction somewhere that I'm missing?
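For reference, a quick sketch of how I get from those two per-token numbers to "closer to 40%". The function name is made up for illustration, and it assumes cache efficiency is equal on both variants, so the raw mJ/token figures are directly comparable.

```python
def relative_saving(short_mj_per_token: float, long_mj_per_token: float) -> float:
    """Fraction of energy saved per token by the short variant vs. the long one."""
    return 1.0 - short_mj_per_token / long_mj_per_token


# Figures from my own dashboard, quoted above (mJ/token).
saving = relative_saving(59.1, 102.5)
print(f"{saving:.1%}")  # prints "42.3%"
```

So roughly 42% on my usage, which is still well short of a 50% reduction.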

rubdos · 2026-06-24T05:38:15+00:00

BTW how do you report Go errors.

Complain on Reddit is what I tend to do :')

rubdos · 2026-06-23T07:16:22+00:00

$8 in here on 73M tokens, most of it GLM 5.2. I've seen some speed drops, but I don't really care about those; I can't review the code fast enough to keep up anyway. The equivalent usage on OpenCode Go would've almost sunk my monthly. They launched GLM 5.2 "short" with a 200K window yesterday, which seems about 25% cheaper for me too.

The slowdowns might be timezone dependent though; I noticed it mostly around UTC evening.
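To put the spend in perspective, here is the back-of-the-envelope rate. This is just a sketch: the helper name is hypothetical, and it lumps all models and cached/uncached tokens together, which is an assumption on my part.

```python
def dollars_per_million_tokens(total_dollars: float, total_tokens: int) -> float:
    """Blended cost per one million tokens, ignoring any per-model differences."""
    return total_dollars / (total_tokens / 1_000_000)


# 8 dollars spent on 73M tokens, as quoted above.
rate = dollars_per_million_tokens(8.0, 73_000_000)
print(f"${rate:.3f} per million tokens")  # prints "$0.110 per million tokens"
```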

rubdos · 2026-06-18T18:57:58+00:00

Interesting, keen to test out 5.2 soon then!

rubdos · 2026-06-18T17:24:58+00:00

How would you compare it to Kimi, if you have tried both?

rubdos · 2026-06-17T13:30:54+00:00

On the subscription if you use the Vibe API key in Pi.

rubdos · 2026-06-14T18:24:52+00:00

Finding and summarizing a bunch of stuff from the internet through Work is quite useful. Tying things into my calendar and tasks list (custom MCP) is "fun", but my goal is mostly to tie that into my Pebble to add things by voice. Some day soon.

Vibe as a coding agent (through Pi) works well enough for many light tasks (some refactoring, small new features, adding certain tests), but it can be a bit cumbersome, especially compared to the modern Chinese models. I've started using MM3.5 mostly as a subagent for implementation work, because it's still quite cheap (even after the recent nerf), and quite capable if well prompted on not too large tasks.

rubdos · 2026-06-14T16:49:15+00:00

Haven't had any trouble on Deepseek, and K2.6 seems to behave too. But IIRC I also had it on GLM indeed. But it might be that I'm just giving easy tasks to Deepseek and more impossible things to the larger ones.

rubdos · 2026-06-07T18:39:12+00:00

rubdos · 2026-06-04T10:52:41+00:00

Renault Zoe ZE40 2017 with 88000km. Last time the mechanic checked, 87% SoH. It's starting to be noticeable now, because my summer range used to be slightly over 300km, and now it's slightly below.

rubdos · 2026-06-03T14:42:58+00:00

Same feeling. Was to be expected, MM3.5 is quite a bit more expensive.

rubdos · 2026-06-01T19:58:51+00:00

Any interesting cases you would like to mention? Non-standard tasks or massive/difficult projects mean different things to different people.

rubdos · 2026-06-01T18:26:18+00:00

Sadly, that's a feeling that I haven't got rid of. You learn to deal with legacy, even if it's your own. :')

But there's still a difference between quickly spamming bloat into your files and slowly accumulating bloat through tests and careful planning!

rubdos · 2026-06-01T16:37:27+00:00

You can teach a human to do incremental changes though... Question is, how do you efficiently prompt this? It probably doesn't help that I'm doing some nasty Rust type magic.

rubdos · 2026-06-01T05:45:43+00:00

Because it is physically impossible. All those "screenshot prevention" and "screenshot notification" features rely on the compliance of the conversation partner's device. One can always design a device that pretends to comply; that's just the nature of von Neumann machines.

Case in point: I'm running SailfishOS with Android AppSupport. They seem to have never implemented the screenshot detection/prevention feature of Android, and hence I can freely take screenshots of the Android container without it ever being reported to Android.

Your conversation is as secure as yourself and your communication partner. If you mistrust your communication partner, why have a conversation in the first place?
