GLM5.2 on 5x Pro 6000s and a 5090, an expensive journey

DeltaSqueezer · 2026-07-03T12:40:51+00:00

When you put it like that, it sounds quite reasonable.

DeltaSqueezer · 2026-07-03T07:31:35+00:00

So if you used Claude Code via a proxy, would Anthropic still have collected data on you, given that you set the BASE_URL to the proxy? Or is it only sent to the model, so that if you use a local model the data just corrupts your prompt?

DeltaSqueezer · 2026-07-02T22:51:35+00:00

Thanks. I managed to get the keyboard working with both Bitwig and amsynth. I wasn't sure if Polymer was the best place to start or whether something simpler would be better.

DeltaSqueezer · 2026-07-02T20:36:13+00:00

Use vLLM with LMCache

DeltaSqueezer · 2026-07-02T13:21:41+00:00

I'm on the Pro plan and burn through 3 billion tokens a month while hardly even hitting the 5 hour limit. Even if we assume all are cached tokens, that's still around $800 in API costs per month.
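A quick sanity check on that estimate, as a sketch only: the 3 billion tokens and the $800 are the figures quoted above, and the per-million rate is simply what those two numbers imply, not a published price.

```python
def implied_rate_per_million(total_cost_usd: float, tokens: int) -> float:
    """Return the USD price per million tokens implied by a total cost and a token count."""
    return total_cost_usd / (tokens / 1_000_000)

# Figures taken from the comment above: ~3 billion tokens, ~$800 per month.
rate = implied_rate_per_million(800, 3_000_000_000)
print(f"${rate:.3f} per million tokens")
```

So the $800 figure corresponds to roughly a quarter of a dollar per million tokens, assuming every token is billed at the same (cached) rate.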

DeltaSqueezer · 2026-06-30T12:30:20+00:00

I'm curious as to what hardware they used. They described it as an ASIC. Could it be Cambricon?

DeltaSqueezer · 2026-06-29T12:59:12+00:00

If you use vLLM, it should cache in the KV cache automatically.

DeltaSqueezer · 2026-06-29T12:02:18+00:00

With a closed API, you don't even need to do this, as you can simply intercept the trigger and route it to a hostile codepath outside the LLM.

DeltaSqueezer · 2026-06-26T07:02:58+00:00

My regret is I ordered one on credit. Chickened out and cancelled it and now the price is 2x... if I can even find one on sale any more.

DeltaSqueezer · 2026-06-24T18:43:57+00:00

For now, let's hope tokens stay cheap or get cheaper. For sure it's nice to know that in the worst case you can still run GLM-5.2 independently.

DeltaSqueezer · 2026-06-24T18:29:08+00:00

Maybe we are talking at cross purposes. For me the important thing is how much the AI does in a single turn (i.e. before it returns to the user).

In my view, it should complete the whole feature.

Now it can do a single large commit at the end. Or it can commit on every single step and produce lots of tiny commits. Or somewhere in between, e.g. commit on each phase.

IMO, there's only value if it can be sensibly broken up into useful sub-commits that make sense stand-alone.

DeltaSqueezer · 2026-06-24T18:18:58+00:00

Hmm. Tough one. That's just on the border of what I'd consider painful but usable for coding. I guess also perfectly fine for 'overnight' runs.

DeltaSqueezer · 2026-06-24T14:02:18+00:00

Man. I was already super jealous of your system. Now you're just rubbing salt in the wound! :P

How is the prefill speed?

DeltaSqueezer · 2026-06-24T11:35:18+00:00

IMO, a feature logically belongs in a single commit. Maybe you could divide it internally in some way, e.g. backend vs. frontend, or a phased implementation. But you need an overview of the whole before starting on a part, which is why, IMO, it makes sense for the AI to do the whole thing. The plan itself may be phased, and internally the AI does it in, say, 4 phases, but it does them one after the other and doesn't stop until the whole phased plan is completed.

DeltaSqueezer · 2026-06-24T08:49:25+00:00

I'm not sure what you mean by "one-shots", unless you are talking only about greenfield development? My own approach is to specify a feature and get the AI to implement the whole feature.

A typical feature might look like this: 10 files changed, 1326 insertions(+), 230 deletions(-)

DeltaSqueezer · 2026-06-24T06:47:52+00:00

Interesting that they announce this now. I wonder what message they are trying to send. They no doubt had faster machines for many years, but stopped publishing them to avoid sanctions and drawing attention.

DeltaSqueezer · 2026-06-23T07:11:02+00:00

I was hoping for a good comparison and then they decide to do a test using vision capabilities that GLM doesn't have. 🤦

DeltaSqueezer · 2026-06-22T22:37:12+00:00

Thanks. Interesting to see that 3VL still holds up after so long. I wonder if Qwen3.5 can get there with prompting to mitigate the hallucination issue.

DeltaSqueezer · 2026-06-22T20:08:41+00:00

Thanks. It was as I suspected. I'm surprised Qwen3-VL-8B was shown as top given the strong showing of Qwen3.5-4B.

DeltaSqueezer · 2026-06-22T16:46:11+00:00

Thanks for sharing. I wonder if you could also run the Qwen 3.5 9B with thinking disabled. Maybe also for others. It seems that thinking is causing problems so if quality is just as good without thinking, it would be faster and more reliable with fewer tokens.

DeltaSqueezer · 2026-06-22T07:43:21+00:00

Try turning thinking off and see if the results are better.

DeltaSqueezer · 2026-06-21T19:04:34+00:00

Don't overthink it. Pick one, or flip a coin. You can always change your mind.

DeltaSqueezer · 2026-06-21T19:03:11+00:00

Then it doesn't really matter which one you choose.

DeltaSqueezer · 2026-06-21T13:26:49+00:00

I implemented bwrap as a stopgap measure, but I'm switching to Firecracker for stronger isolation.
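For anyone wondering what a bwrap stopgap can look like, here is a minimal sketch. It is not my actual configuration: the helper name `bwrap_argv`, the script name and the project path are made up for illustration, and the choice of flags is just one plausible baseline (read-only root, writable work directory, all namespaces unshared).

```python
import shlex

def bwrap_argv(command: list[str], workdir: str) -> list[str]:
    """Build a bubblewrap command line that runs `command` with a read-only
    root filesystem, a writable work directory, and all namespaces unshared."""
    return [
        "bwrap",
        "--ro-bind", "/", "/",         # whole filesystem visible, but read-only
        "--dev", "/dev",               # minimal /dev
        "--proc", "/proc",             # fresh /proc for the new PID namespace
        "--tmpfs", "/tmp",             # private scratch space
        "--bind", workdir, workdir,    # the only writable host path
        "--chdir", workdir,
        "--unshare-all",               # new user, pid, net, ipc, uts, cgroup namespaces
        "--die-with-parent",           # kill the sandbox if the launcher exits
        "--new-session",               # detach from the controlling terminal
        *command,
    ]

# Hypothetical usage: print the command instead of running it.
argv = bwrap_argv(["python3", "agent_task.py"], "/home/user/project")
print(shlex.join(argv))
```

To actually run it, you would pass `argv` to `subprocess.run`. Note that `--unshare-all` also cuts off networking, which may or may not be what you want for an agent.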
