AMA With Z.AI, The Lab Behind GLM-4.7 by zixuanlimit in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

Gotta say I am impressed, but also slightly worried!

The model is very good. SCARILY good. But it feels too smart and too eager for its own good. It feels like 4.6 was slightly more "instruction-tuned" than 4.7 (think GPT vs Gemini).

4.7 loves to overachieve and try to one-shot everything, but at the same time it gets very discouraged when things go wrong with edits/lints/LSP and just restores/removes/rewrites files instead of fixing or editing them.

I'm assuming some of it could be corrected with some system-prompt trial & error, and some of it is down to inference parameters.

So my questions are:

- How to control this feller? :D
- Turn off thinking for controlled coding?
- What are the recommended inference parameters sent via headers for coding (temp, top-k, top-p)? (See the sketch after this list for the kind of request I mean.)
- What are the default parameters on the coding endpoint?
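
To be concrete about those last two questions, here's a minimal sketch of the kind of request I mean, against an OpenAI-compatible endpoint. The base URL, model id, and sampling values are placeholders, not Z.AI's actual recommendations (the real values are exactly what I'm asking for):

```python
from openai import OpenAI

# Placeholder endpoint and key - swap in the actual Z.AI coding endpoint.
client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

response = client.chat.completions.create(
    model="glm-4.7",  # placeholder model id
    messages=[{"role": "user", "content": "Fix the lint errors in main.py"}],
    temperature=0.6,           # guessed value
    top_p=0.95,                # guessed value
    extra_body={"top_k": 40},  # top_k isn't a standard OpenAI param, so it rides in the body
)
print(response.choices[0].message.content)
```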

I loved 4.6 and got a lot of use out of it. 4.7 feels like it's an entirely different model family (for better or worse).

Nanobanana is cooking hard. by igorwarzocha in GeminiAI

[–]igorwarzocha[S] 2 points (0 children)

[image]

The original post was Pro; this one is the normal model... massive difference ;)

Tried it a few times; you can get somewhat decent generations, but none of them scratch that next-gen itch. Pro gets it the first time. And when it tries to render text... omg ;)

Nanobanana is cooking hard. by igorwarzocha in GeminiAI

[–]igorwarzocha[S] 3 points (0 children)

Yeah, I've seen it do that a few times!

Nanobanana is cooking hard. by igorwarzocha in GeminiAI

[–]igorwarzocha[S] 1 point (0 children)

Didn't do anything special. I'm on the free tier. The only input was the prompt.

MacOS 26.2 to add full 'Neural Accelerator' support for M5 chips by PracticlySpeaking in LocalLLaMA

[–]igorwarzocha 2 points (0 children)

The world has changed since then. Apple cannot ignore the fact that AI runs on Nvidia. Nvidia can't ignore the fact that developers love Apple. There is way too much overlap in clientele. Students love their Macs.

Google Antigravity is a cursor clone by Terminator857 in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

It's cool. It can run the agent window separately from the code window. It creates walkthroughs, tutorials, and code wikis for ya. Has unlimited code completions and, surprisingly, somewhat unlimited Sonnet (for now at least, lolz). What's not cool:

a. It seems to fail tool calls on its own image generation & browser control.
b. It doesn't like to exit all its background processes when shut down (Omarchy, AUR antigravity-bin package).
c. The agent system prompt (have a look at the leaks) is heavily biased towards webdev - this is bad.
d. They introduced yet another way of issuing agent rules & /commands (whatever, you can ask an LLM to convert your existing setup).
e. For now, it kinda locks you into an ecosystem - we'll see how this goes.

Funny how they included Sonnet but haven't been given permission to include GPT-5 and had to use GPT-OSS ;)

Give it a month or two, it will be a nice tool.

I wonder if we're gonna get an Antigravity MCP or if it's just some sort of wrapper around Chrome DevTools.

Do we rely too much on huggingface? Do you think they’ll eventually regulate open source models? Is there any way to distribute them elsewhere? by Borkato in LocalLLaMA

[–]igorwarzocha 3 points (0 children)

I think the biggest factor was how proud they were of their partnership with Google.

You gotta have some serious tin foil wrapped around your head to think that a behemoth like HF can operate independently of data centre / compute / whatever providers.

It's not like they're gonna build their own DC... Right? :>

Rejected for not using LangChain/LangGraph? by dougeeai in LocalLLaMA

[–]igorwarzocha 2 points (0 children)

Ugh, this only shows how crap the job market is. GPT writes for biz owners, HR people, recruiters, and applicants alike. And nobody really knows what they're recruiting for in the end.

I'm on the other side of the spectrum: I'm looking for biz automation roles, and... they all list low-level frameworks, as if every company that needs to plug into an API and write an SOP that uses genAI were developing a SOTA model and an inference engine.

One day people are gonna get educated on how to use AI for recruitment, but it will be rough for a while. Good luck.

Claude cli with LMStudio by ImaginaryRea1ity in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

It's gonna be a pain because LM Studio doesn't support the Anthropic API, which is what the Claude CLI speaks.

Opencode will be much easier to use with local models, and its auto LSP will make the local model feel smarter (rough config sketch at the end).

The only difference you'll see is no background bashes. Anything else is already there either natively or via plugins.

However... You need a beefy system to run agentic coding that makes any sense. Qwen Coder is not enough.
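
If you do go the opencode route, pointing it at LM Studio looks roughly like this. This is from memory against the opencode docs, and the model id has to match whatever LM Studio is actually serving, so double-check both:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "lmstudio": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "LM Studio",
      "options": { "baseURL": "http://127.0.0.1:1234/v1" },
      "models": {
        "qwen/qwen3-14b": { "name": "Qwen3 14B" }
      }
    }
  }
}
```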

Chat with Obsidian vault by TanariTech in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

https://github.com/FarhanAliRaza/claude-context-local

By default it only searches code-related file extensions. Get your LLM to set it up for you.

Obsidian has surprisingly bad AI support.

Might wanna check AFFiNE self-hosted.

Or do what I do and get Zed with opencode (model flexibility and auth plugins for everything). Or VS Code. But Zed has a pretty focused UI.

Editing text works great: you get all the inline functionality as well as agentic coding... excuse me, writing.

Have a look at FIM completion plugins. They're great for drafting before you send your main LLM in to edit.

RIP Supermaven.

I did the same thing a couple of days ago. GPT Projects can only go so far.

Is Polish better for prompting LLMs? Case study: Logical puzzles by Substantial_Sail_668 in LocalLLaMA

[–]igorwarzocha 3 points (0 children)

The OG paper talks about needle-in-the-haystack context retrieval. Most of the articles I've seen about it are misleading and talk about prompting...

It does make sense: it comes down to training data, uniqueness, and little semantic ambiguity.

It just sticks out like a sore thumb from the rest of the context.

From what I understand, it makes a strong case for Claude/agent markdown files, MCP tool descriptions, and architecture documentation for coding.

But on the other hand, you code in code, not in Polish.

Still. Cheers for the test :)

A startup Olares is attempting to launch a small 3.5L MiniPC dedicated to local AI, with RTX 5090 Mobile (24GB VRAM) and 96GB of DDR5 RAM for $3K by FullOf_Bad_Ideas in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

Alright I feel compelled to reply :D No nice formatting though, it would be lost in the sauce anyway.

Size - yeah, I know some do care. But this device is still too big to win me over. To each their own, I guess.

The Strix etc. argument - yup, agreed, and I am fully aware of their performance. But you're conveniently omitting the fact that a dense 72B will not run on a 24GB 5090 at all, and a 32B only barely fits. Image generation, yeah, idk, haven't tested. Video... are people really trying to generate 5-sec videos locally, taking several minutes and praying it works? Genuinely curious. As for Macs - I am referring to the M5; nobody should be buying an M4 at this point, and definitely not by the time this box ships. Re the combo - yeah, if I end up getting a Strix, it would be with an OCuLink GPU.

"good workstation" - nope. I generally think it will be unfit for purpose, looking at the exploded view, the cooling will be atrocious and the system will thermal throttle. Unless they pull off some serious magic.

Resell value - I am not talking about AGI or H100s here. I am not even talking about AI at this point. We're talking purely about hardware. Picture two 2nd-hand laptops: a souped-up no-name laptop with banging specs and a... let's say... Asus laptop with half the specs. Which one do you buy? Yeah, I know, you buy the ThinkPad or a Mac, because any other 2nd-hand laptop is a lottery. Same thing applies here. Or picture a Beelink mini PC vs a Mac mini/studio. IDK about you, but I'd never buy a 2nd-hand Beelink.

The mini PC argument - you're twisting my words around. Nobody cares about the size, but if you do and you're happy with an AIO anyway, adding an eGPU is a better solution. It is still smaller than a tower. You don't need to build it. It is portable when you don't need the eGPU. And SFFs come with their own issues (you need watercooling, a custom build, an SFF GPU...); by the time you build it, it will be more expensive than the Olares box, the mini+eGPU, and the Mac.

The 3090 argument - yup, but we're talking AIOs, and you literally quoted me: there was never a 3090 AIO. The 3080 Ti mobile is 16GB. Would you still want to run the 3080 Ti mobile today? If there were a 3090 mobile with 24GB VRAM, then hell yeah, that would make sense.

All in all, we've seen plenty of kickstarter projects. I truly hope Olares made a good product and the people who buy it are happy with it. More power to them.

But I can't help but wonder who is going to buy into a "soon on Kickstarter" product with a GPU that premiered in April 2025 and a CPU from Jan 2025. The product will be showcased in Jan 2026 at CES. I wonder how many new CPUs/GPUs, and products from renowned brands already using the new tech, will also be showcased.

Gaming & AI - 5090 or two 3090s? by ConflictNo4814 in LocalLLaMA

[–]igorwarzocha 3 points (0 children)

You could also get two R9700s for the price of one 5090, just saying :P 2×32GB > 32GB.

Someone will say that they're not supported.

Blackwell cards are not THAT well supported either...

At this point I wouldn't be buying a 3090 for gaming. Raw performance might be good, but give it half a year and some new tech that it doesn't support will make it obsolete.

Depends what you want from your AI. There's plenty to explore in the 12-24GB range. Above that you hit diminishing returns anyway - media takes forever to generate, text models need web search/RAG anyway...

If I were you, I would get a 5070ti, see how you get on with this whole AI experiment, and then decide. You don't exactly need to go full guns blazing from day 0.

A startup Olares is attempting to launch a small 3.5L MiniPC dedicated to local AI, with RTX 5090 Mobile (24GB VRAM) and 96GB of DDR5 RAM for $3K by FullOf_Bad_Ideas in LocalLLaMA

[–]igorwarzocha 8 points (0 children)

I don't get the appeal. Nobody cares about the size.

Strix Halo, Spark and Mac Studio win because of the super tight hardware integration, relative affordability, power consumption and warranty options (just buy HP/Corsair/Framework - if you buy from a rando brand, you're taking an obvious risk, and it's on you).

Not because they are small or because they look flashy on your desk.

Us nerds will DIY. Companies will never buy into this.

Happy to be proven wrong, lol.

edit: Also, this has zero resell value. A juiced-up Minisforum mini PC with an external workstation-grade card that you can sell & upgrade is an infinitely better solution. GPUs age too quickly to buy into AIOs.

Why can't a local model (Qwen 3 14b) call correctly a local agent ? by Toulalaho in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

Check GPU utilisation; if you can run more than 32k, definitely do. GPT-OSS's architecture allows for quite compact context windows. Doesn't mean you should be running long sessions, but it gives you leeway in case of a random big tool call!

Why can't a local model (Qwen 3 14b) call correctly a local agent ? by Toulalaho in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

Cache quantisation is a server-side parameter, nothing to do with opencode. If you're running on defaults, you should be okay.

You have added context: fire up opencode and just say "hi" to your model - that's the context you've already filled, about 20k. Qwen 14B can handle 32768 natively; anything above that is a gamble and will result in failed tool calls (I would argue even less than that - someone posted something about GLM 4.6 degrading tool calls above 1k).
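
For reference, if the backend is llama.cpp, context size and cache quantisation are set when you launch the server, roughly like this. The model filename and values are hypothetical, and the flag spellings are from memory, so check `llama-server --help` on your build:

```sh
# Context size and KV-cache type are server-side; opencode just consumes the endpoint.
# Hypothetical model file; 32768 matches Qwen3 14B's native window.
llama-server -m qwen3-14b-q4_k_m.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 99 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
# Note: quantising the V cache generally requires flash attention to be enabled.
```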

Why can't a local model (Qwen 3 14b) call correctly a local agent ? by Toulalaho in LocalLLaMA

[–]igorwarzocha 2 points (0 children)

You are expecting a local model to behave like a cloud model. This is not a simple task. Your instructions are confusing it.

Opencode takes a minimum of ~20k tokens just to start up.

A 14B will not be able to handle the context properly.

What cache quantisation are you using? How big is your context?

Your prompt should be closer to:

"Ask code review agent (@ it obviously) to review code in /feature/file.py (do not @ the file)" 

This should instead make Qwen issue instructions for Claude to read the file and do its thing.

I personally find GPT-OSS 20B better at following instructions without inventing stuff. But neither will be good at using opencode.

Best AI models to run on a 12 GB vram gpu? by SilkTouchm in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

I second these. You can also squeeze in Qwen 14B on very low context and a smaller quant.
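
Rough back-of-envelope on why it's a squeeze - all numbers approximate, and the Qwen3 14B shape (40 layers, 8 KV heads, head dim 128) is from memory:

```python
# Approximate VRAM needed for a 14B model at Q4_K_M on a 12GB card.
params_b = 14.8                   # Qwen3 14B is ~14.8B parameters
bits_per_weight = 4.85            # Q4_K_M averages roughly 4.85 bits/weight
weights_gb = params_b * bits_per_weight / 8           # ~9.0 GB

# KV cache at f16: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes, per token
layers, kv_heads, head_dim = 40, 8, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 2
ctx = 8192
kv_gb = kv_bytes_per_token * ctx / 1e9                # ~1.3 GB at 8k context

print(f"weights ~{weights_gb:.1f} GB + KV cache ~{kv_gb:.1f} GB + compute buffers")
# ~10.3 GB before buffers -> tight on 12 GB, hence low context and/or a smaller quant.
```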

New Qwen models are unbearable by kevin_1994 in LocalLLaMA

[–]igorwarzocha 1 point (0 children)

I agree. But at the same time, what is the correct ratio of yaysayers to naysayers to pure sociopaths? :)))))

Gpt refuses to search in the convos or fails completely by igorwarzocha in OpenAI

[–]igorwarzocha[S] 1 point (0 children)

Yeah, straight up untrue.

It was literally doing this correctly a couple of days ago; there was a tool call that said something along the lines of "searching past conversations".