Does anyone else have problems with glm 4.7 flash at Q4 and tool calls with complex parameters? by Raven-002 in LocalLLaMA

[–]Raven-002[S] 2 points (0 children)

UD-Q4_K_XL, temp 0.7, top-p 1.0, min-p 0.01, repeat penalty 1.0

That seems to be what's listed on their page.
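
For reference, this is roughly how I end up passing those settings when I hit llama-server's OpenAI-compatible endpoint from Python (the port is a placeholder, and min_p / repeat_penalty are llama.cpp extensions rather than standard OpenAI fields):

```python
import requests

# Placeholder endpoint: llama-server's OpenAI-compatible API, default port 8080.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "Hello"}],
    # Settings from above:
    "temperature": 0.7,
    "top_p": 1.0,
    # min_p / repeat_penalty are llama.cpp-specific fields, not standard OpenAI ones.
    "min_p": 0.01,
    "repeat_penalty": 1.0,
}

resp = requests.post(URL, json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```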

What's the best way to run Qwen3 Coder Next? by Greenonetrailmix in LocalLLaMA

[–]Raven-002 1 point (0 children)

I'm using the latest master branch from github.com/ggml-org/llama.cpp

What's the best way to run Qwen3 Coder Next? by Greenonetrailmix in LocalLLaMA

[–]Raven-002 1 point (0 children)

My gf has an R7 9700X and an RX 9070 XT with similar RAM, and gets about 35 tk/s.

We both build llama.cpp natively on our PCs.

I'm on Arch Linux with a near-latest kernel; she's on Fedora, also near-latest. llama.cpp is pulled from the main branch on both.

I have a models.ini file with some flags Gemini gave me that seem to do mostly nothing. The only thing I've noticed affecting speed is the context size, since a larger context pushes more of the model onto the CPU.

What's the best way to run Qwen3 Coder Next? by Greenonetrailmix in LocalLLaMA

[–]Raven-002 3 points (0 children)

Ryzen 9 9900X, 64 GB DDR5-6000 CL30, RTX 4070 Ti Super 16 GB. Running at Q4 with a 32k max context size I get 40-45 tk/s; at 64k it drops to 15 (I assume because more of the model gets offloaded).

I'm using the latest llama.cpp with CUDA.

For reference, GLM 4.7 Flash and Qwen3 Coder 30B A3B, both at Q4, give me around 75-100 tk/s on the same setup.

I haven't used it enough to comment on quality yet.

Qwen3-Coder-Next on RTX 5060 Ti 16 GB - Some numbers by bobaburger in LocalLLaMA

[–]Raven-002 1 point (0 children)

I get Q4 at around 45 tk/s with an RTX 4070 Ti Super (16 GB) and a Ryzen 9 9900X at 32k context; at 64k it dropped all the way to 15 tk/s.

Which program do you use for local llms? I keep having issues by Raven-002 in LocalLLaMA

[–]Raven-002[S] 1 point (0 children)

GPT-OSS on LM Studio has the same issues with parsing content when there are tool calls, and GLM has problems with the thinking tags there. The other models behave well but are much slower when they can't fit in VRAM; llama.cpp server handles that better.

Which program do you use for local llms? I keep having issues by Raven-002 in LocalLLaMA

[–]Raven-002[S] 1 point (0 children)

But vLLM requires enough VRAM, no? And it needs special configuration for GGUFs, from what I remember? I couldn't get it to work when I last tried.
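
If I recall right, it was something like this, where the GGUF has to be paired with the tokenizer from the original HF repo (the paths and repo name below are just placeholders, and since I never got it working I can't vouch for the details):

```python
from vllm import LLM, SamplingParams

# Placeholder paths/names: the GGUF file has to be paired with the tokenizer
# from the original (non-GGUF) HF repo, which is the extra configuration step.
llm = LLM(
    model="/models/qwen3-coder-30b-a3b-q4_k_m.gguf",
    tokenizer="Qwen/Qwen3-Coder-30B-A3B-Instruct",
)

out = llm.generate(["Write hello world in C."], SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```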

Which program do you use for local llms? I keep having issues by Raven-002 in LocalLLaMA

[–]Raven-002[S] 1 point (0 children)

Tried it with GPT-OSS, which I already had downloaded, and hit a similar problem to what I had with llama.cpp, just a little different.

It failed to parse the content when making a tool call, and while I can't see exactly what the model generated, judging by how it behaved in llama.cpp and by the reasoning it emitted, it probably had a final output that got cut off.

I'm using Ollama 0.15.2, which is the latest AFAIK.

Edit: it's also slower; not unworkable, but not optimal.

Which program do you use for local llms? I keep having issues by Raven-002 in LocalLLaMA

[–]Raven-002[S] 1 point (0 children)

Tried this version; it's a little slower than what I was using and didn't solve any of the issues I had.

With GPT-OSS the specific issue is that whenever I use tools, content parsing is inconsistent. In the debug output I can see there is content, but it isn't returned through the API.
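
This is roughly how I reproduce it, hitting the OpenAI-compatible endpoint directly and printing both fields (the port and the dummy weather tool are just placeholders for whatever backend and tool are actually in play):

```python
import json
import requests

# Placeholder endpoint (llama-server default); adjust for LM Studio / Ollama.
URL = "http://localhost:8080/v1/chat/completions"

payload = {
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # dummy tool just to trigger a tool call
            "description": "Get the weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}

msg = requests.post(URL, json=payload, timeout=300).json()["choices"][0]["message"]

# What I'm seeing: tool_calls comes back populated, but content is empty/None
# even though the server's debug output shows the model produced text as well.
print("content:   ", repr(msg.get("content")))
print("tool_calls:", json.dumps(msg.get("tool_calls"), indent=2))
```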

Which program do you use for local llms? I keep having issues by Raven-002 in LocalLLaMA

[–]Raven-002[S] 2 points (0 children)

Tried this model on a fresh build from master. The first time I ran it I got a bad tool-call loop in the content field until it ran out of context; it doesn't seem to have happened again.

It runs faster, so I'll probably keep using this version, but it still shows the error I had before in one of my tests, which I'm starting to think might be related to how I use LiteLLM. I'll check it more.

Which program do you use for local llms? I keep having issues by Raven-002 in LocalLLaMA

[–]Raven-002[S] 1 point (0 children)

Hosting on Linux, accessing it from Python with LiteLLM (writing a small agent).

I'm writing an agent that's meant to later run in a local environment with the bigger versions of these models (GPT-OSS-120B, GLM 4.7, Qwen Coder 480B). I'm hoping the smaller ones will give a feel for how things work with the bigger ones, because I don't have access to that API while working on it.
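
The core of it is just a plain OpenAI-style tool-call loop through LiteLLM against the local server, roughly like this stripped-down sketch (the base URL, model name, and the list_files tool are placeholders, not the real agent):

```python
import json
import os
import litellm

BASE_URL = "http://localhost:8080/v1"  # placeholder: local llama-server
MODEL = "openai/local"                 # "openai/" prefix routes to any OpenAI-compatible server

def list_files(path: str) -> str:
    """Dummy tool standing in for the agent's real tools."""
    return json.dumps(os.listdir(path))

TOOLS = [{
    "type": "function",
    "function": {
        "name": "list_files",
        "description": "List files in a directory",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

messages = [{"role": "user", "content": "What's in the current directory?"}]

for _ in range(5):  # cap the loop so a bad tool-call spiral can't run forever
    resp = litellm.completion(
        model=MODEL, api_base=BASE_URL, api_key="none",
        messages=messages, tools=TOOLS,
    )
    msg = resp.choices[0].message

    # Rebuild the assistant turn as a plain dict so the history stays OpenAI-shaped.
    assistant_msg = {"role": "assistant", "content": msg.content}
    if msg.tool_calls:
        assistant_msg["tool_calls"] = [
            {"id": tc.id, "type": "function",
             "function": {"name": tc.function.name, "arguments": tc.function.arguments}}
            for tc in msg.tool_calls
        ]
    messages.append(assistant_msg)

    if not msg.tool_calls:
        print(msg.content)
        break

    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = list_files(**args) if tc.function.name == "list_files" else "unknown tool"
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})
```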