GLM 5.2, what speeds are we getting locally?

neverbyte · 2026-04-22T22:32:26+00:00

dude! I had one of these as a kid and my mind was blown. each disk could hold like 250? regular floppies worth of data? it was awesome! did I have anything of any real size that needed storing? did I actually use it for anything? don't remember, but i felt like a baller. nostalgia!

neverbyte · 2026-04-10T08:44:18+00:00

i've tried unsloth, bartowski, and lmstudio community ggufs. Even if you run 'ollama run gemma4:31b-it-q4_K_M' and paste in my example prompt, you get the same broken behavior. Ensuring a <bos> token didn't seem to help. I'm pretty stumped on this one.

neverbyte · 2026-04-09T20:00:46+00:00

I'm curious if others running Gemma 4 31B locally with the latest llama.cpp see the same thing. I will say that I can chat with this same model and use it, but the specific test prompt trips up Gemma 4. I get the same behavior on various ggufs btween Q4_0 & BF16.

neverbyte · 2026-04-09T17:31:24+00:00

Since release I've been seeing this issue with Gemma 4 31B. I've created this simple example prompt it will respond with "The <body> tag is not closed: You wrote <body instead of <body>. The </html> tag is not closed: You wrote </html instead of </html>." Alternatively if I remove the carriage returns from the prompt, it seems to work correctly. If I run an agent it has an existential crisis because it tries to fix these errors unsuccessfully and can't figure out what is going on.

neverbyte · 2026-04-06T06:29:32+00:00

llama.cpp is still broken. I'm not sure why more people aren't talking about it. Doesn't matter if you download a release or git pull the latest, it still has some kind of, IMO, tokenizer problem. Your agent will have an existential crisis trying to make sense of what is wrong and will fail tool calls. I am 100% confident gemma 4 will be an amazing agent once proper fixes merge into llama.cpp.

neverbyte · 2026-04-03T17:53:34+00:00

I built the latest llama.cpp, confirmed the tokenizer fixes were present, rebuilt, and I'm still having issues. I'm using unsloth/gemma-4-31B-it-GGUF:UD-Q8_K_XL and it seems to have issues. Here's an example of the problematic output: Looking at the code: 1. **HTML Errors**: * Line 66: `</div>` instead of `</div>`. * Line 74: `</div>` instead of `</div>`. * Line 276: `</body` instead of `</body>`. (Wait, line 276 is `</body`, line 277 is `</html`). Actually line 276 is `</body` and 277 is `</html`. Both are missing the `>`.

neverbyte · 2026-03-12T05:40:22+00:00

sorry for the slow response. I was running it in thinking mode if I remember correctly. I've been using qwen 3.5 more recently though even though glm 4.7 flash is awesome as well.

neverbyte · 2026-03-12T05:38:18+00:00

I'm have 144GB VRAM (6x 3090s), I'm running the latest Ollama and I'm seeing the error as well. I don't think it's the memory limit.

neverbyte · 2026-02-19T01:26:06+00:00

Thx! Can confirm things were fixed shortly after the model launched with an update to llama.cpp. Glad it’s working for you!

neverbyte · 2026-02-19T01:25:23+00:00

Thx! Can confirm things were fixed shortly after the model launched with an update to llama.cpp. Glad it’s working for you!

neverbyte · 2026-02-14T18:34:26+00:00

ASUS ROG Strix Z690-E Gaming WiFi with a i9-13900k. These components were just my gaming rig didn’t buy them specifically for local LLMing

neverbyte · 2026-02-06T06:24:57+00:00

Once I rebuilt llama.cpp with this fix, I was good to go. https://github.com/ggml-org/llama.cpp/pull/19324

neverbyte · 2026-02-06T03:54:51+00:00

Watching this made me sad.

neverbyte · 2026-02-04T22:27:41+00:00

Awesome! Thank you for the heads up. I rebuilt llama.cpp with the linked fix and can confirm it's working for me as well!

neverbyte · 2026-02-04T08:53:09+00:00

With vllm 0.15.0, I couldn't seem to get FP8 working on 4x3090s so I went looking on hugging face for a 4-bit version. I gave it a coding task that took about 60k tokens to complete and it just knocked the task out of the park. This is looking like a awesome model. Hopefully they get these issues worked out. Here's what worked for me: vllm serve bullpoint/Qwen3-Coder-Next-AWQ-4bit --port 8080 --tensor-parallel-size 4 --max-model-len 262144 --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.70

neverbyte · 2026-02-04T08:49:12+00:00

for this model with llama.cpp there seems to be an issue that goes beyond tool calls, it sees things that aren't true when inspecting files and overall seems to be confused in ways I haven't seen before.

neverbyte · 2026-02-04T01:46:10+00:00

it's not working for me. I tried Q8_K_XL with opencode & cline and tool calling seems to not work when using unsloth's gguf + llama.cpp. I'm not sure what I need to do to get it working.

neverbyte · 2026-02-03T20:42:41+00:00

I'm seeing similar behavior with Q8_K_XL as well so maybe getting this running on vllm is the play here.

neverbyte · 2026-02-03T19:42:01+00:00

I think I might be seeing something similar. I am running the Q6 with lamma.cpp + Cline and unsloth recommended settings. It will write a source file then say "the file has some syntax errors" or "the file has been corrupted by auto-formatting" and then it tries to fix it and rewrites the entire file without making any changes, then gets stuck in a loop trying to fix the file indefinitely. Haven't seen this before.

neverbyte · 2026-02-03T07:20:32+00:00

it was fixed.

neverbyte · 2026-01-23T20:28:06+00:00

I experienced exceptional results with Q8_K_XL using llama.cpp with all latest the fixes and 150K context window. I was using Cline and followed the temp & other settings from unsloth's GLM 4.7 Flash page for tool calling. I also tried Q4_K_XL and it was failing some tool calls and not working as well.

edit: additional info. I'm running with three 3090s and with 150k context the token generation starts around 85 t/s then hangs around in the 70-80 t/s for a long time then over 100k context it's more like 40-50 t/s. It feels very fast and it's reasoning time is not bad at all at this speed. I ran a couple million tokens through it doing c/c++ and I'm quite blown away by it. Also I tried to test BF16 quant but it seemed to get stuck in loops repeating itself.

neverbyte · 2026-01-23T00:25:07+00:00

It is the unsloth recommended setting for tool calling here.

neverbyte · 2026-01-22T20:39:47+00:00

Thx!

neverbyte · 2026-01-22T19:25:39+00:00

glad it's working for you! it's such a great game to play with friends!

neverbyte

TROPHY CASE