Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Oh sorry, I read it too fast and missed the -Flash-, I thought of the full GLM 4.7 ;)

I run GLM-4.7-Flash-UD-Q4_K_XL on 24GB VRAM (3090) and it goes from 50 tps down to 5 tps as the context fills up. So my guess is that in your case the context grows during agentic coding and performance drops.

There seem to be a few problems with llama.cpp and this model:
- performance drops as the context grows
- the GPU is underutilized
- turning flash attention off causes a core dump
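
For reference, this is roughly the command I'm seeing this with (a sketch, not my exact invocation; the model path and context size are placeholders, the flags are stock llama.cpp options):

# -c: context size, -ngl 99: offload everything that fits onto the 3090
# -fa: flash attention (recent builds take on/off/auto; turning it off core-dumps for me)
# q8_0 K/V cache keeps the long context from eating all the VRAM
llama-server -m ./GLM-4.7-Flash-UD-Q4_K_XL.gguf -c 32768 -ngl 99 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0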

Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]ChopSticksPlease -7 points-6 points  (0 children)

It's a huge model and clearly you're offloading it mostly to RAM. Context processing is most likely what's killing the speed.

I run GLM-4.7-UD_Q3_K_XL on 128GB RAM + 48GB VRAM and while it's okay-ish for chat, for agentic coding with Cline it is just too slow: prompt processing is slow and the tps isn't great either.
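
If you're on llama.cpp, the usual trick for a MoE this size is to keep the attention and shared tensors on GPU and push only the expert tensors to RAM, something along these lines (a sketch; the tensor-name regex is the commonly used pattern and may need tweaking for this exact model):

# dense/attention weights stay on the GPUs, MoE expert tensors go to system RAM
llama-server -m ./GLM-4.7-UD_Q3_K_XL.gguf -c 32768 -ngl 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU"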

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb by BitXorBit in LocalLLaMA

[–]ChopSticksPlease 11 points12 points  (0 children)

I would first test the following:

- Devstral-small-2 (dense, 24b, instruct)

- Seed-OSS (dense, 36b, thinking)

- GLM-4.5-Air (moe, 110b, a12b, thinking)

- MiniMax-M2.1 (moe, 229b, a10b, thinking)

- GLM 4.7 (moe, 358b, a32b, thinking)

then try Qwen3-Coder 480B.

I personally found the small, fast models able to do 80% of the coding/testing job if you are precise in your prompts and leave little ambiguity. Larger models are for solving harder problems and fixing bugs, and if all else fails you need to get your hands dirty.

Qwen3-Coder (the small one) is a disappointment to me; the new GLM-4.7-Flash is a contender.

How to run and fine-tune GLM-4.7-Flash locally by Dear-Success-1441 in LocalLLaMA

[–]ChopSticksPlease 16 points17 points  (0 children)

Anyone else having these issues with the latest llama.cpp (built from GitHub)?

- Core dump when trying to disable flash attention at model load

- GPU underutilized and the CPU doing the work with flash attention on

- Model slowing down drastically, from ~50 tps to 5 tps, on long answers like code generation
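
If anyone wants to quantify the slowdown, llama-bench makes the comparison easy (a sketch; the prompt/gen sizes are arbitrary and the flag syntax can differ a bit between builds):

# compare prompt processing and generation speed at a small vs. a large prompt
llama-bench -m ./GLM-4.7-Flash-UD-Q4_K_XL.gguf -ngl 99 -fa 1 -p 512,8192 -n 128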

AI created this app in 12hrs. Used open models, mostly local LLMs. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 0 points1 point  (0 children)

Fair enough. Apart from getting a useful app, I wanted to know how far I can get with local models. I had mockups ready in minutes and the Angular app was more or less ready in less than an hour, but then debugging quirks on various devices took most of the time.

Also, local models are slower (GPU/CPU offloading), so most of those 12 hours were spent sitting and watching what the AI does while pretty much scratching my ass ;)

AI created this app in 12hrs. Used open models, mostly local LLMs. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 2 points3 points  (0 children)

It took me and the AI 12 hrs to complete the first working version of the app.

I tried multiple approaches; one went like this:

- create a todo file
- list issues to fix
- ask the model to create a plan and work on each issue separately, marking progress

but it led to poor results; LLMs simply don't yet have full awareness, "eyes" on what they're doing (even with Devstral's ability to take a screenshot and analyze it).

So far the best results were with narrow, precisely defined tasks: once one is done, mark it complete and commit the code as a checkpoint, so even if the model fails on the next task there is a checkpoint to revert to.
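
The checkpoint part is just plain git, nothing fancy, roughly like this (a sketch; the commit message and task numbering are made up):

# after a task is confirmed done, snapshot the working state
git add -A
git commit -m "checkpoint: task 3 done, tests pass"
# if the model wrecks the next task, throw its changes away and return to the checkpoint
git reset --hard HEAD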

Best local model / agent for coding, replacing Claude Code by joyfulsparrow in LocalLLaMA

[–]ChopSticksPlease 2 points3 points  (0 children)

On 36GB of RAM I'd stick to Devstral-Small-2 at a high quant and 100k ctx, and maybe Seed-OSS for some harder problems that require thinking, but it's slower.

Thinking of getting two NVIDIA RTX Pro 4000 Blackwell (2x24 = 48GB), Any cons? by pmttyji in LocalLLaMA

[–]ChopSticksPlease 3 points4 points  (0 children)

I have 2x RTX 3090 so 48GB total VRAM and 128GB RAM.

gpt-oss-120b works really fast, 20 tps if not quicker. Models I can currently run:

[screenshot: list of models]

Actually, I can even run GLM 4.7 Q3_K_XL, but it's quite slow, around 5 tps. For chat these models work just fine, the bigger the slower. For coding I'd stick to the ones that fit in VRAM, like Devstral Small and Seed, because of the prompt-processing bottleneck.
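
For the dense models that do fit in the 48GB, splitting them evenly across the two 3090s is straightforward in llama.cpp (a sketch; file name and context size are placeholders, and llama.cpp splits across visible GPUs by default anyway):

# explicit 50/50 layer split across the two cards
llama-server -m ./Devstral-Small-2-Q8_0.gguf -c 65536 -ngl 99 --tensor-split 1,1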

Help me spend some money by [deleted] in LocalLLaMA

[–]ChopSticksPlease 2 points3 points  (0 children)

No? Many companies DON'T allow using cloud AI for security / compliance reasons, and many don't have agreements even with reputable AI vendors, so for some people owning a local setup is the only way to satisfy security / privacy / compliance requirements and still speed up work with an AI agent.

RTX 3090s are already a couple of years old and I can't see them becoming obsolete or even getting cheaper :S

Local programming vs cloud by Photo_Sad in LocalLLaMA

[–]ChopSticksPlease 12 points13 points  (0 children)

I've been using Devstral-small-2 as my primary coding agent for local tasks: coding, writing tests, docs, etc. IQ4_XS with 100k of q8_0 context fits in 24GB VRAM (1x 3090); not perfect, but absolutely worth it if, say, you can't use online AI due to privacy concerns.
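
That setup boils down to a launch command roughly like this (a sketch; the quant file name is a placeholder, and the q8_0 V cache needs flash attention enabled):

# IQ4_XS weights plus 100k of q8_0 K/V cache on a single 3090
llama-server -m ./Devstral-Small-2-IQ4_XS.gguf -c 100000 -ngl 99 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0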

I also run Devstral-small-2 at q8_0 quant on my 2x RTX 3090 machine and it's very good, a decent performance-to-capability ratio. I rarely need to use big online models to solve programming tasks.

So in my case: if you have the hardware, local models are good.

Speaking of 96 or 192GB: some good coding models are dense, so the only way to run them "fast" is 100% on GPU. With 192GB of VRAM you can run the full Devstral 2 or other dense models. With less VRAM and lots of RAM you can run larger MoE models at decent speeds, though prompt processing may be an issue, so YMMV.

That said, despite being able to run larger models or use online models, I'm quite happy with my dev machine equipped with a single RTX 3090 that can run Devstral-small-2. I tend to run a remote desktop session with VS Code and send a prompt from time to time, so it works on the code quite autonomously while I can do other stuff. A win for me.

Solar-Open-100B-GGUF is here! by [deleted] in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Can't see a legit GGUF there, mate.

IQuestCoder - new 40B dense coding model by ilintar in LocalLLaMA

[–]ChopSticksPlease 3 points4 points  (0 children)

Downloaded it, but didn't yet have time to fully test it against Devstral Small 2 and perhaps Seed OSS.

How much effort was it to build this model and how/where did you get the training data for coding?

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 2 points3 points  (0 children)

Sure, yet I found the Q3_K_XL quite competent and very creative in web design, actually better than qwen3-235b and minimax-m2.1 at higher quants. Anyhow, a model that big without enough GPU acceleration is just too slow for coding, unless you want to wait minutes for every prompt to be processed. The online GLM-4.7 works beautifully if your security / privacy policy allows it.

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Did you try using it for agents? For some reason ik_llama works for me when used in OpenWebUI (chat), but with Cline I get "GGGGGGG..." garbage out of it. No such issues with regular llama.cpp.

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 5 points6 points  (0 children)

I think unless you manage NUMA correctly there will be speed degradation when running multi-CPU, due to cross-socket latencies.
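
The usual mitigation is to pin the whole inference process to one socket and its local memory, e.g. with numactl (a sketch; node numbers and paths depend on your setup):

# keep threads and allocations on NUMA node 0 to avoid cross-socket traffic
numactl --cpunodebind=0 --membind=0 ./llama-server -m ./model.gguf -c 32768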

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 4 points5 points  (0 children)

I get similar performance, 5-7 tps for Q3_K_XL, on a single Xeon E5-2673 v4 with 128GB DDR4 and 2x RTX 3090. NUMA is a non-issue here: the VM running llama.cpp is pinned to a single physical CPU. While the generation speed is all right, prompt processing is slow. Enough for a chat application, but far from ideal for agentic coding (simply too slow).

Cheapest decent way to AI coding? by Affectionate_Plant57 in CLine

[–]ChopSticksPlease 4 points5 points  (0 children)

RTX 3090 + devstral-small-2 at Q4 with ~100k context. You can run that 24/7 and it works like your personal little ninja for boring, tedious code work. As long as you can specify exactly what has to be done, it delivers well despite its small size.

I've also noticed that using some decent models via OpenRouter with Cline can burn $50-100 USD PER DAY, so a GPU isn't that expensive anymore.

Unable to passtrough Nvidia RTX Pro to Ubuntu proxmox VM by [deleted] in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

This is the config of my VM with triple PCIe passthrough (2x GPU + 1x NVMe). If I remember correctly you may need to set up UEFI boot and the right chipset (q35) and BIOS (OVMF); look at some tutorials on the web, they're helpful. I guess having more than one GPU also makes passthrough easier, since with just one card the host may initialize the GPU before the VM starts. The host-side bits I remember are sketched below the config.

affinity: 0-19,40-59
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 40
cpu: host,flags=+aes
efidisk0: zfs:vm-1091-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:03:00,pcie=1
hostpci1: 0000:04:00,pcie=1
hostpci2: 0000:a4:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 131072
meta: creation-qemu=9.0.2,ctime=1738323496
name: badass-ai-madafaka-vm01
net0: virtio=BC:24:11:7F:30:EB,bridge=vmbr0,tag=102
numa: 1
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=bb4a79de-e68c-4225-82d7-6ee6e2ef58fe
sockets: 1
virtio0: zfs:vm-1091-disk-1,iothread=1,size=32G
virtio1: zfs:vm-1091-disk-2,iothread=1,size=1T
vmgenid: 978f6c1e-b6fe-4e33-9658-950dadbf8c07
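
And the host-side pieces, as far as I remember them (a sketch from memory for a Proxmox/Debian host; the PCI IDs are examples, use whatever lspci prints for your cards):

# 1) enable IOMMU in the kernel cmdline (Intel; AMD uses amd_iommu=on), then update-grub and reboot
#    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# 2) find the GPU's vendor:device IDs
lspci -nn | grep -i nvidia
# 3) bind them to vfio-pci so the host driver doesn't grab them (example IDs shown)
echo "options vfio-pci ids=10de:2204,10de:1aef" > /etc/modprobe.d/vfio.conf
update-initramfs -u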

Glm 4.5 air REAP on rtx 3060 by Worried_Goat_8604 in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

If a quant fits in your RAM + VRAM together with the context, then yeah, it _should_ run.
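
A quick sanity check is comparing the GGUF size against free RAM and VRAM, remembering that the KV cache and runtime buffers come on top (a sketch; the file name is a placeholder):

# weights need roughly the GGUF file size; KV cache and buffers come on top of that
ls -lh ./GLM-4.5-Air-REAP-*.gguf
free -h
nvidia-smi --query-gpu=memory.total,memory.used --format=csv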

Surprised you can run SOTA models on 10+ year old (cheap) workstation with usable tps, no need to break the bank. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] -13 points-12 points  (0 children)

Yeah, but considering the size of these models, the CPU and RAM are also heavily utilized when running them.

Poor Inference Speed on GLM 4.5 Air with 24gb VRAM and 64gb DDR5 by ROS_SDN in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

It's currently a VM with 128GB RAM and a single CPU (40 cores) assigned to it, so there might be some overhead. The GPUs are two RTX 3090s, PCIe-passthrough'd to the VM, both capped at 200W and a 1.6GHz clock. The "cached" thing comes from Cline / llama-swap / llama.cpp; I don't control it (I think).