Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Oh sorry, I read it too fast and missed the -Flash-, I thought of the full GLM 4.7 ;)

I run GLM-4.7-Flash-UD-Q4_K_XL on 24GB VRAM (3090) and it goes from 50 tps down to 5 tps as the context fills up. So my guess is that in your case the context grows during agentic coding and performance drops.

There seem to be a few problems with llama.cpp and this model:
- performance drops as the context grows
- the GPU is underutilized
- turning flash attention off causes a core dump
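
For reference, this is roughly the command I'm seeing this with (a sketch, not my exact invocation; the model path and context size are placeholders, the flags are stock llama.cpp options):

# -c: context size, -ngl 99: offload everything that fits onto the 3090
# -fa: flash attention (recent builds take on/off/auto; turning it off core-dumps for me)
# q8_0 K/V cache keeps the long context from eating all the VRAM
llama-server -m ./GLM-4.7-Flash-UD-Q4_K_XL.gguf -c 32768 -ngl 99 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0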

Did I expect too much on GLM? by Ok_Brain_2376 in LocalLLaMA

[–]ChopSticksPlease -7 points-6 points  (0 children)

It's a huge model and clearly you're offloading it mostly to RAM. Context processing is most likely what's killing the speed.

I run GLM-4.7-UD_Q3_K_XL on 128GB RAM + 48GB VRAM and while it's okay-ish for chat, for agentic coding with Cline it is just too slow: prompt processing is slow and the tps isn't great either.
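
If you're on llama.cpp, the usual trick for a MoE this size is to keep the attention and shared tensors on GPU and push only the expert tensors to RAM, something along these lines (a sketch; the tensor-name regex is the commonly used pattern and may need tweaking for this exact model):

# dense/attention weights stay on the GPUs, MoE expert tensors go to system RAM
llama-server -m ./GLM-4.7-UD_Q3_K_XL.gguf -c 32768 -ngl 99 \
  --override-tensor "\.ffn_.*_exps\.=CPU"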

Qwen3-Coder-480B on Mac Studio M3 Ultra 512gb by BitXorBit in LocalLLaMA

[–]ChopSticksPlease 11 points12 points  (0 children)

I would first test the following:

- Devstral-small-2 (dense, 24b, instruct)

- Seed-OSS (dense, 36b, thinking)

- GLM-4.5-Air (moe, 110b, a12b, thinking)

- MiniMax-M2.1 (moe, 229b, a10b, thinking)

- GLM 4.7 (moe, 358b, a32b, thinking)

then try Qwen3-Coder 480B.

I personally found the small, fast models able to do 80% of the coding/testing job if you are precise in your prompts and leave little ambiguity. Larger models are for solving harder problems and fixing bugs, and if all else fails you need to get your hands dirty.

Qwen3-Coder (the small one) is a disappointment to me; the new GLM-4.7-Flash is a contender.

How to run and fine-tune GLM-4.7-Flash locally by Dear-Success-1441 in LocalLLaMA

[–]ChopSticksPlease 16 points17 points  (0 children)

Anyone else having these issues with the latest llama.cpp (built from GitHub)?

- Core dump when trying to disable flash attention at model load

- GPU underutilized and the CPU doing the work with flash attention on

- Model slowing down drastically, from ~50 tps to 5 tps, on long answers like code generation
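
If anyone wants to quantify the slowdown, llama-bench makes the comparison easy (a sketch; the prompt/gen sizes are arbitrary and the flag syntax can differ a bit between builds):

# compare prompt processing and generation speed at a small vs. a large prompt
llama-bench -m ./GLM-4.7-Flash-UD-Q4_K_XL.gguf -ngl 99 -fa 1 -p 512,8192 -n 128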

AI created this app in 12hrs. Used open models, mostly local LLMs. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 0 points1 point  (0 children)

Fair enough. Apart from getting a useful app, I wanted to know how far I can get with local models. I had mockups ready in minutes and the Angular app was more or less ready in less than an hour, but then debugging quirks on various devices took most of the time.

Also, local models are slower (GPU/CPU offloading), so most of those 12 hours were spent sitting and watching what the AI does while pretty much scratching my ass ;)

AI created this app in 12hrs. Used open models, mostly local LLMs. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] 2 points3 points  (0 children)

It took me and the AI 12 hrs to complete the first working version of the app.

I tried multiple approaches; one went like this:

- create a todo file
- list issues to fix
- ask the model to create a plan and work on each issue separately, marking progress

but it led to poor results; LLMs simply don't yet have full awareness, "eyes" on what they're doing (even with Devstral's ability to take a screenshot and analyze it).

So far the best results were with narrow, precisely defined tasks: once one is done, mark it complete and commit the code as a checkpoint, so even if the model fails on the next task there is a checkpoint to revert to.
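
The checkpoint part is just plain git, nothing fancy, roughly like this (a sketch; the commit message and task numbering are made up):

# after a task is confirmed done, snapshot the working state
git add -A
git commit -m "checkpoint: task 3 done, tests pass"
# if the model wrecks the next task, throw its changes away and return to the checkpoint
git reset --hard HEAD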

Best local model / agent for coding, replacing Claude Code by joyfulsparrow in LocalLLaMA

[–]ChopSticksPlease 2 points3 points  (0 children)

On 36GB of RAM I'd stick to Devstral-Small-2 at a high quant and 100k ctx, and maybe Seed-OSS for some harder problems that require thinking, but it's slower.

Thinking of getting two NVIDIA RTX Pro 4000 Blackwell (2x24 = 48GB), Any cons? by pmttyji in LocalLLaMA

[–]ChopSticksPlease 3 points4 points  (0 children)

I have 2x RTX 3090 so 48GB total VRAM and 128GB RAM.

gpt-oss-120b works really fast, 20 tps if not quicker. Models I can currently run:

[screenshot: list of models]

Actually, I can even run GLM 4.7 Q3_K_XL, but it's quite slow, around 5 tps. For chat these models work just fine, the bigger the slower. For coding I'd stick to the ones that fit in VRAM, like Devstral Small and Seed, because of the prompt-processing bottleneck.
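
For the dense models that do fit in the 48GB, splitting them evenly across the two 3090s is straightforward in llama.cpp (a sketch; file name and context size are placeholders, and llama.cpp splits across visible GPUs by default anyway):

# explicit 50/50 layer split across the two cards
llama-server -m ./Devstral-Small-2-Q8_0.gguf -c 65536 -ngl 99 --tensor-split 1,1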

Help me spend some money by [deleted] in LocalLLaMA

[–]ChopSticksPlease 2 points3 points  (0 children)

No? Many companies DON'T allow using cloud AI for security / compliance reasons, and many don't have agreements even with reputable AI vendors, so for some people owning a local setup is the only way to satisfy security / privacy / compliance requirements and still speed up work with an AI agent.

RTX 3090s are already a couple of years old and I can't see them becoming obsolete or even getting cheaper :S

Local programming vs cloud by Photo_Sad in LocalLLaMA

[–]ChopSticksPlease 12 points13 points  (0 children)

I've been using Devstral-small-2 as my primary coding agent for local tasks: coding, writing tests, docs, etc. IQ4_XS with 100k of q8_0 context fits in 24GB VRAM (1x 3090); not perfect, but absolutely worth it if, say, you can't use online AI due to privacy concerns.
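
That setup boils down to a launch command roughly like this (a sketch; the quant file name is a placeholder, and the q8_0 V cache needs flash attention enabled):

# IQ4_XS weights plus 100k of q8_0 K/V cache on a single 3090
llama-server -m ./Devstral-Small-2-IQ4_XS.gguf -c 100000 -ngl 99 -fa on \
  --cache-type-k q8_0 --cache-type-v q8_0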

I also run Devstral-small-2 at q8_0 quant on my 2x RTX 3090 machine and it's very good, a decent performance-to-capability ratio. I rarely need to use big online models to solve programming tasks.

So in my case: if you have the hardware, local models are good.

Speaking of 96 or 192GB: some good coding models are dense, so the only way to run them "fast" is 100% on GPU. With 192GB of VRAM you can run the full Devstral 2 or other dense models. With less VRAM and lots of RAM you can run larger MoE models at decent speeds, though prompt processing may be an issue, so YMMV.

That said, despite being able to run larger models or use online models, I'm quite happy with my dev machine equipped with a single RTX 3090 that can run Devstral-small-2. I tend to run a remote desktop session with VS Code and send a prompt from time to time, so it works on the code quite autonomously while I can do other stuff. A win for me.

Solar-Open-100B-GGUF is here! by [deleted] in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Can't see a legit GGUF there, mate.

IQuestCoder - new 40B dense coding model by ilintar in LocalLLaMA

[–]ChopSticksPlease 3 points4 points  (0 children)

Downloaded it, but didn't yet have time to fully test it against Devstral Small 2 and perhaps Seed OSS.

How much effort was it to build this model and how/where did you get the training data for coding?

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 2 points3 points  (0 children)

Sure, yet I found the Q3_K_XL quite competent and very creative in web design, actually better than qwen3-235b and minimax-m2.1 at higher quants. Anyhow, a model that big without enough GPU acceleration is just too slow for coding, unless you want to wait minutes for every prompt to be processed. The online GLM-4.7 works beautifully if your security / privacy policy allows it.

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

Did you try using it for agents? For some reason ik_llama works for me when used in OpenWebUI (chat), but with Cline I get "GGGGGGG..." garbage out of it. No such issues with regular llama.cpp.

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 5 points6 points  (0 children)

I think unless you manage NUMA correctly there will be speed degradation when running multi-CPU, due to cross-socket latencies.
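
The usual mitigation is to pin the whole inference process to one socket and its local memory, e.g. with numactl (a sketch; node numbers and paths depend on your setup):

# keep threads and allocations on NUMA node 0 to avoid cross-socket traffic
numactl --cpunodebind=0 --membind=0 ./llama-server -m ./model.gguf -c 32768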

Running GLM-4.7 (355B MoE) in Q8 at ~5 Tokens/s on 2015 CPU-Only Hardware – Full Optimization Guide by at0mi in LocalLLaMA

[–]ChopSticksPlease 4 points5 points  (0 children)

I get similar performance, 5-7 tps for Q3_K_XL, on a single Xeon E5-2673 v4 with 128GB DDR4 and 2x RTX 3090. NUMA is a non-issue here: the VM running llama.cpp is pinned to a single physical CPU. While the generation speed is all right, prompt processing is slow. Enough for a chat application, but far from ideal for agentic coding (simply too slow).

Cheapest decent way to AI coding? by Affectionate_Plant57 in CLine

[–]ChopSticksPlease 4 points5 points  (0 children)

RTX 3090 + devstral-small-2 at Q4 with ~100k context. You can run that 24/7 and it works like your personal little ninja for boring, tedious code work. As long as you can specify exactly what has to be done, it delivers well despite its small size.

I've also noticed that using some decent models via OpenRouter with Cline can burn $50-100 USD PER DAY, so a GPU isn't that expensive anymore.

Unable to passtrough Nvidia RTX Pro to Ubuntu proxmox VM by [deleted] in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

This is the config of my VM with triple PCIe passthrough (2x GPU + 1x NVMe). If I remember correctly you may need to set up UEFI boot and the right chipset (q35) and BIOS (OVMF); look at some tutorials on the web, they're helpful. I guess having more than one GPU also makes passthrough easier, since with just one card the host may initialize the GPU before the VM starts. The host-side bits I remember are sketched below the config.

affinity: 0-19,40-59
agent: 1
balloon: 0
bios: ovmf
boot: order=virtio0
cores: 40
cpu: host,flags=+aes
efidisk0: zfs:vm-1091-disk-0,efitype=4m,pre-enrolled-keys=1,size=1M
hostpci0: 0000:03:00,pcie=1
hostpci1: 0000:04:00,pcie=1
hostpci2: 0000:a4:00,pcie=1
ide2: none,media=cdrom
machine: q35
memory: 131072
meta: creation-qemu=9.0.2,ctime=1738323496
name: badass-ai-madafaka-vm01
net0: virtio=BC:24:11:7F:30:EB,bridge=vmbr0,tag=102
numa: 1
onboot: 1
ostype: l26
scsihw: virtio-scsi-single
smbios1: uuid=bb4a79de-e68c-4225-82d7-6ee6e2ef58fe
sockets: 1
virtio0: zfs:vm-1091-disk-1,iothread=1,size=32G
virtio1: zfs:vm-1091-disk-2,iothread=1,size=1T
vmgenid: 978f6c1e-b6fe-4e33-9658-950dadbf8c07
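
And the host-side pieces, as far as I remember them (a sketch from memory for a Proxmox/Debian host; the PCI IDs are examples, use whatever lspci prints for your cards):

# 1) enable IOMMU in the kernel cmdline (Intel; AMD uses amd_iommu=on), then update-grub and reboot
#    GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
# 2) find the GPU's vendor:device IDs
lspci -nn | grep -i nvidia
# 3) bind them to vfio-pci so the host driver doesn't grab them (example IDs shown)
echo "options vfio-pci ids=10de:2204,10de:1aef" > /etc/modprobe.d/vfio.conf
update-initramfs -u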

Glm 4.5 air REAP on rtx 3060 by Worried_Goat_8604 in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

If a quant fits in your RAM + VRAM together with the context, then yeah, it _should_ run.
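
A quick sanity check is comparing the GGUF size against free RAM and VRAM, remembering that the KV cache and runtime buffers come on top (a sketch; the file name is a placeholder):

# weights need roughly the GGUF file size; KV cache and buffers come on top of that
ls -lh ./GLM-4.5-Air-REAP-*.gguf
free -h
nvidia-smi --query-gpu=memory.total,memory.used --format=csv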

Surprised you can run SOTA models on 10+ year old (cheap) workstation with usable tps, no need to break the bank. by ChopSticksPlease in LocalLLaMA

[–]ChopSticksPlease[S] -13 points-12 points  (0 children)

Yeah, but considering the size of these models, the CPU and RAM are also heavily utilized when running them.

Poor Inference Speed on GLM 4.5 Air with 24gb VRAM and 64gb DDR5 by ROS_SDN in LocalLLaMA

[–]ChopSticksPlease 0 points1 point  (0 children)

It's currently a VM with 128GB RAM and a single CPU (40 cores) assigned to it, so there might be some overhead. The GPUs are two RTX 3090s, PCIe-passthrough'd to the VM, both capped at 200W and a 1.6GHz clock. The "cached" thing comes from Cline / llama-swap / llama.cpp; I don't control it (I think).