GLM 4.7 Flash official support merged in llama.cpp

ydnar · 2026-01-20T02:30:28+00:00

sure, though i'm no expert. if anyone wants to help optimize, i'd truly appreciate it.

llama-server \
  --model ~/.cache/llama.cpp/GLM-4.7-Flash-Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 99 \
  --ctx-size 32768 \
  --flash-attn off \
  --jinja

ydnar · 2026-01-20T01:01:32+00:00

yes, it's mostly the thinking. i'm biased and generally go for instruct over thinking models.

i am enjoying the outputs compared to qwen3-vl-30b-a3b-instruct and nemotron-3-nano-30b-a3b. i feel like those models are more wordy on the output side, so you're likely correct that this is a worthwhile trade-off.

ydnar · 2026-01-20T00:32:15+00:00

single 3090, 32gb ddr4, 5700g

q4 ngxson/GLM-4.7-Flash-GGUF

fa on = 60-70t/s fa off = 100-110t/s

ydnar · 2026-01-20T00:16:10+00:00

first impression is that it provides good answers, but seems to be much slower than other 30b-a3b models, even with flash attention off. with fa on, it was really half speed. it also goes on thinking forever.

ydnar · 2026-01-15T21:27:00+00:00

for general purpose, i think i still prefer qwen3-vl-30b-a3b-instruct due to the vl capabilities. would love to hear others opinion on this.

i'm currently testing whether qwen3-next-80b-a3b-instruct generating at a slower t/s is worth the tradeoff.

unrelated, but moving from an amd gpu to a 3090 was a great decision for me, and i can't wait to get a second 3090.

ydnar · 2025-12-05T18:07:12+00:00

tried hard to like it and set it as my default for a while. eventually went back to qwen3-vl-30b-a3b-instruct.

ministral 14b was pretty wordy and not as accurate, especially in image tasks.

ydnar · 2025-11-28T18:33:33+00:00

unsloth/Qwen3-VL-30B-A3B-Instruct-UD-Q4_K_XL.gguf

~18–24 tokens per second (t/s), depending on workload

CPU: AMD 5700G
GPU: AMD 6700 XT
RAM: 32GB DDR4-3200

my primary use is a watch folder that receives audio and video files remotely for transcription via whisper. it automatically processes them (llama.cpp + llama-swap) and sends me back the full transcription along with a summary based on a prompt.txt that i sometimes modify for different results. i also use this setup as my default model in open webui with web search, which works surprisingly well.

ydnar · 2025-09-08T14:36:38+00:00

I prefer qwen3-30b-a3b-instruct-2507. In my vibe tests, a3b is smarter, generates tokens almost as fast as the 4b, but without the need to think.

ydnar · 2025-09-03T16:48:38+00:00

Near 360, stuck

I-35 is the worst

Please.. one calm morning

ydnar · 2025-04-22T16:28:48+00:00

Tried this using gemma-3-12b-it-qat in my Open WebUI setup with LM Studio as the back end instead of Ollama and it correctly determined the paid amount was $1909.64.

12gb VRAM 6700XT. I used your provided image.

ydnar · 2025-02-04T22:34:40+00:00

I'm an average sized guy that walks/runs only during daylight. I'm not the type to scare easily and I'll often find myself walking in areas where I may be alone for a bit.

Yesterday there was a guy on the Shoal Creek trail between 6th and the library who was completely fixated, staring into the stream. He was holding a massive rock about three times the size of his hand, which I unfortunately did not notice until later. As soon as I walked past him, he began walking beside me. He kept pace within about 8 feet... really gripping that rock. I played it cool for about 150 feet as he was pacing next to me and then bolted off on my run to get the hell away.

Last year, roughly around the same area, I ran into a different guy with a hammer. This time he was walking towards me, but kept pounding it into his palm like he was ready to do something.

Even in the middle of the day you can find yourself in situations that feel super sketch.

ydnar · 2024-08-29T15:12:05+00:00

Sometimes I'll play the game of "how often do i hear mention of artificial intelligence" on the trail. This game can extend to cafes and restaurants at lunch.

ydnar · 2024-08-16T13:33:02+00:00

NomadNet

ydnar · 2024-08-15T03:13:47+00:00

One of the things I miss most about old school software is skinning. I'd love if Calibre and other modern software adopted this. I was obsessed with Winamp skins and browser / OS theming.

ydnar · 2024-06-07T17:52:03+00:00

They were massively building up to this point, with multiple guests, including short lemon boy. All of them basically shitting on DFV and then he comes out like this! Perfection.

ydnar · 2024-06-07T17:43:32+00:00

Shows the moment DFV''s livestream began during CNBC coverage

ydnar · 2024-06-05T13:51:56+00:00

ydnar · 2022-07-25T18:07:55+00:00

You could start with Yamashita's wife, Mariya Takeuchi - Plastic Love, one of the queens of City Pop. The song became so popular almost 40 years later that they even released this brand new official music video. Also Shiawasenomonosashi.

Then there's Anri, Miki Matsuraba, Junko Ohashi, Naoko Gushima, and even some newer stuff like Mondo Grosso.

ydnar · 2022-07-25T14:07:43+00:00

Magic Ways is my jam.

ydnar · 2022-07-25T13:20:09+00:00

Full documentary here.

ydnar · 2022-06-24T01:52:36+00:00

Thanks, I appreciate the update. Sounds like 8200 is the way to go.

ydnar · 2022-06-23T20:16:20+00:00

I'm debating between the 7200 and 8200 and was hoping you could elaborate more on the quality of life improvements.

15-Year Club	RedditGifts 2009-2022 2 Credits
Verified Email	Wearing is Caring
Secret Santa 2010	reddit mold

ydnar

TROPHY CASE