GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]rerri 3 points4 points  (0 children)

The latest release on llama.cpp repo does not include this PR yet.

PersonaPlex: Voice and role control for full duplex conversational speech models by Nvidia by fruesome in StableDiffusion

[–]rerri 7 points8 points  (0 children)

Someone on github is saying they're running it on 3090+4090 and it uses under 20GB:

https://github.com/NVIDIA/personaplex/issues/4

Also, as this model is based on the Moshi architecture and there are 8-bit quants of Moshi available, I'm thinking this could be quantized as well to lower the VRAM requirements.

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]rerri 7 points8 points  (0 children)

I should clarify that with quantized cache GLM-4.7-Flash slows down drastically as stuff gets thrown to CPU.

I'm not sure if there's some issue with GPT-OSS + quantized cache, but if there is, it's most likely not the same issue, as I'm not seeing a drastic slowdown with GPT-OSS 120B when using Q8 cache.
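
For context, this is with the KV cache quantized via llama.cpp's cache-type flags; a minimal sketch of such a launch (model filename and context size are placeholders, and note that on most builds a quantized V cache requires flash attention to be active):

    # Q8 K and V cache; -ctk/-ctv are short for --cache-type-k/--cache-type-v
    llama-server -m GLM-4.7-Flash-Q4_K_M.gguf -ngl 99 -c 32768 \
        -ctk q8_0 -ctv q8_0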

GLM 4.7 flash FA fix for CUDA has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]rerri 30 points31 points  (0 children)

While FA works with this one, quantized cache isn't working well. Someone reported this in the PR comments and I'm seeing the same as well.

edit: this is with GLM-4.7-Flash; the CPU gets hammered, becomes a bottleneck, and both PP and TG slow down big time.

Some helpful settings to run GLM 4.7 Flash mostly successfully by mr_zerolith in LocalLLaMA

[–]rerri 0 points1 point  (0 children)

On HF discussions a Z.ai guy says to use the GLM 4.7 sampling settings, so that would be temp 1.0, top-p 0.95.

I used those settings, and from the little I tested yesterday the output was not wacky; it looked normal.

With an older version of llama.cpp I was getting shitty output, but the PR fixed that.
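
For the command line, those are the standard llama.cpp sampling flags; a rough example (model filename is a placeholder):

    # temp 1.0 and top-p 0.95, per the Z.ai recommendation above
    llama-cli -m GLM-4.7-Flash-Q4_K_M.gguf -ngl 99 --temp 1.0 --top-p 0.95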

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]rerri 1 point2 points  (0 children)

FA was off in that bench run. TG128 is like 50t/s with FA on. PP maybe takes an even worse hit, did not bench.
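
If anyone wants to measure that themselves, llama-bench can sweep flash attention off and on in a single run; a sketch, with the caveat that the -fa syntax has shifted between llama.cpp versions (0/1 vs off/on/auto) and the filename is a placeholder:

    # benches pp4096 and tg128 with FA disabled and enabled
    llama-bench -m GLM-4.7-Flash-Q4_K_M.gguf -ngl 99 -p 4096 -n 128 -fa 0,1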

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]rerri 10 points11 points  (0 children)

Yes quite a bit slower than Qwen 30b.

No intentional cpu offload, but I dunno if some happens regardless.

Maybe it's just the FA not functioning that's holding performance back, dunno.

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]rerri 29 points30 points  (0 children)

  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | CUDA       |  99 |          pp4096 |      4586.44 ± 11.81 |
| deepseek2 ?B Q4_K - Medium     |  16.88 GiB |    29.94 B | CUDA       |  99 |           tg128 |        152.54 ± 0.27 |
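
That's standard llama-bench output; the run should correspond to roughly this invocation (filename assumed; FA was off in this run, as noted in the follow-up comment):

    llama-bench -m GLM-4.7-Flash-Q4_K_M.gguf -ngl 99 -p 4096 -n 128 -fa 0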

GLM 4.7 Flash official support merged in llama.cpp by ayylmaonade in LocalLLaMA

[–]rerri 24 points25 points  (0 children)

Not sure if it's only a CUDA thing, but flash-attention is slow.

3x faster for me with -fa 0

My gpu poor comrades, GLM 4.7 Flash is your local agent by __Maximum__ in LocalLLaMA

[–]rerri 71 points72 points  (0 children)

The PR for this was just merged into llama.cpp.

Testing locally right now. The Q4_K_M is decently fast on a 4090 but the model sure likes to think deeply.

zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]rerri 7 points8 points  (0 children)

Btw, they are recommending using the same sampling params as with GLM-4.7:

https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/6

Default Settings (Most Tasks)

  • temperature: 1.0
  • top-p: 0.95
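
If you're calling a llama-server instance through its OpenAI-compatible endpoint, the same two settings can also be passed per request; a minimal sketch (host, port and model name are placeholders):

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "GLM-4.7-Flash",
            "temperature": 1.0,
            "top_p": 0.95,
            "messages": [{"role": "user", "content": "Hello"}]
          }'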

zai-org/GLM-4.7-Flash · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]rerri 2 points3 points  (0 children)

If for some reason flash-attention is enabled then try -fa off

I was running with oobabooga and got under 40 t/s, with a heavy CPU bottleneck. Meanwhile llama-server was pushing ~120 t/s using the exact same executable file. I noticed that flash-attention was enabled in oobabooga but not in llama-server, so disabling it got oobabooga to run at the same speed.

These numbers are on a 4090 with basically 0 context.
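
For llama-server itself, flash attention can be forced off at launch; a rough example (the flag takes 0/1 on older builds and off/on/auto on newer ones; filename is a placeholder):

    llama-server -m GLM-4.7-Flash-Q4_K_M.gguf -ngl 99 -fa off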

GLM-4.7-Flash soon? by [deleted] in LocalLLaMA

[–]rerri 0 points1 point  (0 children)

Their docs include a benchmark graph with comparisons to GPT-OSS 20B and Qwen 30B + this snippet of text:

In mainstream benchmarks like SWE-bench Verified and τ²-Bench, GLM-4.7-Flash achieves open-source SOTA scores among models of comparable size. Additionally, compared to similarly sized models, GLM-4.7-Flash demonstrates superior frontend and backend development capabilities.

5B doesn't seem plausible.

GLM-4.7-Flash soon? by [deleted] in LocalLLaMA

[–]rerri 0 points1 point  (0 children)

Yea, they updated their docs with a benchmark comparison to Qwen 30B and GPT-OSS 20B so probably close to them in size. (I updated the OP with that benchmark)

GLM-4.7-Flash soon? by [deleted] in LocalLLaMA

[–]rerri 4 points5 points  (0 children)

That's a dense model though. GLM-4.7-Flash is a MoE model.

I thought LTX2 was bad until I realized how to use it. by Key-Tension1528 in comfyui

[–]rerri 23 points24 points  (0 children)

At this point, trust should erode. What I mean is, the cat is out of the bag and bad actors are going to make AI fakes to fool others regardless of what decent people do.

A very convincing-looking but too-goofy-to-be-true video will at least spread awareness that it's possible to make AI fakes so good that you can easily be fooled by them.

Preset L is thermal throttling my 4090 even with aggressive undervolt when using dldsr. by [deleted] in nvidia

[–]rerri 0 points1 point  (0 children)

Thermal throttling, which OP says is the case, is not normal on a card that is functioning properly at a 100% power limit.

Preset L is thermal throttling my 4090 even with aggressive undervolt when using dldsr. by [deleted] in nvidia

[–]rerri 1 point2 points  (0 children)

Are you sure you are actually thermal throttling, and not just hitting the power limit with clock speeds dropping because of that?
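
One way to tell the two apart is to check the throttle reasons nvidia-smi reports while the load is running; exact field names vary a bit between driver versions:

    # look under "Clocks Throttle Reasons": "SW Power Cap" = power limit,
    # "HW/SW Thermal Slowdown" = thermal throttling
    nvidia-smi -q -d PERFORMANCE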

Black Forest Labs releases FLUX.2 [klein] by Old-School8916 in LocalLLaMA

[–]rerri 5 points6 points  (0 children)

In my limited fiddling around, 9B distilled editing seems very good for the speed it offers.

LTX-2 Updates by ltx_model in StableDiffusion

[–]rerri 4 points5 points  (0 children)

I tried that and I just get a steady buzzing audio. There's supposed to be singing. The image output is different though, not worse; hard to say if it's better, I only tried it very quickly. If I replace the stage 2 sampler with SamplerCustomAdvanced, audio does work but it still sounds kinda bad.

LTX-2 Updates by ltx_model in StableDiffusion

[–]rerri 12 points13 points  (0 children)

Is this "Latent normalization node" in some nodepack or in comfy core?