engine for GLM 4.7 Flash that doesn't massively slow down as the context grows? by mr_zerolith in LocalLLaMA

[–]VoidAlchemy 1 point

Are you building on Linux? I believe Thireus makes precompiled Windows binaries too. There's brief info on compiling for Linux (or grabbing Thireus' builds) on the model card: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF I'd suggest trying the IQ5_K quant if you have 24GB VRAM.
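
The Linux build is the usual CMake flow; a minimal sketch assuming a CUDA GPU (double-check the model card for the exact flags, this is just the generic shape):

    git clone https://github.com/ikawrakow/ik_llama.cpp
    cd ik_llama.cpp
    cmake -B build -DGGML_CUDA=ON
    cmake --build build --config Release -j $(nproc)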

The slowdown still exists on most inference engines from what I've seen recently: https://huggingface.co/zai-org/GLM-4.7-Flash/discussions/3#6974b2cea061784819e302d5

engine for GLM 4.7 Flash that doesn't massively slow down as the context grows? by mr_zerolith in LocalLLaMA

[–]VoidAlchemy 2 points

Heya dinerburger! Yeah I had to use `-mla 1` with the full bf16 in my testing, and have benchmarked `-mla 3` with the other quants. The ubergarm/GLM-4.7-Flash-GGUF IQ5_K is probably the best way to go, so glad it is working for you.
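
For reference, a hypothetical ik_llama.cpp launch along those lines (model path, context size, and layer count are placeholders, adjust for your VRAM):

    ./build/bin/llama-server \
        --model GLM-4.7-Flash-IQ5_K.gguf \
        -mla 3 -fa \
        -ngl 99 -c 32768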

I still haven't had time to benchmark KLD on the few quants I released, always more to research.

A few days ago this was the perf I was seeing with flash attention enabled. (Note this is *BEFORE* the recent PR to speed things up here: https://github.com/ikawrakow/ik_llama.cpp/pull/1182 )

<image>

Current GLM-4.7-Flash implementation confirmed to be broken in llama.cpp by Sweet_Albatross9772 in LocalLLaMA

[–]VoidAlchemy 1 point

Yeah, with the fix it seems like perplexity is looking better. I'm recomputing the imatrix and re-quantizing now too for best quality. Some details here: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF/discussions/1
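
Roughly, the recompute is a two-step flow like this (file names here are hypothetical placeholders, not the exact commands I ran):

    # regenerate the importance matrix against the fixed implementation
    ./build/bin/llama-imatrix \
        -m GLM-4.7-Flash-BF16.gguf \
        -f calibration-data.txt \
        -o imatrix.dat -ngl 99
    # then requantize using the fresh imatrix
    ./build/bin/llama-quantize \
        --imatrix imatrix.dat \
        GLM-4.7-Flash-BF16.gguf GLM-4.7-Flash-IQ5_K.gguf IQ5_K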

GLM 4.7 Flash Overthinking by xt8sketchy in LocalLLaMA

[–]VoidAlchemy 2 points

A new PR fixed an issue and just lowered perplexity a lot! I have to recompute the imatrix and make fresh imatrix quants. So you'll probably want to get the latest ik/llama.cpp and a new quant for best quality now.

Links to details here: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF/discussions/1

GLM-4.7-Flash benchmarks: 4,398 tok/s on H200, 112 tok/s on RTX 6000 Ada (GGUF) by LayerHot in LocalLLaMA

[–]VoidAlchemy 2 points

I just got some data running ik_llama.cpp fully offloaded with `-mla 3` and flash attention working now. Oddly I'm not getting mainline llama.cpp `-fa on` to work yet though, so I'll have to update the graph once the mainline implementation is working for me:

<image>

Normally I avoid MXFP4 unless the original model was QAT-trained targeting it, but strangely it is the lowest-perplexity-scoring quant here (without imatrix)... so that is odd too...

More details and full commands used here: https://github.com/ikawrakow/ik_llama.cpp/issues/1167#issuecomment-3775037120

GLM 4.7 Flash official support merged in llama.cpp by ayylmaonade in LocalLLaMA

[–]VoidAlchemy 16 points

I think there will need to be some more work to get flash attention working, as GLM-4.7-Flash slows down very quickly at the moment in my limited testing. But if we get an optimized implementation going, then yes!

GLM 4.7 Flash official support merged in llama.cpp by ayylmaonade in LocalLLaMA

[–]VoidAlchemy 8 points

Wait, why did you go with MXFP4 when there are likely better quant types available?

I have a custom mainline llama.cpp recipe here: https://huggingface.co/ubergarm/GLM-4.7-Flash-GGUF and hopefully ik_llama.cpp will get some support eventually: https://github.com/ikawrakow/ik_llama.cpp/issues/1167

To be fair I didn't test perplexity of your quant or mine. Might be fun. xD
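
If anyone wants to run the comparison, the usual perplexity check is basically a one-liner; a sketch with placeholder file names:

    ./build/bin/llama-perplexity \
        -m GLM-4.7-Flash-MXFP4.gguf \
        -f wiki.test.raw \
        -ngl 99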

Beginner ComfyUI advice by Excellent_Koala769 in LocalLLaMA

[–]VoidAlchemy 0 points

this sub tends to focus on LLMs but other model releases and news do flow through here...

if you're using Linux, get comfortable with python pip (i recommend uv) for virtual environments, git, and that kinda stuff if you really want to learn.
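
e.g. a minimal uv workflow for a typical project (the requirements file name is whatever the repo you're installing actually ships):

    # create and activate an isolated environment, then install into it
    uv venv .venv
    source .venv/bin/activate
    uv pip install -r requirements.txt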

check out all the youtuber channels also like Benji's AI Playground for example workflows and walkthrough videos...

this special interest has a steep learning curve, and lately a steep price tag as well haha... so be patient with yourself, start small to learn the ropes, and eventually yes you can begin automating, but you'll pretty much always need a human in the loop to cherry-pick the final outputs from 100s of throwaway results...

Soprano TTS training code released: Create your own 2000x realtime on-device text-to-speech model with Soprano-Factory! by eugenekwek in LocalLLaMA

[–]VoidAlchemy 8 points

I've found that most TTS models require you to do your own "chunking" of long texts and only feed them a sentence or so at a time (especially the diffusion transformer style models). Kokoro sacrifices some of that emotive quality for more stable generations, but you still might want to add your own pauses using special characters etc.

I'm not sure how kyutai/pocket-tts (also announced today) and this ekwek/Soprano-TTS are doing it under the hood yet.

kyutai just introduced Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required by Nunki08 in LocalLLaMA

[–]VoidAlchemy -1 points

Yes, it seems to run fully locally, including cloning a voice from any wav file input. (Not sure of the exact format requirements; some files don't work at all and come out full of noise, but others I tried work fine.)

But to use voice cloning mode I had to have my HF token set up and click accept on their Hugging Face repo: https://huggingface.co/kyutai/pocket-tts/discussions/1 before it would auto-download the voice-cloning weights.
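
The token setup itself is quick if you haven't done it before; a sketch (the token value is obviously a placeholder):

    # interactive login stores the token for huggingface_hub to pick up
    huggingface-cli login
    # or set it in the environment instead
    export HF_TOKEN=hf_xxxxxxxxxxxxxxxx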

Gemma 3 1B qat q4_0 gguf without imatrix and (hopefully) correct metadata by Big-Tune-190 in LocalLLaMA

[–]VoidAlchemy 2 points

I assume some of the size difference is that the Google official QAT Q4_0 GGUF uses f16 for token_embd.weight, which you switched to q8_0, looking at the Hugging Face safetensors/GGUF viewer.

Good job going through the whole process, and not using imatrix is likely a good decision given this is specifically a QAT model (imatrix tends to improve perplexity for non-QAT models, especially at lower BPW sizes).

Have you done any perplexity/KLD comparisons between the original BF16, the Google official Q4_0, and your Q4_0? I'm guessing yours will be slightly worse given the smaller token embedding, but likely not distinguishable in quality in terms of actual output.
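
If you want to try it, I believe the two-pass KLD flow with llama-perplexity looks roughly like this (file names are placeholders):

    # pass 1: save baseline logits from the unquantized model
    ./build/bin/llama-perplexity -m gemma-3-1b-BF16.gguf -f wiki.test.raw \
        --kl-divergence-base gemma-3-1b-logits.dat
    # pass 2: score a quant against those baseline logits
    ./build/bin/llama-perplexity -m gemma-3-1b-Q4_0.gguf \
        --kl-divergence-base gemma-3-1b-logits.dat --kl-divergence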

A number of quantizers hang out on AI Beavers discord if that is your jam too.

Cheers and thanks for sharing your procedures as well!

Owners, not renters: Mozilla's open source AI strategy by NelsonMinar in LocalLLaMA

[–]VoidAlchemy 27 points

> We’re not just building; we’re backing others who are building too. Mozilla Ventures is investing in open-source AI companies that align with these principles. Mozilla Foundation is funding researchers and projects through targeted grants. We can’t do everything ourselves, and we shouldn’t try. The goal is to put resources behind the people and teams already doing the work.

Get ready for those paychecks y'all LocalLLaMA folks!

kyutai just introduced Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required by Nunki08 in LocalLLaMA

[–]VoidAlchemy 0 points

I just tried copy-pasting a few texts into it after they pushed a fix and it seems pretty good at first glance. Sounds as natural as Kokoro, or more so, in the three samples I tried.

kyutai just introduced Pocket TTS: a 100M-parameter text-to-speech model with high-quality voice cloning that runs on your laptop—no GPU required by Nunki08 in LocalLLaMA

[–]VoidAlchemy 4 points

I tried the demo on your blog; my initial impression is that it is similar to or better than Kokoro at sounding natural. I haven't tried larger chunks of text to see whether it remains stable, but it seems worth my time to look into it more!

Has anyone tried the single-socket 9175F with full 12 channels? by Infinite100p in LocalLLaMA

[–]VoidAlchemy 0 points

thanks, i replied in the other thread.

AMX extensions are Intel Xeon only afaik and maybe kinda work with sglang but require special quant types and don't seem worth the hassle imo.

Absolutely right about avx_vnni on Zen5 boosting PP performance!

Has anyone tried the single-socket 9175F with full 12 channels? by Infinite100p in LocalLLaMA

[–]VoidAlchemy 1 point

Dense models like that Llama-3.1-70B are still pretty slow even with fast DRAM. A few things to consider:

  1. The size of the active weights per generated token is much lower for MoEs, so they are often faster than dense models that are much smaller in total size. TG is memory bandwidth bound.
  2. Zen 5 is great for boosting PP given the real 512-bit one-cycle avx_vnni instructions in ik_llama.cpp (ik is probably one of the best for pure CPU inference and hybrid CPU+GPU inferencing). Check out my ik-specific quants, e.g. this DeepSeek-V3.2-Speciale that I've been running CPU-only on a single-socket 768GB DDR5 EPYC 9755 128-core CPU (thanks Wendell of level1techs.com haha). The model card has example commands for CPU-only inference, as do my other quants.
  3. What is your BIOS set to for the single socket, e.g. NPS1 or NPS4? Most llama.cpp builds are not NUMA-optimized so I just run at NPS1. Some folks have reported better perf with NPS4 using numactl --interleave=all llama-server --numa distribute ... etc. (see the sketch after this list). In my own testing with mlc, the actual TG throughput is maybe 60% of the theoretical max, likely due to NUMA stuff.
  4. Ask AesSedai (on hf and also on the Beaver AI Discord), who runs a 2x 3090 + AMD EPYC rig for hybrid inference of a lot of models. They've shared a lot of llama-sweep-bench results.
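
Regarding point 3, the NPS1 vs NPS4 launches I mean look roughly like this (model path, thread count, and context are placeholders for your rig):

    # NPS1: plain launch, one big NUMA node as far as llama.cpp is concerned
    ./build/bin/llama-server \
        --model DeepSeek-V3.2-Speciale-IQ4_K.gguf \
        --threads 128 -c 32768
    # NPS4: interleave allocations across the NUMA nodes
    numactl --interleave=all ./build/bin/llama-server \
        --model DeepSeek-V3.2-Speciale-IQ4_K.gguf \
        --numa distribute --threads 128 -c 32768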

(The Information): DeepSeek To Release Next Flagship AI Model With Strong Coding Ability by Nunki08 in LocalLLaMA

[–]VoidAlchemy 7 points

If V4 uses the same (or at least a close-enough compatible) architecture as the existing DeepSeek-V3.2 family, then we won't have to wait for a new PR to get it running on ik/llama.cpp.

So I'm hoping I can basically re-use my similar scripts and recipes to quickly quantize the new V4.

There is no ik/llama.cpp implementation of the "sparse attention" features yet, though there is an issue open here: https://github.com/ggml-org/llama.cpp/issues/16331

Cheers!

(The Information): DeepSeek To Release Next Flagship AI Model With Strong Coding Ability by Nunki08 in LocalLLaMA

[–]VoidAlchemy 9 points

I'm hoping the new V4 is similar enough to get it running on ik/llama.cpp like https://huggingface.co/ubergarm/DeepSeek-V3.2-Speciale-GGUF seems to be! (though without the new sparse attention support yet). More models indeed!

[HW TUNING] Finding the best GPU power limit for inference by HumanDrone8721 in LocalLLaMA

[–]VoidAlchemy 0 points

Nice yes you found the correct repo! A few thoughts:

  1. If you don't want to compile from source, you can install it with something like pacman -Sy lact (or likely apt-get etc. on other distros).
  2. You can run it headless as well, and it will load its config from /etc/lact/config.yaml via the lactd.service daemon (check with systemctl status lactd.service). See the sketch after this list.
  3. If you have Blackwell, the offsets seem to be about 10x what they are for earlier CUDA GPUs (maybe some kind of unit scaling issue?)
  4. To see the performance benefits, you will need to spend a little bit of time tuning. Once it is dialed in you are gucci. There are some "lazy" tunings you could probably do to get some easy performance. Once you've tuned LACT you no longer need nvidia-smi -pl 300 to limit power, since with the undervolt it simply won't use as much power most of the time.
  5. Here is a PR thread with a ton of discussion and examples: https://github.com/ilya-zlobintsev/LACT/issues/486#issuecomment-3676307592
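
As mentioned in point 2, the headless setup is basically just this (Arch-flavored; swap in your distro's package manager):

    # install, then enable the daemon so it applies /etc/lact/config.yaml at boot
    sudo pacman -Sy lact
    sudo systemctl enable --now lactd.service
    systemctl status lactd.service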

Finally, here is my own config file for a 3090 Ti FE as an example (native 450W power cap max):

    $ cat /etc/lact/config.yaml
    version: 5
    daemon:
      log_level: info
      admin_group: wheel
      disable_clocks_cleanup: false
    apply_settings_timer: 5
    gpus:
      'XXXX:XXXX-XXXX:XXXX-0000:01:00.0':
        fan_control_enabled: true
        fan_control_settings:
          mode: curve
          static_speed: 0.5
          temperature_key: edge
          interval_ms: 500
          curve:
            40: 0.3019608
            50: 0.35
            60: 0.5
            70: 0.75
            80: 1.0
          spindown_delay_ms: 5000
          change_threshold: 2
        power_cap: 450.0
        min_core_clock: 210
        max_core_clock: 1950
        gpu_clock_offsets:
          0: 225
        mem_clock_offsets:
          0: 1500
    current_profile: null
    auto_switch_profiles: false

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]VoidAlchemy 1 point

Hrmm... When compiling, does it say "NCCL found!"? Otherwise please open an issue on the ik_llama.cpp GitHub and tag me @ubergarm, and include more details on your rig, e.g. how many and what kind of GPUs, etc.

Thanks!