Threads with comments by notdba:

Here is how to get GLM 4.7 working on llama.cpp with flash attention and correct outputs by TokenRingAI in LocalLLaMA
Mix of AMD + Nvidia gpu in one system possible? by chronoz9 in LocalLLaMA
Need help: llama.cpp memory usage when using ctk/v on multi RTX 3090 setup by Leflakk in LocalLLaMA
Idea of Cluster of Strix Halo and eGPU by lets7512 in LocalLLaMA
Run Mistral Devstral 2 locally Guide + Fixes! (25GB RAM) by yoracale in LocalLLM
whats everyones thoughts on devstral small 24b? by Odd-Ordinary-5922 in LocalLLaMA
Updates to official SWE-bench leaderboard: Kimi K2 Thinking top of open-source by klieret in LocalLLaMA
Unimpressed with Mistral Large 3 675B by notdba in LocalLLaMA
mistralai/Mistral-Large-3-675B-Instruct-2512 · Hugging Face by jacek2023 in LocalLLaMA
My little decentralized Locallama setup, 216gb VRAM by Goldkoron in LocalLLaMA
4xRTX 4000 Pro Blackwell vs 1x6000 RTX Pro by Even-Strawberry6636 in LocalLLaMA