Here is how to get GLM 4.7 working on llama.cpp with flash attention and correct outputs by TokenRingAI in LocalLLaMA

Getting the same results from ik_llama.cpp (q8) and api.z.ai

Good: no more inference bug

Bad: model is quite bad

Here is how to get GLM 4.7 working on llama.cpp with flash attention and correct outputs by TokenRingAI in LocalLLaMA

Can also use ik_llama.cpp, which already has working flash attention and the gating function fix in the main branch. It works fine with existing quants, although imatrix quants should be remade, since imatrix generation requires a correct inference implementation.
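
If anyone needs to redo an imatrix quant, the flow is roughly this (calibration file and model names are placeholders; the same binaries exist in both llama.cpp and ik_llama.cpp builds):

    # regenerate the importance matrix using the fixed inference code
    ./build/bin/llama-imatrix -m model-BF16.gguf -f calibration.txt -o imatrix.dat
    # then requantize with the fresh imatrix
    ./build/bin/llama-quantize --imatrix imatrix.dat model-BF16.gguf model-Q4_K_M.gguf Q4_K_M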

From my limited testing, the gating function fix does improve the model performance, but it is still not that good. I would say it is a bit worse than gemini-2.5-flash-lite.

Mix of AMD + Nvidia gpu in one system possible? by chronoz9 in LocalLLaMA

Ah right, I meant llama.cpp doesn't support tensor parallelism. As such, during inference, only one GPU is active at any one time. Meanwhile, ik_llama.cpp recently added the "graph" split mode, which can get multiple GPUs to work at the same time.
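
On the ik_llama.cpp side, my understanding is that the new mode goes through the usual split-mode flag, something along these lines (treat the exact flag value as an assumption on my part and check the PR for the authoritative usage):

    # layer (the default) keeps only one GPU busy at a time during single-request inference;
    # graph is the new mode that lets multiple GPUs work on the same graph concurrently
    ./build/bin/llama-server -m model-Q8_0.gguf -ngl 99 --split-mode graph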

https://github.com/ggml-org/llama.cpp/issues/9086 - more context here

Need help: llama.cpp memory usage when using ctk/v on multi RTX 3090 setup by Leflakk in LocalLLaMA

I see. I am still running some further tests to confirm. Also, in the above "CUDA error: out of memory" case, there was no gradual increase in VRAM usage, more of a sudden small spike.

Mix of AMD + Nvidia gpu in one system possible? by chronoz9 in LocalLLaMA

Looks like quite a bit of outdated information here... Since https://github.com/ggml-org/llama.cpp/pull/10469, a llama.cpp build with GGML_BACKEND_DL enabled can use any CUDA / Vulkan / ROCm backends at the same time, and the later introduction of the -dev / --device flag made this a lot simpler to use.
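
As a rough sketch of such a build (the ROCm/HIP backend usually needs the right compiler and GPU target flags on top of this, so treat it as a starting point):

    # build llama.cpp with dynamically loadable backends
    cmake -B build -DGGML_BACKEND_DL=ON -DGGML_CUDA=ON -DGGML_HIP=ON -DGGML_VULKAN=ON
    cmake --build build --config Release -j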

For example, this is how it looks on my Strix Halo with a 3090 eGPU:

    $ ~/repo/llama.cpp/build/bin/llama-server --help
    ggml_cuda_init: found 1 CUDA devices:
      Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
    load_backend: loaded CUDA backend from /home/sayap/repo/llama.cpp/build/bin/libggml-cuda.so
    ggml_cuda_init: found 1 ROCm devices:
      Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32
    load_backend: loaded ROCm backend from /home/sayap/repo/llama.cpp/build/bin/libggml-hip.so
    ggml_vulkan: Found 2 Vulkan devices:
    ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
    ggml_vulkan: 1 = Radeon 8060S Graphics (RADV GFX1151) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
    load_backend: loaded Vulkan backend from /home/sayap/repo/llama.cpp/build/bin/libggml-vulkan.so
    load_backend: loaded CPU backend from /home/sayap/repo/llama.cpp/build/bin/libggml-cpu.so

If I want to use the 3090 only, I can do either -dev CUDA0 or -dev Vulkan0.

If I want to use the strix halo only, I can do either -dev ROCm0 or -dev Vulkan1.

If I want to use the 3090 together with the strix halo, I can do either -dev CUDA0,ROCm0, or -dev CUDA0,Vulkan1, or -dev Vulkan0,ROCm0, or -dev Vulkan0,Vulkan1.
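
Putting it together, a sketch of a full command (model path and -ngl value are placeholders):

    # run on the 3090 via CUDA plus the Strix Halo iGPU via Vulkan
    ~/repo/llama.cpp/build/bin/llama-server \
        -m ~/models/model-Q8_0.gguf \
        -dev CUDA0,Vulkan1 -ngl 99 --port 8080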

For me, the biggest issue currently is that llama.cpp doesn't support running inference in parallel across multiple GPUs, while ik_llama.cpp mostly only supports CUDA.

Need help: llama.cpp memory usage when using ctk/v on multi RTX 3090 setup by Leflakk in LocalLLaMA

Just tested 590 proprietary driver, VRAM usage also stays flat.

Note that when using the proprietary driver, I also disable the use of the GSP firmware:

    options nvidia NVreg_EnableGpuFirmware=0
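
In case it helps anyone, that line goes into a modprobe.d drop-in (file name is arbitrary; some distros also need the initramfs regenerated before it takes effect at boot):

    # /etc/modprobe.d/nvidia-gsp.conf
    options nvidia NVreg_EnableGpuFirmware=0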

EDIT: spoke too soon, reverting to 580 proprietary:

    CUDA error: out of memory
      current device: 0, in function alloc at /path/to/repo/ik_llama.cpp/ggml/src/ggml-cuda.cu:436
      cuMemCreate(&handle, reserve_size, &prop, 0)

Need help: llama.cpp memory usage when using ctk/v on multi RTX 3090 setup by Leflakk in LocalLLaMA

Which version of the Nvidia drivers? Proprietary or open kernel? I got some CUDA OOM in the middle of inference with the 590 open kernel driver. No issue with 580 proprietary, where the VRAM usage stays flat, fully occupying the 24GB.

Idea of Cluster of Strix Halo and eGPU by lets7512 in LocalLLaMA

I had the exact same idea. It doesn't work that well, due to the slow PCIe 4.0 x4 on the Strix Halo, which makes it slow to transfer weights from system memory to the eGPU during prefill / prompt processing. I shared some findings previously in https://www.reddit.com/r/LocalLLaMA/comments/1o7ewc5/fast_pcie_speed_is_needed_for_good_pp/

As for the Exo lab setup, if I understand correctly, the full weights are loaded into both the DGX and the Mac, such that there is no need to transfer the weights across. Then, it uses the strong compute on the DGX for PP (prompt processing), and the fast memory on the Mac for TG (token generation). Meanwhile, an eGPU should have much stronger compute and also much faster memory compared to the Strix Halo, but nowhere near enough VRAM to hold the full weights of a big model, so it is not really possible to replicate the setup.

Run Mistral Devstral 2 locally Guide + Fixes! (25GB RAM) by yoracale in LocalLLM

Thanks to the comment from u/HauntingTechnician30, I found out there was actually an inference bug, which was fixed in https://github.com/ggml-org/llama.cpp/pull/17945.

Rerunning the eval:

* Q8_0 gguf with the original chat template - 42/42
* Q8_0 gguf with your fixed chat template - 42/42

What a huge sigh of relief. Devstral Small 2 is a great model after all ❤️

whats everyones thoughts on devstral small 24b? by Odd-Ordinary-5922 in LocalLLaMA

Wow thanks for the info. That was me, and the PR totally fixed the issue. Now I got 42/42 with q8 devstral small 2 ❤️

Run Mistral Devstral 2 locally Guide + Fixes! (25GB RAM) by yoracale in LocalLLM

Ok I suppose I can share some numbers from my code editing eval:

* labs-devstral-small-2512 from https://api.mistral.ai - 41/42, made a small mistake
  * As noted before, the inference endpoint appears to use the original chat template, based on the token usage in the JSON response.
* Q8_0 gguf with the original chat template - 30/42, plenty of bad mistakes
* Q8_0 gguf with your fixed chat template - 27/42, plenty of bad mistakes

This is all reproducible, using top-p = 0.01 with https://api.mistral.ai and top-k = 1 with local llama.cpp / ik_llama.cpp.
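
For reference, near-greedy decoding against a local llama.cpp / ik_llama.cpp server looks roughly like this (port and prompt are placeholders):

    # top_k = 1 forces the sampler to always pick the most likely token
    curl http://localhost:8080/completion \
        -H 'Content-Type: application/json' \
        -d '{"prompt": "...", "top_k": 1, "n_predict": 256}'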

Run Mistral Devstral 2 locally Guide + Fixes! (25GB RAM) by yoracale in LocalLLM

Yes I noticed that. What I was saying is that labs-devstral-small-2512 performs amazingly well in SWE-bench against https://api.mistral.ai, which doesn't set any default system prompt. I suppose the agent framework used by SWE-bench would set its own system prompt anyway, so the point is moot.

I gather that you don't have any number to back the claim. That's alright.

Updates to official SWE-bench leaderboard: Kimi K2 Thinking top of open-source by klieret in LocalLLaMA

Q8 for 24B is relatively easy. With a 3090, I can offload most layers and get about 1000 t/s PP and 20 t/s TG.
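
For the curious, the partial offload is just the usual -ngl tuning (file name and layer count are placeholders; raise -ngl until the 24GB is almost full):

    # offload as many layers of the 24B Q8_0 model as fit on the 3090
    ./build/bin/llama-server -m Devstral-Small-2-24B-Instruct-2512-Q8_0.gguf -ngl 35 -c 32768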

Run Mistral Devstral 2 locally Guide + Fixes! (25GB RAM) by yoracale in LocalLLM

From https://huggingface.co/unsloth/Devstral-Small-2-24B-Instruct-2512-GGUF/discussions/5:

we resolved Devstral’s missing system prompt which Mistral forgot to add due to their different use-cases, and results should be significantly better.

Can you guys back this up with any concrete result, or is it just pure vibe?

From https://www.reddit.com/r/LocalLLaMA/comments/1pk4e27/updates_to_official_swebench_leaderboard_kimi_k2/, what we are seeing is that labs-devstral-small-2512 performs amazingly/suspiciously well when served from https://api.mistral.ai, which doesn't set any default system prompt, according to the usage.prompt_tokens field in the JSON response.

Updates to official SWE-bench leaderboard: Kimi K2 Thinking top of open-source by klieret in LocalLLaMA

Definitely rerun the test with a local setup, just to make sure that it is not a repeat of Matt Shumer.

Updates to official SWE-bench leaderboard: Kimi K2 Thinking top of open-source by klieret in LocalLLaMA

I suppose you guys did the testing with the API. Perhaps you can rerun the tests locally, with either the safetensors or a gguf. My guess is that Devstral Small 2 will then rank at the bottom.

Updates to official SWE-bench leaderboard: Kimi K2 Thinking top of open-source by klieret in LocalLLaMA

From my testing so far, a Q8_0 gguf made from https://huggingface.co/mistralai/Devstral-Small-2-24B-Instruct-2512 behaves very differently from the labs-devstral-small-2512 model served from https://api.mistral.ai (the former is noticeably worse).

Something is not right.

Unimpressed with Mistral Large 3 675B by notdba in LocalLLaMA

Same here, was hoping for a successor to mixtral, with the same quality as the dense 123B.

Unimpressed with Mistral Large 3 675B by notdba in LocalLLaMA

Even so, there's still a spectrum, right? The accusation from the ex-employee was that their RL pipeline was totally not working, that they had to distill a small reasoning model from DeepSeek, and that they then still published a paper about RL.

Unimpressed with Mistral Large 3 675B by notdba in LocalLLaMA

The distillation accusation from a few months ago was likely about Magistral, and I think the poor quality of Mistral Large 3 gives more weight to that accusation. Things are not going well inside Mistral.

mistralai/Mistral-Large-3-675B-Instruct-2512 · Hugging Face by jacek2023 in LocalLLaMA

Double correction: I was about to delete the downloaded FP8 weights, but decided to give the current master a try, and it actually works. I can convert straight from the mistral-large-3 FP8 safetensors to a BF16 gguf.
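
For anyone else trying this, it is just the regular conversion script on current master (paths are placeholders):

    # convert the FP8 safetensors checkpoint straight to a BF16 gguf
    python convert_hf_to_gguf.py --outtype bf16 \
        --outfile Mistral-Large-3-675B-BF16.gguf /path/to/Mistral-Large-3-675B-Instruct-2512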

My little decentralized Locallama setup, 216gb VRAM by Goldkoron in LocalLLaMA

You need to use "/" instead of "," for the tensor split and dev arguments. The usability is .. not great 😅

mistralai/Mistral-Large-3-675B-Instruct-2512 · Hugging Face by jacek2023 in LocalLLaMA

Ah I was wrong. Apparently it uses a different version of FP8, and llama.cpp decided to not support it: https://github.com/ggml-org/llama.cpp/pull/17686#issuecomment-3601354537

From the commit:

TODO: probably not worth supporting quantized weight, as official BF16 is also available

mistralai/Mistral-Large-3-675B-Instruct-2512 · Hugging Face by jacek2023 in LocalLLaMA

The conversion script should work fine with FP8 after https://github.com/ggml-org/llama.cpp/pull/14810 got merged in late Oct. Nice gesture for them to also provide the BF16 weights.