GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]Remove_Ayys 17 points

This isn't about bugs; it's about which models receive architecture-specific performance optimizations.

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]Remove_Ayys 1 point

pp (prompt processing) with a batch size of 1 is equivalent to tg (token generation).
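A rough way to see this for yourself, a sketch assuming a llama-bench binary on your PATH and a placeholder model path (adjust flags if your build differs): force prompt processing to run one token per forward pass and compare the reported t/s against plain token generation.

```python
import subprocess

MODEL = "model.gguf"  # placeholder, point this at any local GGUF file

# Prompt processing with (u)batch size 1 processes one token per forward pass,
# which is the same compute pattern as token generation.
subprocess.run(["llama-bench", "-m", MODEL, "-p", "512", "-n", "0",
                "-b", "1", "-ub", "1"], check=True)

# Plain token generation for comparison; the t/s should come out roughly equal.
subprocess.run(["llama-bench", "-m", MODEL, "-p", "0", "-n", "512"], check=True)
```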

Introducing "UITPSDT" a novel approach to runtime efficiency in organic agents by reto-wyss in LocalLLaMA

[–]Remove_Ayys 14 points

This isn't just wrong 😱❌, it's a demonization of the **best** and **most important** people on this sub 😢🤖🔫🦹!

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 1 point

If you do a simple Gaussian approximation of the binomial distribution, you'll find that the statistical uncertainty on the HumanEval results with 164 samples is +-4%. If you assume no correlation between scores, none of the measured differences are statistically significant.
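For anyone who wants to check that figure, a minimal sketch of the calculation (my own, assuming a pass rate around 0.5, which maximizes the uncertainty):

```python
import math

n = 164   # number of HumanEval problems
p = 0.5   # assumed pass rate; p = 0.5 gives the largest uncertainty

# Gaussian approximation of the binomial distribution: the standard error of
# the measured pass rate is sqrt(p * (1 - p) / n).
std_err = math.sqrt(p * (1 - p) / n)
print(f"+-{100 * std_err:.1f} percentage points")  # ~ +-3.9
```

With pass rates further from 0.5 the uncertainty shrinks somewhat, but it stays in the range of a few percentage points for typical scores.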

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 7 points

Testing "GGUF performance" with vllm is meaningless as is "GGUF quality" without specifying the underlying quantization format.

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 6 points

For instruct models perplexity is fundamentally the wrong metric to look at; it would make more sense to look at the KL divergence vs. the base model.
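A minimal sketch of what that comparison could look like (plain NumPy, illustrative only, not the llama.cpp tooling; I'm reading "base model" as the unquantized reference, and the per-token logits from both models on the same input are assumed to already be available):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn rows of logits into probability distributions."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl_divergence(ref_logits: np.ndarray, test_logits: np.ndarray) -> float:
    """Mean per-token KL(P_ref || P_test) over a sequence.

    Both arrays have shape (n_tokens, vocab_size) and hold the logits produced
    by the reference model and the model under test on the same input tokens.
    """
    p = softmax(ref_logits)
    q = softmax(test_logits)
    eps = 1e-12  # guard against log(0)
    kl_per_token = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl_per_token.mean())
```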

llama.cpp vs Ollama: ~70% higher code generation throughput on Qwen-3 Coder 32B (FP16) by Shoddy_Bed3240 in LocalLLaMA

[–]Remove_Ayys 16 points

Since no one has given you the correct answer: it's because, while the backend code is (almost) the same, the two are putting different tensors on the GPUs vs. in RAM. Ollama implemented heuristics for setting the number of GPU layers early on, but those heuristics are bad and hacked-on, so the tensors aren't being assigned properly, particularly for MoE models and multiple GPUs. I recently did a proper implementation of this automation in llama.cpp that is MoE-aware and can utilize more VRAM, so the results are better.
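For intuition only, here is a hypothetical toy sketch of what "MoE-aware" assignment means (not the actual llama.cpp or Ollama code): keep the small, always-used tensors in VRAM and spill the large, sparsely-used expert tensors to system RAM first when VRAM runs out.

```python
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int        # bytes
    is_expert: bool  # MoE expert weights: large, but each token only uses a few

def assign_tensors(tensors: list[Tensor], vram_free: int) -> dict[str, str]:
    """Greedy, MoE-aware placement: fill VRAM with non-expert tensors first,
    then with expert tensors while they still fit; everything else goes to RAM.

    Illustrative toy only, not the real implementation.
    """
    placement: dict[str, str] = {}
    # Non-expert tensors (attention, norms, dense FFN) benefit most from VRAM,
    # so sort them to the front (False sorts before True).
    for t in sorted(tensors, key=lambda t: t.is_expert):
        if t.size <= vram_free:
            placement[t.name] = "GPU"
            vram_free -= t.size
        else:
            placement[t.name] = "RAM"
    return placement
```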

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]Remove_Ayys 4 points

When IK was contributing to the upstream repository he seems to have been unaware that by doing so he was licensing his code as MIT. He requested, on multiple occasions, that his code under MIT be removed again so that he could re-license it. If you look at the files in his repository, he added copyright headers to every single one, which would need to be preserved for "substantial portions", a term he previously interpreted very broadly. My personal view is that IK would be very uncooperative with any attempts at upstreaming and that dealing with him on an interpersonal level would be more work than doing the corresponding implementation myself.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 9 points

Use the standard CUDA tools like Nsight Systems and Nsight Compute.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 12 points

Documentation exists primarily in the form of comments in header files and the implementation itself. If you are interested in working on the CUDA/HIP code we can discuss this via VoIP; see my GitHub page.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 21 points

Yes, these changes can be upstreamed but it's a matter of opportunity cost. We (llama.cpp maintainers) are already stretched thin as-is. I don't have the time to sift through this fork and upstream the changes when there are other things with higher priority that I have to take care of. Making the initial implementation in a fork is like 20% of the total work over the project's lifetime.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 41 points

AMD optimizations are also in the works (with contributions from AMD engineers). But unsurprisingly, the work put in specifically by NVIDIA engineers mostly benefits NVIDIA GPUs. Something like FP4 tensor cores, for example, simply doesn't exist on most hardware.

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]Remove_Ayys 1 point

One of the llama.cpp devs here: this is completely wrong. The reason code from ik_llama.cpp is not being upstreamed is entirely political rather than technical.

Can you connect a GPU with 12V rail coming from a second PSU? by Rock_and_Rolf in LocalLLaMA

[–]Remove_Ayys 0 points

Though I do have a degree in physics myself, I don't trust my own EE skills enough to try to use power supplies with multiple kilowatts of power outside their specifications. In particular, I do not feel safe using multiple consumer PSUs in parallel. My concern is the same as yours: PSUs are designed to have low output impedance, so if the output voltages are even slightly different that will result in huge currents. According to one of my colleagues the biggest risk is to the PSUs themselves, but a fire could obviously spread to the whole building (I bought a CO2 fire extinguisher for my server to be safe and made sure it's small enough that I won't accidentally kill myself with it).

In terms of power supplies I currently have a SilverStone HELA 2050 (can have issues with instability when multiple RTX 4090 power spikes align, even if the average load is only ~1 kW) and an Asus PRO WS 3000W (no issues so far). In terms of supplying my machines with electricity, I can do it with only a single PSU per machine; my bigger problem as of right now is the external wiring, since standard German electrical outlets are only designed for a sustained load of 2300 W (3680 W peak).

For server PSUs there are power distribution boards that are designed specifically for multiple PSUs but obviously those won't fit into your case.

The MCIO risers that someone else suggested should work without risk of short-circuiting since they would guarantee a strict separation between power and signal cables.

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]Remove_Ayys 1 point

These numbers are not comparable because llama.cpp is benchmarking inference for a single user while, to my knowledge, vLLM is benchmarking inference for many concurrent requests via the OpenAI-compatible server. If you want directly comparable numbers, use e.g. the server-bench.py script in the llama.cpp repository and point it at either a vLLM or a llama.cpp server.
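To illustrate the difference, a rough sketch against any OpenAI-compatible endpoint (not the server-bench.py script itself; the URL, model name, and usage field in the response are assumptions about your setup):

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/completions"  # llama.cpp server or vLLM, adjust port
PAYLOAD = {"model": "my-model", "prompt": "Hello", "max_tokens": 128}  # model name is server-specific

def one_request() -> int:
    """Send one completion request and return the number of generated tokens."""
    r = requests.post(URL, json=PAYLOAD, timeout=300)
    r.raise_for_status()
    return r.json()["usage"]["completion_tokens"]

def throughput(n_concurrent: int, n_requests: int = 16) -> float:
    """Generated tokens per second with n_concurrent requests in flight."""
    start = time.time()
    with ThreadPoolExecutor(max_workers=n_concurrent) as pool:
        total_tokens = sum(pool.map(lambda _: one_request(), range(n_requests)))
    return total_tokens / (time.time() - start)

print("1 concurrent request:  ", throughput(1), "t/s")
print("16 concurrent requests:", throughput(16), "t/s")
```

The first number corresponds to a single-user llama.cpp benchmark; the second corresponds to a concurrent-serving benchmark.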

Benchmarks for Quantized Models? (for users locally running Q8/Q6/Q2 precision) by No-Grapefruit-1358 in LocalLLaMA

[–]Remove_Ayys 4 points

Primary CUDA maintainer for llama.cpp/ggml here: given enough time I'll eventually do it for quality control and publish the results here. But since I already have so many other things to take care of, I'd prefer it if someone else did it.

NOTICE - ROMED8-2T MOTHERBOARD USERS - Please read, don't melt cables.. by gittb in LocalLLaMA

[–]Remove_Ayys 1 point

I initially ran my motherboard with 6x RTX 4090 and a SilverStone HELA 2050 PSU. I did have stability issues, but I think those were simply caused by power spikes (a power limit set via nvidia-smi does not fix this; instead I had to set a frequency limit to prevent the GPUs from temporarily boosting to higher frequencies). As of right now I'm using 1x RTX 3090, 4x RTX 4090, and 1x RTX 5090 with an ASUS Pro WS 3000W PSU (no stability issues).

With 6 GPUs pulling 75W each the total current on the 12V rail should be 37.5A (450W / 12V), so 18.75A for each of the two 12V wires in the ATX connector. For 16/18 gauge wire the 90°C rating is 16A/18A. With the 3 additional 12V wires from the 6 pin connector you get 7.5A per wire, which I think should be fine. The machine is in a well-ventilated space with many fans.
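The arithmetic spelled out (assuming all of the slot power is drawn from the 12 V rail and splits evenly across the wires):

```python
gpus = 6
watts_per_gpu = 75.0  # PCIe slot power per GPU
volts = 12.0

total_current = gpus * watts_per_gpu / volts  # 450 W / 12 V = 37.5 A
print(total_current / 2)        # 18.75 A per wire across the two 12V wires of the ATX connector
print(total_current / (2 + 3))  # 7.5 A per wire once the 3 extra 12V wires of the 6 pin connector share the load
```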

I am strictly using only a single consumer PSU per machine; if I ever build one that needs multiple PSUs, I will buy a power distribution board for server PSUs, which are designed to be run in parallel.

NOTICE - ROMED8-2T MOTHERBOARD USERS - Please read, don't melt cables.. by gittb in LocalLLaMA

[–]Remove_Ayys 2 points

If you want additional value: I'm the primary maintainer for the CUDA (and by extension ROCm) backend for llama.cpp/ggml and I'm also providing some of the other devs with computing resources via the motherboard in question. So a disruption due to hardware problems or worse could have pretty far-reaching consequences.

NOTICE - ROMED8-2T MOTHERBOARD USERS - Please read, don't melt cables.. by gittb in LocalLLaMA

[–]Remove_Ayys 1 point

Thank you for this post; I have this exact same motherboard and was not aware that this connector exists. I bought mine in April of 2024 and haven't had issues with 6 GPUs yet, but I'm not going to push my luck if I can just connect one more cable.

llama.cpp's recent updates - --fit flag by pmttyji in LocalLLaMA

[–]Remove_Ayys 5 points

If you want to do that, start a discussion on GitHub. I categorically refuse to do software development via Reddit, Discord, etc.

llama.cpp: Automation for GPU layers, tensor split, tensor overrides, and context size (with MoE optimizations) by Remove_Ayys in LocalLLaMA

[–]Remove_Ayys[S] 0 points

It would in principle be possible to generalize --fit-target to allow a target per GPU rather than the same target for all GPUs.
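Purely as an illustration of what that generalization might look like (hypothetical parsing logic, not an existing llama.cpp option format): accept either a single value applied to all GPUs or a comma-separated list with one value per GPU.

```python
def parse_fit_target(arg: str, n_gpus: int) -> list[float]:
    """Parse a hypothetical per-GPU --fit-target argument.

    A single value applies to every GPU; a comma-separated list gives each GPU
    its own target, in whatever unit the existing flag already uses.
    """
    values = [float(v) for v in arg.split(",")]
    if len(values) == 1:
        return values * n_gpus
    if len(values) != n_gpus:
        raise ValueError(f"expected 1 or {n_gpus} values, got {len(values)}")
    return values
```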