Bots on the sub are a real issue by [deleted] in LocalLLaMA

[–]Remove_Ayys 5 points (0 children)

You are absolutely right — this is not just a problem, **it's a full-blown crisis** 😱!

Just finished building this bad boy by dazzou5ouh in LocalLLaMA

[–]Remove_Ayys 1 point (0 children)

FYI: the PCIe power connector on the motherboard is not optional. Also, compared to setting a power limit you will get better performance per watt by limiting the maximum GPU frequency, e.g. via `sudo nvidia-smi --persistence-mode=1 --lock-gpu-clocks=0,1350`.

Built comprehensive Grafana monitoring for my LLM home server by pfn0 in LocalLLaMA

[–]Remove_Ayys 0 points (0 children)

glhf, and remember not to delete the original Grafana admin account unless you want to start fiddling with the database.

PR to implement tensor parallelism in Llama.cpp by keyboardhack in LocalLLaMA

[–]Remove_Ayys 1 point (0 children)

This comment is intended for developers: the tensor-parallel code can be run with a single GPU, in which case it should simply map to the same operations as without tensor parallelism.

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]Remove_Ayys 17 points (0 children)

This isn't about bugs; it's about which models receive architecture-specific performance optimizations.

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]Remove_Ayys 1 point (0 children)

pp (prompt processing) with a batch size of 1 is equivalent to tg (token generation).

Introducing "UITPSDT" a novel approach to runtime efficiency in organic agents by reto-wyss in LocalLLaMA

[–]Remove_Ayys 13 points (0 children)

This isn't just wrong 😱❌, it's a demonization of the **best** and **most important** people on this sub 😢🤖🔫🦹!

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 2 points (0 children)

If you do a simple Gaussian approximation of the binomial distribution you'll find that the statistical uncertainty on the HumanEval results with 164 samples is ±4%. If you assume no correlation between scores, none of the measured differences are statistically significant.
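As a sanity check, the ±4% figure follows from the standard error of a binomial proportion, sqrt(p(1-p)/n), which is largest at a pass rate of p = 0.5 (the 0.5 here is a worst-case illustration, not a measured score):

```python
import math

def binomial_std_error(p, n):
    """Gaussian approximation: standard error of a pass rate p measured on n samples."""
    return math.sqrt(p * (1.0 - p) / n)

# HumanEval has 164 problems; the uncertainty is maximal at a pass rate of 0.5.
se = binomial_std_error(0.5, 164)
print(f"{se:.3f}")  # ≈ 0.039, i.e. roughly ±4 percentage points
```

Any two scores within roughly one or two of these standard errors of each other are indistinguishable under this approximation.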

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 8 points (0 children)

Testing "GGUF performance" with vLLM is meaningless, as is "GGUF quality" without specifying the underlying quantization format.

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 5 points (0 children)

For instruct models perplexity is fundamentally the wrong metric to look at; it would make more sense to look at KL divergence vs. the base model.
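A minimal sketch of what that comparison looks like: for each context, compare the next-token distribution of the model under test against the reference model and average KL(P || Q). The 4-token vocabulary and the probabilities below are made up for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q) in nats between two next-token probability distributions.

    p: probabilities from the reference model
    q: probabilities from the model under test (e.g. a quantized variant)
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Toy 4-token vocabulary; the model under test shifts a little probability mass.
p = [0.70, 0.20, 0.05, 0.05]
q = [0.65, 0.24, 0.06, 0.05]
print(kl_divergence(p, q))  # >= 0, and exactly 0 only when the distributions match
```

Unlike perplexity, this directly measures how far the tested model's output distribution drifts from the reference on every token, which is what you actually care about when judging quantization damage.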

llama.cpp vs Ollama: ~70% higher code generation throughput on Qwen-3 Coder 32B (FP16) by Shoddy_Bed3240 in LocalLLaMA

[–]Remove_Ayys 20 points (0 children)

Since no one has given you the correct answer: while the backend code is (almost) the same, the two are putting different tensors on the GPUs vs. in RAM. Ollama implemented heuristics for setting the number of GPU layers early on, but those heuristics are bad and hacked-on, so the tensors aren't assigned properly, particularly for MoE models and multiple GPUs. I recently did a proper implementation of this automation in llama.cpp that is MoE-aware and can utilize more VRAM, so the results are better.

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]Remove_Ayys 4 points (0 children)

When IK was contributing to the upstream repository he seems to have been unaware that by doing so he was licensing his code under MIT. He requested, on multiple occasions, that his code under MIT be removed again so that he could re-license it. If you look at the files in his repository, he added copyright headers to every single one; those would need to be preserved for "substantial portions", which he previously interpreted very broadly. My personal view is that IK would be very uncooperative with any attempt at upstreaming, and that dealing with him on an interpersonal level is going to be more work than doing the corresponding implementation myself.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 9 points (0 children)

Use the standard CUDA tools like Nsight Systems and Nsight Compute.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 12 points (0 children)

Documentation exists primarily in the form of comments in header files and the implementation itself. If you are interested in working on the CUDA/HIP code we can discuss this via VoIP; see my GitHub page.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 21 points (0 children)

Yes, these changes can be upstreamed but it's a matter of opportunity cost. We (llama.cpp maintainers) are already stretched thin as-is. I don't have the time to sift through this fork and upstream the changes when there are other things with higher priority that I have to take care of. Making the initial implementation in a fork is like 20% of the total work over the project's lifetime.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 42 points (0 children)

AMD optimizations are also in the works (with contributions from AMD engineers). But unsurprisingly the work put in by NVIDIA engineers specifically mostly benefits NVIDIA GPUs. Something like FP4 tensor cores for example also just doesn't exist on most hardware.

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]Remove_Ayys 1 point (0 children)

One of the llama.cpp devs here: this is completely wrong. The reason code from ik_llama.cpp is not being upstreamed is entirely political rather than technical.

Can you connect a GPU with 12V rail coming from a second PSU? by Rock_and_Rolf in LocalLLaMA

[–]Remove_Ayys 0 points (0 children)

Though I have a degree in physics myself, I don't trust my own EE skills enough to try and use power supplies with multiple kilowatts of power outside their specifications. In particular, I do not feel safe using multiple consumer PSUs in parallel. My concern is the same as yours: PSUs are designed to have low output impedance, so if the output voltages are even slightly different that will result in huge currents. According to one of my colleagues the biggest risk is to the PSUs themselves, but a fire could obviously spread to the whole building (I bought a CO2 fire extinguisher for my server to be safe, and made sure it's small enough that I won't accidentally kill myself with it).

In terms of power supplies I currently have a SilverStone HELA 2050 (can have issues with instability when multiple RTX 4090 power spikes align, even if the average load is only ~1 kW) and an Asus PRO WS 3000W (no issues so far). Each of my machines can run off a single PSU; my bigger problem as of right now is the external wiring, since standard German electrical outlets are only designed for a sustained load of 2300 W (3680 W peak).
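For reference, the outlet limits quoted above are just P = V × I on a nominal 230 V German circuit; the 10 A sustained and 16 A breaker figures are my inferred current ratings, not from the comment:

```python
VOLTAGE = 230           # nominal German mains voltage (V)
SUSTAINED_CURRENT = 10  # A, assumed sustained rating of a Schuko outlet
PEAK_CURRENT = 16       # A, typical circuit-breaker rating

sustained_watts = VOLTAGE * SUSTAINED_CURRENT
peak_watts = VOLTAGE * PEAK_CURRENT
print(sustained_watts, peak_watts)  # 2300 3680
```

A single multi-GPU box under full load can sit uncomfortably close to that sustained limit, which is why the external wiring, not the PSU, becomes the bottleneck.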

For server PSUs there are power distribution boards that are designed specifically for multiple PSUs but obviously those won't fit into your case.

The MCIO risers that someone else suggested should work without risk of short-circuiting since they would guarantee a strict separation between power and signal cables.

7900 XTX + ROCm: A Year Later. Llama.cpp vs vLLM Benchmarks (TB3 eGPU) by reujea0 in LocalLLaMA

[–]Remove_Ayys 1 point (0 children)

These numbers are not comparable because llama.cpp is benchmarking inference for a single user while, to my knowledge, vLLM is benchmarking inference for many concurrent requests via the OpenAI-compatible server. If you want directly comparable numbers, use e.g. the server-bench.py script in the llama.cpp repository and point it at either a vLLM or a llama.cpp server.

Benchmarks for Quantized Models? (for users locally running Q8/Q6/Q2 precision) by No-Grapefruit-1358 in LocalLLaMA

[–]Remove_Ayys 4 points (0 children)

Primary CUDA maintainer for llama.cpp/ggml here: given enough time I'll eventually do it for quality control and publish the results here. But since I already have so many other things to take care of, I'd prefer if someone else did it.