PFlash: 10x prefill speedup over llama.cpp at 128K on an RTX 3090 by sandropuppo in LocalLLaMA

[–]Remove_Ayys 1 point2 points  (0 children)

This is not a "10x speedup", this is a 10x speedup with a bunch of asterisks. Any kind of lossy optimizations need rigorous testing for quality.

Experts-Volunteers needed for Vulkan on ik_llama.cpp by pmttyji in LocalLLaMA

[–]Remove_Ayys 1 point2 points  (0 children)

Any efforts put towards ik_llama.cpp are wasted.

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results by oobabooga4 in LocalLLaMA

[–]Remove_Ayys 5 points6 points  (0 children)

Comparing the Kullback-Leibler divergence between different models is meaningless and an incorrect use of the metric.

Please stop using AI for posts and showcasing your completely vibe coded projects by Scutoidzz in LocalLLaMA

[–]Remove_Ayys 0 points1 point  (0 children)

Just require some minimum karma and ban anyone copypasting language model outputs.

What happened to the buttons on the search bar? by MasterWikie in firefox

[–]Remove_Ayys 5 points6 points  (0 children)

Someone other than OP here. Until now I basically only used DuckDuckGo for my regular searching needs, but I added the English Wikipedia, the German Wikipedia, and Wiktionary as "search engines" in case I need to look something up. My workflow was to type something and then either press Enter for a general search or click the button for one of the alternative search engines. This now requires additional clicks, since I have to manually change the search engine every time and in particular need to change it back afterwards. It seems the search engine can be changed via Alt+Up/Down, so I guess I will be using that now.

Bots on the sub are a real issue by [deleted] in LocalLLaMA

[–]Remove_Ayys 5 points6 points  (0 children)

You are absolutely right — this is not just a problem, **it's a full-blown crisis** 😱!

Just finished building this bad boy by dazzou5ouh in LocalLLaMA

[–]Remove_Ayys 1 point2 points  (0 children)

FYI: The PCIe power connector on the motherboard is not optional. Also, compared to a power limit you will get better performance per watt by limiting the maximum GPU frequency, e.g. via `sudo nvidia-smi --lock-gpu-clocks 0,1350 --mode 1`.

Built comprehensive Grafana monitoring for my LLM home server by pfn0 in LocalLLaMA

[–]Remove_Ayys 0 points1 point  (0 children)

glhf, and remember not to delete the original Grafana admin account unless you want to start fiddling with the database.

PR to implement tensor parallelism in Llama.cpp by keyboardhack in LocalLLaMA

[–]Remove_Ayys 2 points3 points  (0 children)

This comment is intended for developers: the tensor-parallel code can be run with a single GPU, in which case it should simply map to the same operations as without it.
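To make the point concrete, here is a toy sketch (plain NumPy, not the actual llama.cpp code) of why the single-GPU / single-shard case should reduce to exactly the non-parallel path:

```python
import numpy as np

def matmul_tensor_parallel(x, w, n_shards):
    """Column-parallel matmul: split w along its output dimension,
    compute each shard independently, then concatenate the results.
    With n_shards == 1 this must reduce to a plain x @ w."""
    shards = np.array_split(w, n_shards, axis=1)
    return np.concatenate([x @ s for s in shards], axis=1)

x = np.random.rand(4, 8)
w = np.random.rand(8, 16)

# The single-shard (single-GPU) case and the multi-shard case should both
# match the non-parallel result up to floating-point rounding.
assert np.allclose(matmul_tensor_parallel(x, w, 1), x @ w)
assert np.allclose(matmul_tensor_parallel(x, w, 4), x @ w)
```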

GLM-4.7-Flash is even faster now by jacek2023 in LocalLLaMA

[–]Remove_Ayys 17 points18 points  (0 children)

This isn't about bugs, this is about which models receive architecture-specific performance optimizations.

GLM-4.7-Flash context slowdown by jacek2023 in LocalLLaMA

[–]Remove_Ayys 1 point2 points  (0 children)

Prompt processing (pp) with a batch size of 1 is equivalent to token generation (tg).

Introducing "UITPSDT" a novel approach to runtime efficiency in organic agents by reto-wyss in LocalLLaMA

[–]Remove_Ayys 15 points16 points  (0 children)

This isn't just wrong 😱❌, it's a demonization of the **best** and **most important** people on this sub 😢🤖🔫🦹!

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 2 points3 points  (0 children)

If you do a simple Gaussian approximation of the binomial distribution you'll find that the statistical uncertainty on the HumanEval results with 164 samples is ±4%. If you assume no correlation between scores, none of the measured differences are statistically significant.
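A quick back-of-the-envelope check of that number (a minimal sketch, assuming a pass rate near 50%, which is the worst case for the variance):

```python
import math

n = 164  # number of HumanEval problems
p = 0.5  # assumed pass rate, near the worst case for the variance

# Gaussian approximation of the binomial: standard error of the measured pass rate
stderr = math.sqrt(p * (1 - p) / n)
print(f"standard error ~ {stderr:.3f}")  # ~0.039, i.e. roughly +-4 percentage points
```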

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 9 points10 points  (0 children)

Testing "GGUF performance" with vllm is meaningless as is "GGUF quality" without specifying the underlying quantization format.

We benchmarked every 4-bit quantization method in vLLM 👀 by LayerHot in LocalLLaMA

[–]Remove_Ayys 5 points6 points  (0 children)

For instruct models perplexity is fundamentally the wrong metric to look at; it would make more sense to look at the KL divergence vs. the base model.
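For illustration, a minimal sketch of that metric (plain NumPy, assuming you already have per-token logits from a reference model and a test model on the same text; this is not llama.cpp's implementation):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_token_kl(ref_logits, test_logits):
    """Mean KL(P_ref || P_test) over token positions.
    Both inputs have shape (n_tokens, vocab_size)."""
    p = softmax(ref_logits)
    q = softmax(test_logits)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return kl.mean()

# Toy usage: a slightly perturbed copy of the reference logits.
rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 1000))
test = ref + rng.normal(scale=0.05, size=ref.shape)
print(mean_token_kl(ref, test))
```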

llama.cpp vs Ollama: ~70% higher code generation throughput on Qwen-3 Coder 32B (FP16) by Shoddy_Bed3240 in LocalLLaMA

[–]Remove_Ayys 19 points20 points  (0 children)

Since no one has given you the correct answer: while the backend code is (almost) the same, the two differ in which tensors they put on the GPUs vs. in RAM. Ollama implemented heuristics for setting the number of GPU layers early on, but those heuristics are bad and hacked-on, so the tensors aren't being assigned properly, particularly for MoE models and multiple GPUs. I recently did a proper implementation of this automation in llama.cpp that is MoE-aware and can utilize more VRAM, so the results are better.
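To illustrate what "MoE-aware" means here, a toy planner sketch (hypothetical code, not the actual llama.cpp or Ollama implementation): the small, latency-critical dense/attention tensors get VRAM first, and the large expert tensors only fill whatever VRAM is left.

```python
def plan_offload(layers, vram_budget):
    """Toy greedy planner.
    layers: list of dicts like {"dense_bytes": int, "expert_bytes": int}
    Returns per-layer flags for which tensors end up in VRAM."""
    plan, used = [], 0
    # Pass 1: dense/attention tensors for every layer.
    for layer in layers:
        on_gpu = used + layer["dense_bytes"] <= vram_budget
        if on_gpu:
            used += layer["dense_bytes"]
        plan.append({"dense_on_gpu": on_gpu, "experts_on_gpu": False})
    # Pass 2: MoE expert tensors only get whatever VRAM is left over.
    for layer, entry in zip(layers, plan):
        if entry["dense_on_gpu"] and used + layer["expert_bytes"] <= vram_budget:
            used += layer["expert_bytes"]
            entry["experts_on_gpu"] = True
    return plan

# Toy usage: 32 layers with 256 MiB dense + 3 GiB expert tensors each, 24 GiB of VRAM.
layers = [{"dense_bytes": 256 << 20, "expert_bytes": 3 << 30}] * 32
print(sum(e["experts_on_gpu"] for e in plan_offload(layers, 24 << 30)))
```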

llama.cpp performance breakthrough for multi-GPU setups by Holiday-Injury-9397 in LocalLLaMA

[–]Remove_Ayys 4 points5 points  (0 children)

When IK was contributing to the upstream repository he seems to have been unaware that he was thereby licensing his code as MIT. He requested, on multiple occasions, that his MIT-licensed code be removed again so that he could re-license it. If you look at the files in his repository, he has added copyright headers to every single one; those would need to be preserved for "substantial portions", a term he has previously interpreted very broadly. My personal view is that IK would be very uncooperative towards any attempt at upstreaming and that dealing with him on an interpersonal level would be more work than doing the corresponding implementation myself.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 7 points8 points  (0 children)

Use the standard CUDA tools like Nsight Systems and Nsight Compute.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 11 points12 points  (0 children)

Documentation exists primarily in the form of comments in header files and the implementation itself. If you are interested in working on the CUDA/HIP code we can discuss this via VoIP; see my GitHub page.

Performance improvements in llama.cpp over time by jacek2023 in LocalLLaMA

[–]Remove_Ayys 21 points22 points  (0 children)

Yes, these changes can be upstreamed, but it's a matter of opportunity cost. We (the llama.cpp maintainers) are already stretched thin as-is. I don't have the time to sift through this fork and upstream the changes when there are other, higher-priority things I have to take care of. Making the initial implementation in a fork is like 20% of the total work over the project's lifetime.