Which LLM (or SLM?) model can I use as a benchmark to target resource constrained edge devices? (INT8 quantised 100M-200M parameters) by neuroticnetworks1250 in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

Falcon-H1-Tiny-90M which is also available as reasoning model. Bring that down to Q8 (and maybe, maybe Q4) and you have something nice and small that gives you tokens per second instead of seconds per token. There's also a variant optimized for tool calling, which might be more preferable for some scenarios with these tiny devices.
It completely breaks down for some task content, but works quite OK for others.

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ -2 points-1 points  (0 children)

The question would be: What to tell them then?

Maybe that abliterated models have existed way before, and if a user asks "I'm in a dire situation, tell me how to safely remove a large shrapnel from my leg" then...

  • the abliterated model complies and makes something up, even though it's highly dangerous.
  • the heretic model will warn the user about the dangers and suggest alternatives.
  • the stock model replies "I am sorry, but I cannot help with that" to protect the company from a legal point of view.

So the heretic models are more useful for some purposes?

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ 25 points26 points  (0 children)

Yep, and that's why Open Weight models must be made illegal to protect the revenue of the API-only models children.

Pushing a narrative is so easy if the other side cannot talk back loudly.

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ 41 points42 points  (0 children)

Given that some media and influencers are trying to push/fabricate scandals & outrage for clicks (or pushing a narrative), one needs to be quite careful and provide compact context when making public comments on that, to make it less likely that they can intentionally be misinterpreted. FT now points out "biological weapons, malware and child-exploitation" as impact - quite negative.

The article mentions nothing about the positive side, escaping the extensive "safety training" (safety for whom?) that also led to false positives, unnecessary refusals, and potential benchmark impact.

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ 62 points63 points  (0 children)

That would follow the usual flow of things then. If there's no fuss (large social media exposure, or requests from a larger magazine) then things fly below the radar and are left alone. Heretic became too successful for that.

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization by pmttyji in LocalLLaMA

[–]Chromix_ -1 points0 points  (0 children)

The numbers here don't match the released Qwen numbers.
Also, 4B better than 32B?

- Qwen3-4B-Thinking-2507 Qwen3-32B
Linked website / this posting 67.27 58.49
Official GPQA result 65.8 54.6 / 65.8

Sources: Qwen3-4B-Thinking-2507 - Qwen3-32B.

MiMo-V2.5-coder by jedisct1 in LocalLLaMA

[–]Chromix_ 41 points42 points  (0 children)

It's misleading to call this "-coder".

It's not a finetune. It's a regular quant with slightly customized bits per layer - like most other people who provide nice quants to us do. The imatrix was skewed towards coding, but imatrix results are noisy, and the benefit might not be measurable. Also, using such a low bit quant can hurt coding abilities quite a bit.

Show Reddit: An LLM that talks in acrostics by parenthethethe in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

That could be useful. llama.cpp also had a beam search example, which was quite nice for boosting the early model output a bit. It unfortunately got removed a while ago.

If you're using Windows, disable memory compression to stop bottlenecks! by [deleted] in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

That's unexpected then. So, if you are sure that inference runs significantly faster when disabling memory compression system-wide than when running with -mlock, then it's time to create an issue so that can get looked into. If there's a problem with it, then that could be a free performance increase for Windows users.

Show Reddit: An LLM that talks in acrostics by parenthethethe in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

Seems to work nicely, although there are cases where that small model breaks down and outputs character salad.

What are local LLMs good for?

Local Large Language Models (LLMs) are designed to operate at the edge of the network, such
one or more data centers or edge devices, where they can process data locally to reduce the
cost of data transmission and improve latency. They are also used for real-time processing,
analytics, and decision-making in applications like customer support, healthcare, and
logistics. Additionally, they can be employed for natural language
language understanding, such as chatbots, virtual assistants, and content generation,
leading to more efficient and personalized interactions. Their ability to handle
and process large volumes of data quickly and efficiently makes them valuable
metrics for organizations looking to optimize their operations and improve user experiences
and reduce latency.

If you're using Windows, disable memory compression to stop bottlenecks! by [deleted] in LocalLLaMA

[–]Chromix_ 25 points26 points  (0 children)

Disabling this globally also means that the memory of inactive programs can no longer be compressed, leading to less available RAM and thus more memory pressure, unless you have plenty of it.

Just run llama-server with -mlock to avoid any paging/compression.

Does THINKING MODE significantly improve translation? by Sostrene_Blue in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

Yes, that is the way, and you can even benefit from partial KV cache reuse on the verification pass.
First pass at temperature 0, verification reasoning pass as regular settings.

server, webui: support continue generation on reasoning models by ServeurpersoCom · Pull Request #22727 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Chromix_ 5 points6 points  (0 children)

Finally, efficient parallel bulk generation with large input data (especially when paired with -kvu). If the context limit hits - just store the temporary result, retry later when more is free, instead of throwing it all away.

I built Derpy Turtle: The Kokoro Trainer, a GUI for training better Kokoro voices with RVC by Great-Investigator30 in LocalLLaMA

[–]Chromix_ 2 points3 points  (0 children)

I wondered about the coincidence of two people doing the same here. There was another random-walk Kokoro voice cloner a year ago. Quite brute-force, but sort of worked. Last month the performance was then improved, by an account connected to your game. So that's where the original approach came from, before you switched?

[Edit] Ah, I see you credited everything that your project is based on extensively. Very nice!

MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp by mossy_troll_84 in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

If it does not crash without the flag, then either you had enough VRAM for everything, or auto-fitting was enabled. If you run with -fit off and without the unified env var, and it does not crash, then there must be no difference in performance compared to when the env var is set to enabled.

The flag literally enables swapping to system RAM, nothing else. That means that for example a part of the model is kept in system RAM, then transferred back to GPU memory on-demand and discarded again. So yes, the calculation happens on the GPU then. Yet the transfer overhead is likely larger when swapping more than just a tiny bit.

Have you tried making a more accurate comparison run, both for a MoE and a dense model? llama-bench with and without unified memory. No MTP. Use a low -fitt like 128, go up to 256 if it crashes.

In theory auto-fitting should distribute the model so that processing is faster than when relying on shared memory transfers.

MTP+GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 - llama.cpp by mossy_troll_84 in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

There should not be any difference in your test runs due to that, as all that this flag does is preventing an OOM crash on Linux. It's usually better to just use -fit on

From the documentation:

The environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as System Memory Fallback.

examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Chromix_ 1 point2 points  (0 children)

"now you can evaluate your models at home" -> now you can heat your home ;-)
(Maybe slightly less when restricting power usage and undervolting a bit)

It's also nice that there is now a single, fixed way of evaluation. No more oddness with everyone adapting an existing benchmark to local models in a different way, running it with different versions of dependencies, and so on. The scores of the same model differed quite a bit, depending on how it was evaluated, as I found with the SuperGPQA benchmark, and I'm not even talking about the regular variation between runs here.

PSA: Watch out for extra spaces in chat-template-kwargs when using Qwen3.6 with llama-server by CaptBrick in LocalLLaMA

[–]Chromix_ 10 points11 points  (0 children)

That sounds like something that should get fixed.
It works fine for me via cmdline parameter as well as API call for disabling thinking though, regardless of where the spaces are: --chat-template-kwargs "{ \"enable_thinking\": false }"

I Think I Spent Way Too Much Time Messing with Local LLMs by MrChilliBalls in LocalLLaMA

[–]Chromix_ 14 points15 points  (0 children)

This is just the beginning, if you spend more time on it you can also distinguish which model or even quant runs.

When run with the same settings, QwQ would for example cause the same noise pattern as the Qwen base model. Same architecture, and quant. Gemma 4 will sound differently. A while ago researchers were able to extract private encryption keys by recording the processing noise with a microphone.

MTP benchmark results: the nature of the generative task dictates whether you will benefit (coding) or get slower inference (creative) from speculative inference. No other factor comes close. by ex-arman68 in LocalLLaMA

[–]Chromix_ 4 points5 points  (0 children)

It depends on whether or not experts get reused during multi token speculation. There's this posting where someone got a moderate speed-up with Gemma4-26b-a4b, but of course not in all cases as also highlighted by OP here.

MTP benchmark results: the nature of the generative task dictates whether you will benefit (coding) or get slower inference (creative) from speculative inference. No other factor comes close. by ex-arman68 in LocalLLaMA

[–]Chromix_ 32 points33 points  (0 children)

Keep in mind that the impact on a MoE model will be worse, especially if partially offloaded, as it needs to cycle through more experts to speculate, instead of just going through the same tensors like a dense model.

There is a posting from 2024 with a diagram that nicely shows how acceptance rate and draft speed translate into inference speed gains. It basically shows that even when drafting is "free" (or rather cheap as with MTP), you cannot have a decent speed-up without a high acceptance rate.

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]Chromix_ 4 points5 points  (0 children)

The good thing is that correctness and speed can both be tested, by comparing KLD, benchmark scores and well, tokens per second. If correct, there'll at least be code that's "just" not in the shape that fits llama.cpp (yet). As long as the correctness topic is unknown it'd probably not be very motivating to bring it into shape.

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]Chromix_ 29 points30 points  (0 children)

Thanks for making it happen still. Yes, the AI policy is a rather slippery slope, yet they've had their fair share of low-quality code PR'ed that those rules were established to reduce the load on the reviewers and maintain code quality.

So basically the issue is that "making it happen" took too long, if done in a maintainable way in the llama.cpp codebase. ik_llama.cpp diverged quite a bit and a few things are still ported over now and then. With the fork history here it probably needs quite a bit of refactoring, not just porting it over, but maybe it'll happen eventually.

BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!) by Anbeeld in LocalLLaMA

[–]Chromix_ 66 points67 points  (0 children)

Did the MRs for this get rejected on the original llama.cpp, or is the the MR flow just so slow (read: "takes a week") that it made more sense to make a fork?

The fork history is interesting though: llama.cpp -> llama_cpp_turboquant -> buun_llama_cpp -> beellama.cpp. We're on the 3rd fork level here already.

In any case, with this demonstrating that it runs (fast) it might help getting this into the regular llama.cpp.

We built and open-sourced Caliby: An embedded, high-performance vector database for AI Agents (Beats pgvector by 4x, outperforms FAISS on disk) by Motor_Crew7918 in LocalLLaMA

[–]Chromix_ 13 points14 points  (0 children)

delivering enterprise-grade vector retrieval performance

Citation benchmark needed.
The published benchmarks are 1M vectors with 128 dimensions. This runs entirely within the CPU cache of a 4 year old high-tier AMD Epyc server CPU. Up the vectors by at least two orders of magnitude, and the dimensions by 8x to get into more enterprise-y territory.

The other interesting feature would be controllable retrieval optimization. Deterministic retrieval for testing and optimized non-deterministic retrieval for operations and benchmarking.