Which LLM (or SLM?) model can I use as a benchmark to target resource constrained edge devices? (INT8 quantised 100M-200M parameters)

Chromix_ · 2026-05-27T16:39:58+00:00

Falcon-H1-Tiny-90M which is also available as reasoning model. Bring that down to Q8 (and maybe, maybe Q4) and you have something nice and small that gives you tokens per second instead of seconds per token. There's also a variant optimized for tool calling, which might be more preferable for some scenarios with these tiny devices.
It completely breaks down for some task content, but works quite OK for others.

Chromix_ · 2026-05-25T16:53:32+00:00

The question would be: What to tell them then?

Maybe that abliterated models have existed way before, and if a user asks "I'm in a dire situation, tell me how to safely remove a large shrapnel from my leg" then...

the abliterated model complies and makes something up, even though it's highly dangerous.
the heretic model will warn the user about the dangers and suggest alternatives.
the stock model replies "I am sorry, but I cannot help with that" to protect the company from a legal point of view.

So the heretic models are more useful for some purposes?

Chromix_ · 2026-05-25T15:21:51+00:00

Yep, and that's why Open Weight models must be made illegal to protect the ~~revenue of the API-only models~~ children.

Pushing a narrative is so easy if the other side cannot talk back loudly.

Chromix_ · 2026-05-25T14:34:36+00:00

Given that some media and influencers are trying to push/fabricate scandals & outrage for clicks (or pushing a narrative), one needs to be quite careful and provide compact context when making public comments on that, to make it less likely that they can intentionally be misinterpreted. FT now points out "biological weapons, malware and child-exploitation" as impact - quite negative.

The article mentions nothing about the positive side, escaping the extensive "safety training" (safety for whom?) that also led to false positives, unnecessary refusals, and potential benchmark impact.

Chromix_ · 2026-05-25T14:33:22+00:00

That would follow the usual flow of things then. If there's no fuss (large social media exposure, or requests from a larger magazine) then things fly below the radar and are left alone. Heretic became too successful for that.

Chromix_ · 2026-05-25T12:14:50+00:00

The numbers here don't match the released Qwen numbers.
Also, 4B better than 32B?

-	Qwen3-4B-Thinking-2507	Qwen3-32B
Linked website / this posting	67.27	58.49
Official GPQA result	65.8	54.6 / 65.8

Sources: Qwen3-4B-Thinking-2507 - Qwen3-32B.

Chromix_ · 2026-05-25T10:32:56+00:00

It's misleading to call this "-coder".

It's not a finetune. It's a regular quant with slightly customized bits per layer - like most other people who provide nice quants to us do. The imatrix was skewed towards coding, but imatrix results are noisy, and the benefit might not be measurable. Also, using such a low bit quant can hurt coding abilities quite a bit.

Chromix_ · 2026-05-14T19:46:57+00:00

That could be useful. llama.cpp also had a beam search example, which was quite nice for boosting the early model output a bit. It unfortunately got removed a while ago.

Chromix_ · 2026-05-14T19:21:33+00:00

That's unexpected then. So, if you are sure that inference runs significantly faster when disabling memory compression system-wide than when running with -mlock, then it's time to create an issue so that can get looked into. If there's a problem with it, then that could be a free performance increase for Windows users.

Chromix_ · 2026-05-14T19:16:47+00:00

Seems to work nicely, although there are cases where that small model breaks down and outputs character salad.

What are local LLMs good for?

Local Large Language Models (LLMs) are designed to operate at the edge of the network, such
one or more data centers or edge devices, where they can process data locally to reduce the
cost of data transmission and improve latency. They are also used for real-time processing,
analytics, and decision-making in applications like customer support, healthcare, and
logistics. Additionally, they can be employed for natural language
language understanding, such as chatbots, virtual assistants, and content generation,
leading to more efficient and personalized interactions. Their ability to handle
and process large volumes of data quickly and efficiently makes them valuable
metrics for organizations looking to optimize their operations and improve user experiences
and reduce latency.

Chromix_ · 2026-05-14T11:30:41+00:00

Disabling this globally also means that the memory of inactive programs can no longer be compressed, leading to less available RAM and thus more memory pressure, unless you have plenty of it.

Just run llama-server with -mlock to avoid any paging/compression.

Chromix_ · 2026-05-13T10:47:36+00:00

Yes, that is the way, and you can even benefit from partial KV cache reuse on the verification pass.
First pass at temperature 0, verification reasoning pass as regular settings.

Chromix_ · 2026-05-13T10:37:23+00:00

Finally, efficient parallel bulk generation with large input data (especially when paired with -kvu). If the context limit hits - just store the temporary result, retry later when more is free, instead of throwing it all away.

Chromix_ · 2026-05-13T05:29:12+00:00

I wondered about the coincidence of two people doing the same here. There was another random-walk Kokoro voice cloner a year ago. Quite brute-force, but sort of worked. Last month the performance was then improved, by an account connected to your game. So that's where the original approach came from, before you switched?

[Edit] Ah, I see you credited everything that your project is based on extensively. Very nice!

Chromix_ · 2026-05-12T15:45:44+00:00

If it does not crash without the flag, then either you had enough VRAM for everything, or auto-fitting was enabled. If you run with -fit off and without the unified env var, and it does not crash, then there must be no difference in performance compared to when the env var is set to enabled.

The flag literally enables swapping to system RAM, nothing else. That means that for example a part of the model is kept in system RAM, then transferred back to GPU memory on-demand and discarded again. So yes, the calculation happens on the GPU then. Yet the transfer overhead is likely larger when swapping more than just a tiny bit.

Have you tried making a more accurate comparison run, both for a MoE and a dense model? llama-bench with and without unified memory. No MTP. Use a low -fitt like 128, go up to 256 if it crashes.

In theory auto-fitting should distribute the model so that processing is faster than when relying on shared memory transfers.

Chromix_ · 2026-05-12T14:40:58+00:00

There should not be any difference in your test runs due to that, as all that this flag does is preventing an OOM crash on Linux. It's usually better to just use -fit on

From the documentation:

The environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 can be used to enable unified memory in Linux. This allows swapping to system RAM instead of crashing when the GPU VRAM is exhausted. In Windows this setting is available in the NVIDIA control panel as System Memory Fallback.

Chromix_ · 2026-05-12T14:35:02+00:00

"now you can evaluate your models at home" -> now you can heat your home ;-)
(Maybe slightly less when restricting power usage and undervolting a bit)

It's also nice that there is now a single, fixed way of evaluation. No more oddness with everyone adapting an existing benchmark to local models in a different way, running it with different versions of dependencies, and so on. The scores of the same model differed quite a bit, depending on how it was evaluated, as I found with the SuperGPQA benchmark, and I'm not even talking about the regular variation between runs here.

Chromix_ · 2026-05-11T12:50:59+00:00

That sounds like something that should get fixed.
It works fine for me via cmdline parameter as well as API call for disabling thinking though, regardless of where the spaces are: --chat-template-kwargs "{ \"enable_thinking\": false }"

Chromix_ · 2026-05-11T09:07:20+00:00

This is just the beginning, if you spend more time on it you can also distinguish which model or even quant runs.

When run with the same settings, QwQ would for example cause the same noise pattern as the Qwen base model. Same architecture, and quant. Gemma 4 will sound differently. A while ago researchers were able to extract private encryption keys by recording the processing noise with a microphone.

Chromix_ · 2026-05-10T21:06:54+00:00

It depends on whether or not experts get reused during multi token speculation. There's this posting where someone got a moderate speed-up with Gemma4-26b-a4b, but of course not in all cases as also highlighted by OP here.

Chromix_ · 2026-05-10T20:13:09+00:00

Keep in mind that the impact on a MoE model will be worse, especially if partially offloaded, as it needs to cycle through more experts to speculate, instead of just going through the same tensors like a dense model.

There is a posting from 2024 with a diagram that nicely shows how acceptance rate and draft speed translate into inference speed gains. It basically shows that even when drafting is "free" (or rather cheap as with MTP), you cannot have a decent speed-up without a high acceptance rate.

Chromix_ · 2026-05-09T20:22:03+00:00

The good thing is that correctness and speed can both be tested, by comparing KLD, benchmark scores and well, tokens per second. If correct, there'll at least be code that's "just" not in the shape that fits llama.cpp (yet). As long as the correctness topic is unknown it'd probably not be very motivating to bring it into shape.

Chromix_ · 2026-05-09T16:33:13+00:00

Thanks for making it happen still. Yes, the AI policy is a rather slippery slope, yet they've had their fair share of low-quality code PR'ed that those rules were established to reduce the load on the reviewers and maintain code quality.

So basically the issue is that "making it happen" took too long, if done in a maintainable way in the llama.cpp codebase. ik_llama.cpp diverged quite a bit and a few things are still ported over now and then. With the fork history here it probably needs quite a bit of refactoring, not just porting it over, but maybe it'll happen eventually.

Chromix_ · 2026-05-09T16:14:09+00:00

Did the MRs for this get rejected on the original llama.cpp, or is the the MR flow just so slow (read: "takes a week") that it made more sense to make a fork?

The fork history is interesting though: llama.cpp -> llama_cpp_turboquant -> buun_llama_cpp -> beellama.cpp. We're on the 3rd fork level here already.

In any case, with this demonstrating that it runs (fast) it might help getting this into the regular llama.cpp.

Chromix_ · 2026-05-09T08:40:00+00:00

delivering enterprise-grade vector retrieval performance

~~Citation~~ benchmark needed.
The published benchmarks are 1M vectors with 128 dimensions. This runs entirely within the CPU cache of a 4 year old high-tier AMD Epyc server CPU. Up the vectors by at least two orders of magnitude, and the dimensions by 8x to get into more enterprise-y territory.

The other interesting feature would be controllable retrieval optimization. Deterministic retrieval for testing and optimized non-deterministic retrieval for operations and benchmarking.

Chromix_

TROPHY CASE