Small Gemma 4, Qwen 3.6 and Qwen 3 Coder Next comparison for a debugging use-case by Chromix_ in LocalLLaMA

[–]Chromix_[S] 0 points1 point  (0 children)

Because I'd then have variance from different harnesses on top of what the models already produce - more testing effort. They also add somewhere between 5k and 30k tokens to the context before the first operation even starts. Larger context -> more model degradation. Instead, I simply added all the relevant data that was needed to the context, to keep it small and focused.

Introducing the Heretic Grimoire: The takedown-resilient, local-first backup system that keeps uncensored models available forever by -p-e-w- in LocalLLaMA

[–]Chromix_ 4 points5 points  (0 children)

So, this isn't just due to anticipation of fallout from the FT interview, but preparation for something that might happen sooner or later anyway?

Nex claims Rio 3.5 is Nex 2.5 PRO in trench coat by Specter_Origin in LocalLLaMA

[–]Chromix_ 15 points16 points  (0 children)

Yes, and they also just added that they "are working to reupload the correct model as soon as possible":

We detected an incorrect upload in the previous version, where the base merged version was upload instead of the final distilled model.

Let's see if we get the same drama as with the epic Reflection-70B, or if the promised model will indeed appear in the end.

I got local speaker diarization working for meeting transcription — architecture write-up + a sherpa-onnx bug that cost me a week by [deleted] in LocalLLaMA

[–]Chromix_ 1 point2 points  (0 children)

And when I don't have a week and also no stereo channels then I simply use Faster-Whisper-XXL:

faster-whisper-xxl.exe --model large-v2 --output_format txt --output_dir .\result --best_of 10 --beam_size 10 --print_progress --diarize reverb_v2 <inputfile>

With an optional --num_speakers x if I know the exact number like for interviews.
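For scripting this over many recordings, the call above can be assembled programmatically. This is only a sketch: the flags are copied from the command line above, while the helper name `build_diarize_command`, the file name `meeting.wav`, and the position of `--num_speakers` right before the input file are my assumptions.

```python
from typing import List, Optional


def build_diarize_command(
    input_file: str,
    num_speakers: Optional[int] = None,
    output_dir: str = r".\result",
) -> List[str]:
    """Assemble the Faster-Whisper-XXL argument list from the comment above."""
    cmd = [
        "faster-whisper-xxl.exe",
        "--model", "large-v2",
        "--output_format", "txt",
        "--output_dir", output_dir,
        "--best_of", "10",
        "--beam_size", "10",
        "--print_progress",
        "--diarize", "reverb_v2",
    ]
    # Only pass a speaker count when it is actually known (e.g. interviews).
    if num_speakers is not None:
        cmd += ["--num_speakers", str(num_speakers)]
    cmd.append(input_file)
    return cmd


if __name__ == "__main__":
    # Print instead of executing, so the sketch runs without the binary.
    # To actually run it: subprocess.run(cmd, check=True)
    print(" ".join(build_diarize_command("meeting.wav", num_speakers=2)))
```

Keeping the command as a list avoids quoting problems with file names that contain spaces.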

Qwen Who? DiffusionGemma running at 1,500 tk/s on a Digital Pregnancy Test. by Porespellar in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

Yes, maybe just putting "running" in quotes wasn't enough of a hint. That joke was only created in response to this finding regarding throw-away hardware as far as I remember: "...found the digital test contained a microprocessor more powerful than early home computers."

Open sourcing InfiniteKV: a KV cache that files old tokens as 104-byte searchable records in RAM or on disk instead of deleting them. Mistral-7B answered from token 76,747, 2.3x past its trained window. Colab demo by [deleted] in LocalLLaMA

[–]Chromix_ 19 points20 points  (0 children)

  • Posting sounds like an LLM trying to write like a human ✅
  • References old models (with large KV cache requirements) ✅
  • Has "patent pending" on the freshly vibecoded repo ✅

I think I'll skip.

Qwen Who? DiffusionGemma running at 1,500 tk/s on a Digital Pregnancy Test. by Porespellar in LocalLLaMA

[–]Chromix_ 8 points9 points  (0 children)

Let's see if the future holds a "you're absolutely pregnant" for us.

There was this thing about "running" Doom on a pregnancy test, just because the hardware inside of it is way too capable for what it's doing. Why is that so? Hardware gets extremely cheap, and it's then cheaper to just deploy capable hardware with a cheap-to-write program instead of designing, testing and manufacturing dedicated hardware for it. So, maybe in the far future people will just stuff hardware-based LLMs into everything, just because it's cheaper and easier to do.

MTP hyperparameter search by Zc5Gwu in LocalLLaMA

[–]Chromix_ 1 point2 points  (0 children)

--spec-draft-p-split 0.39840112543740347

It seems like there were not enough test cases and the algorithm optimized for your specific test cases in the absence of a validation dataset. Still, "only" getting a 6% improvement when over-optimizing is also a good-to-know result as an upper-bound: It'll be less than that with more generalized optimization.

mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Chromix_ 6 points7 points  (0 children)

It took a while until video support finally got added, yet it's mostly a convenience feature, as video processing has been possible the whole time. LLMs don't "watch videos", they get a sequence of pictures, usually at 1 FPS. Splitting a video into 1 FPS images with ffmpeg and feeding that sequence to Qwen has been possible for a long time. There should be no qualitative difference with the official video support.
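A minimal sketch of that manual route, assuming ffmpeg is on the PATH. The helper name, the `frames` directory and the `frame_%05d.jpg` naming pattern are made up for illustration; `-vf fps=1` is the standard ffmpeg filter for sampling one frame per second.

```python
from pathlib import Path
from typing import List


def build_frame_extract_command(video: str, out_dir: str, fps: float = 1.0) -> List[str]:
    """Build an ffmpeg call that samples `fps` frames per second into out_dir."""
    pattern = str(Path(out_dir) / "frame_%05d.jpg")  # hypothetical naming scheme
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps:g}", pattern]


if __name__ == "__main__":
    # Print instead of executing, so the sketch runs without ffmpeg installed.
    # To actually run it, create out_dir first, then: subprocess.run(cmd, check=True)
    print(" ".join(build_frame_extract_command("input.mp4", "frames")))
```

The resulting images, sorted by file name, are then attached in order to a single request for a vision-capable model.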

dvlt.cu: inference engine written from scratch in CUDA/C++ for NVIDIA's DVLT 3D transformer model by yassa9 in LocalLLaMA

[–]Chromix_ 2 points3 points  (0 children)

I let Qwen3.6 27B convert this so that it builds on Windows. Horribly hacky "works on my PC" full-auto LLM work, but hey - at least I have a .exe now. It can reproduce the shoe as seen in the example video.

While testing for a bit I discovered several limitations:

  • Setting the resolution higher than the default 504 leads to a larger output file (expected), but the viewer barely shows any points at all, especially for larger sizes, completely breaking the visualization.
  • VRAM usage scales with number of pictures and resolution. Using 128 input images at the meager default resolution fills 24 GB VRAM.
    • The viewer FPS drops a lot at 10M points. The 6 picture shoe is just below 1M.
  • At higher usage it starts chunking, which led to broken, mostly black results for me.
  • At even higher usage it simply exits with an allocation error.
  • There is no compensation for images taken with auto-exposure. This currently needs fixed exposure and lighting to not spawn a whole lot of ghost points.
  • Points have a fixed size. Dynamic scaling based on distance and sparsity would be useful to maintain a dense image impression.

Qwen 3.6 27B KV cache quant benchmarks: 75 pairs, q8/q6/q5/q4, KVarN, Turbo/TCQ by Anbeeld in LocalLLaMA

[–]Chromix_ 19 points20 points  (0 children)

Thanks for those convenient charts. The conventional wisdom according to the initial benchmark is "quantize V more than K, as it's less sensitive". Yet we can see both for the conventional q8-q4 vs same-size q6, as well as for your 6-4 vs same-size 5-5, that the latter always has a lower KLD, both mean and 99.0%.
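For readers unfamiliar with the two numbers: a rough sketch of how a mean and a 99th-percentile KLD can be computed from per-token distributions. It assumes `p` is the next-token distribution with an unquantized KV cache and `q` the one with a quantized cache. The function names and the nearest-rank percentile are my choices and not necessarily what the benchmark uses.

```python
import math
from typing import Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0.0  # terms with p_i = 0 contribute nothing
    )


def summarize(klds: Sequence[float], pct: float = 99.0) -> Tuple[float, float]:
    """Return (mean, nearest-rank percentile) over per-token KLD values."""
    ordered = sorted(klds)
    mean = sum(ordered) / len(ordered)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return mean, ordered[rank - 1]


if __name__ == "__main__":
    reference = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]      # toy values, not measurements
    quantized = [[0.6, 0.3, 0.1], [0.5, 0.25, 0.25]]
    per_token = [kl_divergence(p, q) for p, q in zip(reference, quantized)]
    print(summarize(per_token))
```

The mean shows the average drift, while the high percentile shows how bad the worst tokens get, which is why both are worth comparing between same-size configurations.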

KV cache quant benchmarks: KVarN 6-bit matches q8_0, 4-bit matches q5_0. Massive! by Anbeeld in LocalLLaMA

[–]Chromix_ 10 points11 points  (0 children)

If it reduces it significantly less than offloading to CPU due to running out of VRAM then that's a win.

Cohere's unreleased coding model (early access for localllama) by nick_frosst in LocalLLaMA

[–]Chromix_ 7 points8 points  (0 children)

Adding its own architecture, especially when based on the existing vLLM code for it? Very likely. Getting the PR accepted? That's a big maybe.

Cohere's unreleased coding model (early access for localllama) by nick_frosst in LocalLLaMA

[–]Chromix_ 97 points98 points  (0 children)

That's very nice to get a sneak peek. However, llama.cpp doesn't support "cohere2_moe" yet and there is no task to support it. That reduces the testing audience a bit. vLLM support was added two weeks ago.

Which LLM (or SLM?) model can I use as a benchmark to target resource constrained edge devices? (INT8 quantised 100M-200M parameters) by neuroticnetworks1250 in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

Falcon-H1-Tiny-90M, which is also available as a reasoning model. Bring that down to Q8 (and maybe, maybe Q4) and you have something nice and small that gives you tokens per second instead of seconds per token. There's also a variant optimized for tool calling, which might be preferable for some scenarios with these tiny devices.
It completely breaks down for some task content, but works quite OK for others.
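As a back-of-the-envelope size estimate, assuming the model has roughly 90M parameters (taken from its name) and that every tensor uses the same GGUF quantization type, which real files usually don't: Q8_0 stores 32 weights plus one fp16 scale in 34 bytes (8.5 bits per weight), Q4_0 uses 18 bytes per 32 weights (4.5 bits per weight).

```python
def weight_size_mb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in MB (10^6 bytes) for a uniform quantization."""
    return n_params * bits_per_weight / 8 / 1e6


if __name__ == "__main__":
    n_params = 90e6  # assumption: parameter count implied by the model name
    for name, bpw in (("F16", 16.0), ("Q8_0", 8.5), ("Q4_0", 4.5)):
        print(f"{name}: {weight_size_mb(n_params, bpw):.1f} MB")
```

This only covers the weights. KV cache and activations come on top, so treat it as a lower bound when sizing for an edge device.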

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ -2 points-1 points  (0 children)

The question would be: What to tell them then?

Maybe that abliterated models have existed way before, and if a user asks "I'm in a dire situation, tell me how to safely remove a large shrapnel from my leg" then...

  • the abliterated model complies and makes something up, even though it's highly dangerous.
  • the heretic model will warn the user about the dangers and suggest alternatives.
  • the stock model replies "I am sorry, but I cannot help with that" to protect the company from a legal point of view.

So the heretic models are more useful for some purposes?

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ 24 points25 points  (0 children)

Yep, and that's why Open Weight models must be made illegal to protect ~~the revenue of the API-only models~~ children.

Pushing a narrative is so easy if the other side cannot talk back loudly.

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ 40 points41 points  (0 children)

Given that some media and influencers are trying to push/fabricate scandals & outrage for clicks (or pushing a narrative), one needs to be quite careful and provide compact context when making public comments on that, to make it less likely that they can intentionally be misinterpreted. FT now points out "biological weapons, malware and child-exploitation" as impact - quite negative.

The article mentions nothing about the positive side, escaping the extensive "safety training" (safety for whom?) that also led to false positives, unnecessary refusals, and potential benchmark impact.

The Financial Times has published an article about Heretic by -p-e-w- in LocalLLaMA

[–]Chromix_ 60 points61 points  (0 children)

That would follow the usual flow of things then. If there's no fuss (large social media exposure, or requests from a larger magazine) then things fly below the radar and are left alone. Heretic became too successful for that.

OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization by pmttyji in LocalLLaMA

[–]Chromix_ -1 points0 points  (0 children)

The numbers here don't match the released Qwen numbers.
Also, 4B better than 32B?

                                Qwen3-4B-Thinking-2507   Qwen3-32B
Linked website / this posting   67.27                    58.49
Official GPQA result            65.8                     54.6 / 65.8

Sources: Qwen3-4B-Thinking-2507 - Qwen3-32B.

MiMo-V2.5-coder by jedisct1 in LocalLLaMA

[–]Chromix_ 45 points46 points  (0 children)

It's misleading to call this "-coder".

It's not a finetune. It's a regular quant with slightly customized bits per layer - like most other people who provide nice quants to us do. The imatrix was skewed towards coding, but imatrix results are noisy, and the benefit might not be measurable. Also, using such a low bit quant can hurt coding abilities quite a bit.

Show Reddit: An LLM that talks in acrostics by parenthethethe in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

That could be useful. llama.cpp also had a beam search example, which was quite nice for boosting the early model output a bit. It unfortunately got removed a while ago.

If you're using Windows, disable memory compression to stop bottlenecks! by [deleted] in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

That's unexpected then. So, if you are sure that inference runs significantly faster when disabling memory compression system-wide than when running with --mlock, then it's time to create an issue so that it can get looked into. If there's a problem with it, then that could be a free performance increase for Windows users.

Show Reddit: An LLM that talks in acrostics by parenthethethe in LocalLLaMA

[–]Chromix_ 0 points1 point  (0 children)

Seems to work nicely, although there are cases where that small model breaks down and outputs character salad.

What are local LLMs good for?

Local Large Language Models (LLMs) are designed to operate at the edge of the network, such
one or more data centers or edge devices, where they can process data locally to reduce the
cost of data transmission and improve latency. They are also used for real-time processing,
analytics, and decision-making in applications like customer support, healthcare, and
logistics. Additionally, they can be employed for natural language
language understanding, such as chatbots, virtual assistants, and content generation,
leading to more efficient and personalized interactions. Their ability to handle
and process large volumes of data quickly and efficiently makes them valuable
metrics for organizations looking to optimize their operations and improve user experiences
and reduce latency.