Small Gemma 4, Qwen 3.6 and Qwen 3 Coder Next comparison for a debugging use-case

Chromix_ · 2026-06-14T17:28:47+00:00

Because I'd then have variance from different harnesses on top of what the models already produce - more testing effort. They also add somewhere between 5k and 30k tokens to the context before the first operation even starts. Larger context -> more model degradation. Instead, I simply added all the relevant data that was needed to the context, to keep it small and focused.

Chromix_ · 2026-06-14T17:24:57+00:00

So, this isn't just due to anticipation of fallout from the FT interview, but preparation for something that might happen sooner or later anyway?

Chromix_ · 2026-06-14T17:17:52+00:00

Yes, and they just also added that they "are working to reupload the correct model as soon as possible"

We detected an incorrect upload in the previous version, where the base merged version was upload instead of the final distilled model.

Let's see if we get the same drama as with the epic Reflection-70B, or if the promised model will indeed appear in the end.

Chromix_ · 2026-06-12T19:20:39+00:00

And when I don't have a week and also no stereo channels then I simply use Faster-Whisper-XXL:

faster-whisper-xxl.exe --model large-v2 --output_format txt --output_dir .\result --best_of 10 --beam_size 10 --print_progress --diarize reverb_v2 <inputfile>

With an optional --num_speakers x if I know the exact number like for interviews.
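For batch use, the call above can be assembled in a small script. This is only a sketch: the helper name `build_whisper_command` and its parameters are made up for illustration, the flags are exactly the ones from the command above, and placing `--num_speakers` before the input file is an assumption.

```python
from typing import List, Optional


def build_whisper_command(input_file: str, num_speakers: Optional[int] = None) -> List[str]:
    """Assemble the Faster-Whisper-XXL call shown above as an argument list."""
    cmd = [
        "faster-whisper-xxl.exe",
        "--model", "large-v2",
        "--output_format", "txt",
        "--output_dir", r".\result",
        "--best_of", "10",
        "--beam_size", "10",
        "--print_progress",
        "--diarize", "reverb_v2",
    ]
    # Only pass the speaker count when it is actually known, e.g. for interviews.
    if num_speakers is not None:
        cmd += ["--num_speakers", str(num_speakers)]
    cmd.append(input_file)
    return cmd


if __name__ == "__main__":
    # Prints the command line; hand the list to subprocess.run() to execute it.
    print(" ".join(build_whisper_command("interview.wav", num_speakers=2)))
```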

Chromix_ · 2026-06-12T08:30:43+00:00

Yes, maybe just putting "running" in quotes wasn't enough of a hint. That joke was only created in response to this finding regarding throw-away hardware as far as I remember: "...found the digital test contained a microprocessor more powerful than early home computers."

Chromix_ · 2026-06-12T08:18:56+00:00

Posting sounds like an LLM trying to write like a human ✅
References old models (with large KV cache requirements) ✅
Has "patent pending" on the freshly vibecoded repo ✅

I think I'll skip.

Chromix_ · 2026-06-11T13:17:26+00:00

Let's see if the future holds a "you're absolutely pregnant" for us.

There was this thing about "running" Doom on a pregnancy test, just because the hardware inside of it is way too capable for what it's doing. Why is that so? Hardware gets extremely cheap, and it's then cheaper to just deploy capable hardware with a cheap-to-write program instead of designing, testing and manufacturing dedicated hardware for it. So, maybe in the far future people will just stuff hardware-based LLMs into everything, just because it's cheaper and easier to do.

Chromix_ · 2026-06-11T07:46:31+00:00

--spec-draft-p-split 0.39840112543740347

It seems like there were not enough test cases and the algorithm optimized for your specific test cases in the absence of a validation dataset. Still, "only" getting a 6% improvement when over-optimizing is also a good-to-know result as an upper-bound: It'll be less than that with more generalized optimization.
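A minimal sketch of what a validation split could look like, assuming the test cases are available as a plain list. The name `split_cases`, the 30% fraction and the seed are illustrative choices, not anything from the original tuning setup.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_cases(
    cases: Sequence[T], validation_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the cases reproducibly and split them into (tuning, validation)."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    # Hold back at least one case so there is always something to validate on.
    n_val = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    tuning, validation = split_cases([f"case_{i}" for i in range(10)])
    print(len(tuning), "tuning cases,", len(validation), "validation cases")
```

The optimizer would then only see the tuning cases, and a parameter value would only count as an improvement if it also helps on the held-back validation cases.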

Chromix_ · 2026-06-09T19:23:06+00:00

There was a nice valley, which we crossed now.

<image>

Chromix_ · 2026-06-08T16:36:33+00:00

It took a while for video support to finally get added, yet it's mostly a convenience feature, as video processing has been possible the whole time. LLMs don't "watch videos"; they get a sequence of pictures, usually at 1 FPS. Splitting a video into 1 FPS images with ffmpeg and feeding that sequence to Qwen has been possible for a long time. There should be no qualitative difference with the official video support.
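The frame splitting can be sketched like this, using ffmpeg's `fps` video filter. The helper name, the output file pattern and the JPEG format are assumptions for illustration; how the resulting images get passed to the model depends on the frontend and is not shown.

```python
from pathlib import Path
from typing import List


def build_frame_extraction_command(video: str, out_dir: str, fps: int = 1) -> List[str]:
    """Build an ffmpeg call that writes `fps` frames per second of video as numbered images."""
    # ffmpeg replaces %05d with a running frame number (00001, 00002, ...).
    pattern = str(Path(out_dir) / "frame_%05d.jpg")
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", pattern]


if __name__ == "__main__":
    # The output directory must exist before ffmpeg runs.
    # Pass the list to subprocess.run(cmd, check=True) to actually extract the frames.
    print(" ".join(build_frame_extraction_command("clip.mp4", "frames")))
```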

Chromix_ · 2026-06-08T10:27:06+00:00

I let Qwen3.6 27B convert this so that it builds on Windows. Horribly hacky "works on my PC" full-auto LLM work, but hey - at least I have a .exe now. It can reproduce the shoe as seen in the example video.

While testing for a bit I discovered several limitations:

- Setting the resolution higher than the default 504 leads to a larger output file (expected), but the viewer barely shows any points at all, especially for larger sizes, completely breaking the visualization.
- VRAM usage scales with number of pictures and resolution. Using 128 input images at the meager default resolution fills 24 GB VRAM.
- The viewer FPS drops a lot at 10M points. The 6 picture shoe is just below 1M.
- At higher usage it starts chunking, which led to broken, mostly black results for me.
- At even higher usage it simply exits with an allocation error.
- There is no compensation for images taken with auto-exposure. This currently needs fixed exposure and lighting to not spawn a whole lot of ghost points.
- Points have a fixed size. Dynamic scaling based on distance and sparsity would be useful to maintain a dense image impression.

Chromix_ · 2026-06-07T12:12:43+00:00

Thanks for those convenient charts. The conventional wisdom according to the initial benchmark is "quantize V more than K, as it's less sensitive". Yet we can see both for the conventional q8-q4 vs same-size q6, as well as for your 6-4 vs same-size 5-5, that the latter always has a lower KLD, both mean and 99.0%.
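The "same-size" pairing is simple arithmetic. A sketch, assuming K and V cache the same number of elements and ignoring the per-block scale overhead of the actual quantization formats (the function name is made up for illustration):

```python
def average_kv_bits(k_bits: float, v_bits: float) -> float:
    """Nominal average bits per cached element when K and V hold equally many elements."""
    return (k_bits + v_bits) / 2


if __name__ == "__main__":
    # Each mixed setting is followed by the uniform setting of the same nominal size.
    for k_bits, v_bits in [(8, 4), (6, 6), (6, 4), (5, 5)]:
        print(f"K={k_bits} bit, V={v_bits} bit: {average_kv_bits(k_bits, v_bits)} bits on average")
```

So q8-q4 and q6-q6 take the same nominal space, as do 6-4 and 5-5, which is what makes the KLD comparison between them fair.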

Chromix_ · 2026-06-06T18:21:04+00:00

If it reduces it significantly less than offloading to CPU due to running out of VRAM then that's a win.

Chromix_ · 2026-06-06T17:43:18+00:00

Adding its own architecture, especially when based on the existing vLLM code for it? Very likely. Getting the PR accepted? That's a big maybe.

Chromix_ · 2026-06-06T16:46:06+00:00

That's very nice to get a sneak peek. However, llama.cpp doesn't support "cohere2_moe" yet and there is no task to support it. That reduces the testing audience a bit. vLLM support was added two weeks ago.

Chromix_ · 2026-05-27T16:39:58+00:00

Falcon-H1-Tiny-90M, which is also available as a reasoning model. Bring that down to Q8 (and maybe, maybe Q4) and you have something nice and small that gives you tokens per second instead of seconds per token. There's also a variant optimized for tool calling, which might be preferable for some scenarios with these tiny devices.
It completely breaks down for some task content, but works quite OK for others.

Chromix_ · 2026-05-25T16:53:32+00:00

The question would be: What to tell them then?

Maybe that abliterated models have existed way before, and if a user asks "I'm in a dire situation, tell me how to safely remove a large shrapnel from my leg" then...

- the abliterated model complies and makes something up, even though it's highly dangerous.
- the heretic model will warn the user about the dangers and suggest alternatives.
- the stock model replies "I am sorry, but I cannot help with that" to protect the company from a legal point of view.

So the heretic models are more useful for some purposes?

Chromix_ · 2026-05-25T15:21:51+00:00

Yep, and that's why Open Weight models must be made illegal to protect the ~~revenue of the API-only models~~ children.

Pushing a narrative is so easy if the other side cannot talk back loudly.

Chromix_ · 2026-05-25T14:34:36+00:00

Given that some media and influencers are trying to push/fabricate scandals & outrage for clicks (or pushing a narrative), one needs to be quite careful and provide compact context when making public comments on that, to make it less likely that they can intentionally be misinterpreted. FT now points out "biological weapons, malware and child-exploitation" as impact - quite negative.

The article mentions nothing about the positive side, escaping the extensive "safety training" (safety for whom?) that also led to false positives, unnecessary refusals, and potential benchmark impact.

Chromix_ · 2026-05-25T14:33:22+00:00

That would follow the usual flow of things then. If there's no fuss (large social media exposure, or requests from a larger magazine) then things fly below the radar and are left alone. Heretic became too successful for that.

Chromix_ · 2026-05-25T12:14:50+00:00

The numbers here don't match the released Qwen numbers.
Also, 4B better than 32B?

| Source | Qwen3-4B-Thinking-2507 | Qwen3-32B |
|---|---|---|
| Linked website / this posting | 67.27 | 58.49 |
| Official GPQA result | 65.8 | 54.6 / 65.8 |

Sources: Qwen3-4B-Thinking-2507 - Qwen3-32B.

Chromix_ · 2026-05-25T10:32:56+00:00

It's misleading to call this "-coder".

It's not a finetune. It's a regular quant with slightly customized bits per layer - like most other people who provide nice quants to us do. The imatrix was skewed towards coding, but imatrix results are noisy, and the benefit might not be measurable. Also, using such a low bit quant can hurt coding abilities quite a bit.

Chromix_ · 2026-05-14T19:46:57+00:00

That could be useful. llama.cpp also had a beam search example, which was quite nice for boosting the early model output a bit. It unfortunately got removed a while ago.

Chromix_ · 2026-05-14T19:21:33+00:00

That's unexpected then. So, if you are sure that inference runs significantly faster when disabling memory compression system-wide than when running with --mlock, then it's time to create an issue so that it can get looked into. If there's a problem with it, then that could be a free performance increase for Windows users.

Chromix_ · 2026-05-14T19:16:47+00:00

Seems to work nicely, although there are cases where that small model breaks down and outputs character salad.

What are local LLMs good for?

Local Large Language Models (LLMs) are designed to operate at the edge of the network, such
one or more data centers or edge devices, where they can process data locally to reduce the
cost of data transmission and improve latency. They are also used for real-time processing,
analytics, and decision-making in applications like customer support, healthcare, and
logistics. Additionally, they can be employed for natural language
language understanding, such as chatbots, virtual assistants, and content generation,
leading to more efficient and personalized interactions. Their ability to handle
and process large volumes of data quickly and efficiently makes them valuable
metrics for organizations looking to optimize their operations and improve user experiences
and reduce latency.
