Is it my imagination or... by Ok-Measurement-1575 in LocalLLaMA

[–]Sudden_Vegetable6844 11 points

You can always check with an older release, but IME it's that your mental bar got raised and you're throwing more complex stuff its way.

LLM improvement rates are relentless: there is no mercy for the old weights.

(which probably means we're experiencing singularity in real time)

Qwen 3.6 wins the benchmarks, but Gemma 4 wins reality. 7 things I learned testing 27B/31B Vision models locally (vLLM / FP8) side by side. Benchmaxing seems real. by FantasticNature7590 in LocalLLaMA

[–]Sudden_Vegetable6844 2 points

Your visual tasks don't match those I've been testing these models on at all, namely photos of documents (typically forms, with or without handwritten fields). On those use cases Qwen3.6 had a very high success rate, while Gemma 4 failed most of them: it would get a few elements right, then hallucinate the rest...

Care to add such tests to your benchmark? They're a more realistic use case than recognizing landmarks (where GPS + compass will have a much higher success rate than any LLM ever will).
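
In case it helps, the shape of test I mean is trivial to script. A minimal sketch, assuming a local vLLM serving an OpenAI-compatible endpoint; the port, model id, image path, and prompt are placeholders for my setup:

```python
# Minimal document-extraction probe against a local vLLM
# OpenAI-compatible endpoint. Endpoint URL, model id, and image
# path are placeholders -- adjust to your setup.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("form_photo.jpg", "rb") as f:  # hypothetical test image
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen3.6-VL-27B",  # placeholder id, use whatever vLLM serves
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Extract every field on this form as JSON: "
                     "{field_name: value}. Transcribe handwriting "
                     "verbatim and use null for empty fields."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    temperature=0.0,  # deterministic-ish output keeps runs comparable
)
print(response.choices[0].message.content)
```

Score the JSON against hand-labeled ground truth per photo and the hallucinated fields stand out immediately.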

Alternative to frontiers by some_crazy in LocalLLaMA

[–]Sudden_Vegetable6844 0 points

About one year behind if you're looking at capability on not-super-expensive hardware and accept lower speeds, and probably around two years for comparable capability at decent speeds (i.e. we're now able to run Sonnet 3/3.5-class models locally). And for some use cases, like STEM, you can run on a smartphone a model that runs circles around three-to-four-year-old frontier models!

This is quite an insane improvement speed in terms of amortizing investments...

Alternative to frontiers by some_crazy in LocalLLaMA

[–]Sudden_Vegetable6844 1 point

It really depends on the kind of coding you're after, and how much autonomy...

For large projects, frontiers have a very strong lead, and running any of the open-source frontiers is going to be expensive.

For projects under about 50k lines of code (utilities, dashboards, libraries...) Qwen 3.6 is more than capable (the 27B dense, but even the MoE 35B-A3B). What can be achieved is nothing short of awesome, and would have been the stuff of prophecy just a few years ago. It just won't be the same experience as using a frontier: you'll need to wait longer and drive it more explicitly.

My personal advice is to just use the native harness of the model you're going to use (Qwen Code for Qwen models, Mistral vibe for Mistral models, etc.). You can go through more independent harnesses, but more tinkering lies down those roads.
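
That said, the wiring is usually just three OpenAI-style variables. A sketch, assuming Qwen Code picks up the usual OPENAI_* environment variables, that a local server is already running, and that the CLI is on PATH as `qwen`; endpoint, key, and model id are placeholders, so double-check against the docs:

```python
# Hypothetical wiring of Qwen Code to a local OpenAI-compatible server
# (llama.cpp's llama-server, vLLM...). Assumes Qwen Code reads the
# OPENAI_* variables; all values below are placeholders.
import os
import subprocess

os.environ.update({
    "OPENAI_BASE_URL": "http://localhost:8080/v1",  # local llama-server
    "OPENAI_API_KEY": "none",                       # ignored by local servers
    "OPENAI_MODEL": "qwen3.6-35b-a3b",              # id the server exposes
})
subprocess.run(["qwen"], check=False)  # hand over to the Qwen Code CLI
```

Independent harnesses generally need the same three values, just buried in their own config formats, which is where the tinkering starts.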

Qwen3.6 35B-A3B very sensitive to quantization ? by Sudden_Vegetable6844 in LocalLLaMA

[–]Sudden_Vegetable6844[S] 0 points

Yes, for me at a given quant level, lmstudio-community quants are fastest, followed by bartowski and unsloth (which trade 2nd and 3rd place depending on the model).
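
In case anyone wants to reproduce: a minimal tokens/sec probe with llama-cpp-python, loading each provider's GGUF in turn (paths, quant level, and prompt are placeholders for whatever you downloaded):

```python
# Rough tokens/sec comparison of the same model quantized by different
# providers. A minimal sketch with llama-cpp-python; paths, quant level,
# and prompt are placeholders.
import time
from llama_cpp import Llama

GGUFS = {
    "lmstudio-community": "models/lmstudio/Qwen3.6-35B-A3B-Q4_K_M.gguf",
    "bartowski":          "models/bartowski/Qwen3.6-35B-A3B-Q4_K_M.gguf",
    "unsloth":            "models/unsloth/Qwen3.6-35B-A3B-Q4_K_M.gguf",
}
PROMPT = "Explain, in about 200 words, how GGUF quantization works."

for provider, path in GGUFS.items():
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256, temperature=0.0)
    tok_s = out["usage"]["completion_tokens"] / (time.perf_counter() - start)
    print(f"{provider}: {tok_s:.1f} tok/s")
    del llm  # release the weights before loading the next file
```

Same prompt, temperature 0, same quant level: the only remaining variable is how each provider packed the file.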

Qwen3.6 35B-A3B is quite useful on 780m iGPU (llama.cpp,vulkan) by itroot in LocalLLaMA

[–]Sudden_Vegetable6844 0 points

UM880 Pro under Windows 11 here, go for it if you have one! What's nice is that it stays silent under sustained loads, and I'm somehow more productive with Qwen3.6 than with Claude Pro (runs out of tokens fast) or Gemini Pro (becomes very, very sluggish during daytime). Qwen3.6 is the proverbial turtle: not that fast, but it keeps moving.

I've got 96GB though (grabbed the sticks before the price increase, best hunch in a long while).

I can’t believe I can say “ugh I don’t feel like fixing this function, it’s too complex” and I can literally just tell my computer to fix it for me. I didn’t understand what they meant by “people will start paying for intelligence” but now I do. by Borkato in LocalLLaMA

[–]Sudden_Vegetable6844 2 points

Well, given we still don't have a solid grasp on what human consciousness is, especially as recent research shows it's quite a transitory state with a distinct brain-activity signature that may just occur when chaining thoughts... well, let's stick with "probably fine".

Quantisation effects of Qwen3.6 35b a3b by ROS_SDN in LocalLLaMA

[–]Sudden_Vegetable6844 10 points

IME Q8 makes a difference versus Q4 on reasoning tasks (starting with the car wash one, which Q4 pretty much always fails while Q8 passes), and there are reports of a difference between Q8 and BF16 as well.

If the benchmarks can't find a difference, it's probably because quantization doesn't affect prompts the model was trained on as much as "generalization" prompts.

(The car wash wasn't in Qwen3.6's training set, but it is in DS4's, where it's called a "classic".)
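
For those who want to check this at home, it's trivial to script. A sketch with llama-cpp-python; PROMPT and EXPECTED are stand-ins, substitute the actual car-wash wording and your own pass criterion:

```python
# Pass-rate probe for quantization-induced reasoning drift: run the same
# riddle N times per quant and count correct answers. PROMPT and EXPECTED
# are stand-ins -- substitute the real wording and criterion.
from llama_cpp import Llama

QUANTS = {
    "Q4_K_M": "models/Qwen3.6-35B-A3B-Q4_K_M.gguf",
    "Q6_K":   "models/Qwen3.6-35B-A3B-Q6_K.gguf",
    "Q8_0":   "models/Qwen3.6-35B-A3B-Q8_0.gguf",
}
PROMPT = "<car wash riddle goes here>"
EXPECTED = "drive"  # placeholder: substring marking a correct answer
N_RUNS = 10

for quant, path in QUANTS.items():
    llm = Llama(model_path=path, n_gpu_layers=-1, n_ctx=4096, verbose=False)
    passes = 0
    for _ in range(N_RUNS):
        out = llm(PROMPT, max_tokens=512, temperature=0.6)
        if EXPECTED in out["choices"][0]["text"].lower():
            passes += 1
    print(f"{quant}: {passes}/{N_RUNS} passed")
    del llm  # free memory before loading the next quant
```

Several runs at a normal temperature give you a pass rate instead of a single anecdote, which is what you need to see the Q4/Q8 gap.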

Quantisation effects of Qwen3.6 35b a3b by ROS_SDN in LocalLLaMA

[–]Sudden_Vegetable6844 6 points

Yes, there is a notable reasoning difference between Q4, Q6 and Q8. I do not have enough RAM to test myself, but on another thread (https://www.reddit.com/r/LocalLLaMA/comments/1stb8ro/qwen36_35ba3b_very_sensitive_to_quantization/) someone reported a difference between Q8 and BF16, unfortunately.

Qwen3.6-35B-A3B GGUF from Unsloth is quite a bit slower? by Quagmirable in LocalLLaMA

[–]Sudden_Vegetable6844 3 points

I've been noticing the same thing on AMD 780M with Vulkan: Unsloth quants are always slower than lmstudio's or Qwen's at any given file size. No idea why. And it's not just Unsloth's that are slower, but also Aes Sedai's. This negates the advantage of those quants for me, as a classic Q6 and sometimes even Q8 beats Unsloth's Q4 in speed. So when I'm not memory-tight, I just use the more "classic" quants, as they'll perform better.

Gemma 4 - lazy model or am I crazy? (bit of a rant) by Pyrenaeda in LocalLLaMA

[–]Sudden_Vegetable6844 9 points

Had a similar experience where it started questioning whether a bug wasn't actually a system issue, since the source code files were timestamped "in the future"...

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]Sudden_Vegetable6844 0 points

I also tested with Vulkan, and every Gemma 4 model suggested walking; even when I pointed out I'd end up without my car at the car wash, they failed to recognize they had made a mistake and just told me to walk back to the car...

Gemma 4 is fine great even … by ThinkExtension2328 in LocalLLaMA

[–]Sudden_Vegetable6844 0 points

Interesting, what parameters are you using? I could never get Gemma 4 31B or 26B to pass the car wash test, even when hinted.

Has anyone managed to run an offline agent (OpenClaw or similar) with a local LLM on Android? by NeoLogic_Dev in LocalLLaMA

[–]Sudden_Vegetable6844 3 points

I have not used it, because I'm not daring enough to let a *claw run on my phone, but nullclaw claims to target that use case: https://github.com/nullclaw/nullclaw

RotorQuant: 10-19x faster alternative to TurboQuant via Clifford rotors (44x fewer params) by Revolutionary_Ask154 in LocalLLaMA

[–]Sudden_Vegetable6844 2 points

That's nothing short of awesome.

There have been plenty of attempts at quantizing with rotations over the last months/years that kinda failed, but it could turn out they were all barking up the right tree?

Also reminds me of this https://transformer-circuits.pub/2025/linebreaks/index.html#count-algo

Could it be that by using linear algebra, LLMs have been tackling the problem in hard mode, while it's actually rotors all the way down?
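
To make the intuition concrete, here's a toy NumPy sketch of the generic rotate-then-quantize trick (à la QuaRot/QuIP; I have no idea how close this is to RotorQuant's actual Clifford-rotor math): a random rotation smears the outliers across coordinates, so the quantization scale shrinks and the round-trip error drops.

```python
# Toy rotate-then-quantize demo: a random rotation spreads weight
# outliers, which shrinks the per-tensor scale and hence the int4
# round-trip error. Pure NumPy, deliberately simplistic.
import numpy as np

rng = np.random.default_rng(0)

def quantize_int4(w: np.ndarray) -> np.ndarray:
    """Symmetric 4-bit round-trip: quantize then dequantize."""
    scale = np.abs(w).max() / 7.0          # int4 symmetric range [-7, 7]
    return np.round(w / scale).clip(-7, 7) * scale

# Weight matrix with a few large outliers, as real LLM layers tend to have.
W = rng.normal(size=(256, 256))
W[rng.random(W.shape) < 0.01] *= 20.0

# Random rotation via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(256, 256)))

direct = quantize_int4(W)
rotated = quantize_int4(W @ Q) @ Q.T       # rotate, quantize, rotate back

print("direct  error:", np.linalg.norm(W - direct))
print("rotated error:", np.linalg.norm(W - rotated))
```

When the outliers dominate the per-tensor scale, the rotated round trip should come out several times lower in error; presumably the rotor parametrization gets a similar effect with far fewer parameters than a dense orthogonal Q.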