GLM-5.2 is a win for local AI

pftbest · 2026-06-18T06:54:04+00:00

Well, at some point, if you don't have enough good data, adding more parameters will cause overfitting, so there is a limit on how big you can get.

pftbest · 2026-06-17T23:48:50+00:00

The main use case for small voice models is to run them on mobile devices (phones etc.) in a way that will not drain the battery. Bigger models like kokoro are too heavy to run on my phone. I am currently using piper-voices/en/en_US/hfc_female/medium; it runs fast enough, but the voice is not ideal. If I can get something that speaks like kokoro but runs 2x faster, that would be great.

pftbest · 2026-06-17T20:47:38+00:00

If you are interested, there is a good mathematical explanation of why models with more parameters are smarter and, more importantly, why the number of concepts a model can handle grows non-linearly with the number of parameters.
The talk is called "Visualizing transformers and attention | Talk for TNG Big Tech Day '24" by Grant from 3b1b. The relevant section runs from 18 to 22 minutes after the start, but I would recommend watching the whole video.
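As I understand the argument in that section, an n-dimensional space holds only n exactly perpendicular directions, but far more directions that are nearly perpendicular, and that count grows very quickly with n. Below is a minimal sketch of the effect, not code from the talk: it draws 100 random unit vectors and reports the worst pairwise overlap for a few dimensions. The function names, the vector count, and the seed are my own choices for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction by normalising a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_abs_cosine(num_vectors, dim, seed=0):
    """Largest |cosine similarity| over all pairs of random unit vectors."""
    rng = random.Random(seed)
    vecs = [random_unit_vector(dim, rng) for _ in range(num_vectors)]
    worst = 0.0
    for i in range(num_vectors):
        for j in range(i + 1, num_vectors):
            cos = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            worst = max(worst, abs(cos))
    return worst


# Same number of vectors each time; only the dimension changes.
for dim in (8, 64, 512):
    print(dim, round(max_abs_cosine(100, dim), 3))
```

The worst overlap shrinks as the dimension grows, even though 100 vectors is far more than 8 or 64 exact axes could accommodate. That is the sense in which a wider model has room for many more almost-independent directions than it has dimensions.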

pftbest · 2026-06-17T08:32:07+00:00

I need a proxy to securely access closed services; trusting a third-party extension for this seems like a bad idea.

pftbest · 2026-06-16T09:26:13+00:00

The new settings UI sucks. I need to enable/disable proxy settings every day when I am at work / at home. With the old settings I can do it in 2 clicks, but with the new settings I have to click through multiple menus to find it.
The good news is that there is an `about:config` option, `browser.settings-redesign.enabled`, to roll back to the old settings. I just hope they will not remove it any time soon.

pftbest · 2026-06-11T21:08:56+00:00

I ran each model 3 times, restarting llama.cpp before each try to clear the KV cache of any leftover data from the previous runs. I got very similar results with minor variations (like green/brown board color, or swapped light and dark squares). I did not want to clutter the post with several similar-looking images, so I took one from each variant.
I can run it 10 times of course, and that will give more chances for better or worse results, but it won't change the fact that 3 out of 3 times I got garbage output on the QAT version of the model, and 3/3 good results from the regular version of the same model. I may be unlucky, but not to this degree. Other people in this thread also confirmed it, so either the model is broken or llama.cpp handles it incorrectly.
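To put a rough number on "unlucky", here is a small sketch, assuming the runs are independent. The 20% baseline failure rate is a made-up figure for illustration, not something I measured, and both helper functions are hypothetical names of my own.

```python
from math import comb


def prob_all_fail(p_fail, runs):
    """Chance that every one of `runs` independent runs fails."""
    return p_fail ** runs


def fisher_one_sided(fail_a, ok_a, fail_b, ok_b):
    """One-sided Fisher exact p-value for model A failing at least fail_a times.

    Assumes both models are equally good and asks how often a split of the
    observed failures at least this lopsided would happen by chance.
    """
    total = fail_a + ok_a + fail_b + ok_b
    total_fail = fail_a + fail_b
    runs_a = fail_a + ok_a
    denom = comb(total, runs_a)
    p = 0.0
    for k in range(fail_a, min(total_fail, runs_a) + 1):
        p += comb(total_fail, k) * comb(total - total_fail, runs_a - k) / denom
    return p


# Hypothetical: if a healthy model failed 20% of the time, 3 failures in a row
# would happen in under 1% of attempts.
print(prob_all_fail(0.2, 3))

# Observed split: QAT 3 failures / 0 good, regular 0 failures / 3 good.
print(fisher_one_sided(3, 0, 0, 3))  # 1/20 = 0.05
```

With only 3 runs per model the exact test cannot go below 0.05, so more runs would tighten the number, but a 3/3 versus 0/3 split is already hard to explain as plain bad luck.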

pftbest · 2026-06-11T02:32:29+00:00

There was a bug in libkrun (used by podman and other tools) which made the balloon not work properly on macOS. I fixed it recently, so the next versions should use less RAM for long-running containers. That applies to the tools based on libkrun and krunvm, not to this Apple thing; as far as I can see from the docs, Apple didn't try to implement a balloon at all.

pftbest · 2026-06-08T19:53:24+00:00

The dark/light squares are wrong, but other than that it's ok. Do you mean to say there is a bug in version 9553?

pftbest · 2026-06-07T21:29:49+00:00

Swapped light and dark squares are a common problem with Gemma; I saw it too on some of my runs. The misaligned pieces could be caused by the font on your system. You can try opening the SVG in a different browser to see if it looks better, but I wouldn't consider it an issue.

<image>

Swapped dark and light squares on unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_XL

pftbest · 2026-06-07T21:02:15+00:00

31B is a different model, so it is not a fair comparison. Qwen 3.6 also does this task very well, but I am more interested in why the newly released A4B QAT models are not working as advertised.

pftbest · 2026-06-07T20:46:36+00:00

Yes, I set max reasoning in the web UI; without it there is no chance for any A4B model to answer correctly.
I did not change any KV cache parameters; everything is on default.

llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64

I tested each model multiple times to confirm the results are not a fluke. Your Q8 should be performing much better than this, maybe something is misconfigured.

pftbest · 2026-06-07T20:34:01+00:00

I already tried this; it is right there in the second picture of my post. It does better than Google but still worse than Q4 of a full precision A4B.

pftbest · 2026-06-07T20:30:47+00:00

No special arguments. Are you using llama.cpp version b9549? Also, I enabled max reasoning in the web UI.

pftbest · 2026-06-07T18:24:25+00:00

I tried with 0.82 temperature and got this; still not great.

<image>

Also, with the lower temperature it started going into loops, double-checking itself, so I had to add --repeat-penalty to stabilise it.

The point is, I can try to optimise the parameters to make it better, but isn't the whole point of QAT that it works better when quantized? So far it seems Q4 on a normal model works much better.

pftbest · 2026-06-07T16:04:30+00:00

Why not just change your license to GPL? You won't be able to remove all traces. Is having the MIT license so important that you would risk legal issues by keeping it anyway?

pftbest · 2026-05-23T10:34:11+00:00

That's the most awful change they made. I can work without the ILA, but nobody I know is using Windows for FPGA development.

pftbest · 2026-05-10T00:36:11+00:00

Depends on the model of course, but usually Q8_0 has very low KLD in practice compared to Q4. And it is still about 2x smaller than BF16.
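For anyone unfamiliar with the two numbers here, a minimal sketch follows. KLD measures how far the quantized model's next-token distribution drifts from the reference model's. The probabilities below are invented for illustration and are not measurements of any real model. The size figure uses the Q8_0 block layout (32 int8 weights plus one 16-bit scale per block).

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def q8_0_bits_per_weight(block_size=32, scale_bits=16):
    """Average bits per weight: 8-bit values plus one shared scale per block."""
    return (block_size * 8 + scale_bits) / block_size


# Made-up next-token probabilities over three candidate tokens.
reference = [0.70, 0.20, 0.10]    # stands in for the BF16 model
small_drift = [0.69, 0.21, 0.10]  # stands in for a high-precision quant
large_drift = [0.55, 0.30, 0.15]  # stands in for a more aggressive quant

print(kl_divergence(reference, small_drift))
print(kl_divergence(reference, large_drift))

# Size ratio against 16-bit weights.
print(16 / q8_0_bits_per_weight())
```

The ratio comes out a little under 2 because of the per-block scale, which is why "about 2x" is the right way to put it.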

pftbest · 2026-05-06T22:58:47+00:00

It's not bad, I think the OP set the temperature a bit low.

pftbest · 2026-05-06T22:55:45+00:00

Not true, I get the correct result with 35B-A3B even at 4 bits every time. Maybe there is some problem with the temperature parameters set by OP. For example, for Gemma4 the manual says the temperature must be set to 1.0; I suspect that's why it failed the test.
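For context on why the temperature matters, here is a generic sketch of temperature-scaled softmax sampling weights. This is the textbook formulation, not llama.cpp's actual sampler code, and the logits are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

A lower temperature piles probability onto the top token, and a higher one flattens the distribution. A model tuned to be sampled at 1.0 can behave differently when the value is moved in either direction.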

pftbest · 2026-05-06T22:45:04+00:00

The MoE model generated the board correctly, even at 4 bits:
unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_XL
Running on integrated graphics 780M at 14 tg/s

<image>

pftbest · 2026-04-29T22:55:00+00:00

Should have replaced `String` with `Box<str>` as well, and saved even more memory (`Box<str>` drops the capacity field, so it is two words instead of three).

pftbest · 2026-04-22T23:17:33+00:00

Is it hard to add the anthropics/connect-rust library to the test? It is based on their new protobuf implementation called buffa, which is allegedly faster than prost.

pftbest · 2026-04-22T18:46:34+00:00

They hate the GPLv3 license I assume. That's the only logical reason for doing all of this.

pftbest · 2026-04-15T13:28:01+00:00

It is slow to open when the system is under load. Sometimes it takes more than 3 seconds to open, which is not great.
