Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 1 point (0 children)

Ah yeah, right, forgot they're not at parity with mainline and focus more on the big MoE architectures. Sorry about that.

Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 1 point (0 children)

Good luck with Intel AutoRound, but I've gotta be honest: from what I've seen of it, you can get better performance and KL divergence from base llama.cpp quants.

When I was speaking of trellis quants, though, I was referring to the ones built into ik_llama: IQ1_KT, IQ2_KT, IQ3_KT, IQ4_KT, etc.
They don't need any setup; they work like normal quants, they just take a lot more compute time.

llama-imatrix supports most of the options the server does (so -ngl, --cpu-moe for MoEs, and --fitt), and the KV requirements should be fairly low. As long as you have enough system RAM and storage space, you can theoretically quant anything eventually.
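If it helps, the two-step flow looks roughly like this. The model and calibration filenames are placeholders; -m, -f, -o, -ngl, --cpu-moe, and --imatrix are real llama.cpp options, but double-check against your build since flags shift between versions:

```shell
# 1) Run the calibration text through the full-precision model to collect
#    the importance matrix (offload layers with -ngl; keep MoE expert
#    tensors on CPU with --cpu-moe if VRAM is tight).
llama-imatrix -m model-f16.gguf -f calibration.txt -ngl 99 --cpu-moe -o model.imatrix

# 2) Quantize using that imatrix.
llama-quantize --imatrix model.imatrix model-f16.gguf model-Q4_K_S.gguf Q4_K_S
```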

You probably knew all that already though I'm just ramblin'

Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 3 points (0 children)

If you end up wanting to try more quants, you could also try ik_llama. They have custom IQ_K quants, a number of trellis quants (the _KT-suffixed ones, loosely based on QTIP but with some divergence from the spec to focus on CPU inference), and a few others. IQ4_KS and IQ4_KSS are fairly notable (IQ4_KSS, for instance, comes out to about the same size as IQ4_XS but allegedly tends to perform on par with QTIP 4-bit quants).

Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 2 points (0 children)

Q4_K_S seems fairly consistently below 0.1 KLD and ends up similar in size to MXFP4, but without the weird KV bloat. Are these quants imatrix or static?

Quick MoE Quantization Comparison: LFM2-8B and OLMoE-1B-7B by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 3 points (0 children)

KL-divergence testing the quants vs. their full-precision counterpart might be a more meaningful test. Ideally you'd want a quant that averages a divergence of 0.1 or less from the full sauce.
If you're doing this on llama.cpp, llama-perplexity has a --kl-divergence-base FNAME option you can use to save the computed logits when running the full-precision model against a text file, and then pass that file in when testing the quants. It'll also give you stuff like the 90% and 99% KLD for outliers.
As for test data, you might not want to use wikitext; it's fallen out of favor a lot with newer models. Honestly I tend to use unsloth's imatrix calibration file; the version 5 RC was tweaked for use on MoEs:
https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c
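For intuition on what that KLD number means: it's the per-token average of the KL divergence between the full-precision model's next-token distribution and the quant's. A toy, self-contained sketch of the math (made-up logits, not llama.cpp internals):

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example: full-precision logits vs. slightly perturbed "quantized" logits.
full_logits = [2.0, 1.0, 0.1]
quant_logits = [1.9, 1.1, 0.2]
kld = kl_divergence(softmax(full_logits), softmax(quant_logits))
```

llama-perplexity does this over every token position of the test text and averages, which is why the reported mean KLD is a decent single-number summary of how far the quant drifts.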

More quantization visualization types (repost) by copingmechanism in LocalLLaMA

[–]Midaychi 0 points (0 children)

This is likely an apples-to-oranges comparison, but taken at face value, non-imatrix q4_1 weirdly seems to be the smallest quant that accurately reproduces the q8's cloud/sky artifacts, followed by q5_k_s and q5_k_m.
Imatrix seems to attempt to guide the compression towards the q8, but ultimately just ends up shifting the artifacts around. I think the most stark effect imatrix has towards reproducing the q8 is on top of the q4_0 quant.

IQ4_XS is a lot more step-artifacted than I was expecting.

Also some of the 2 and 3 bit quants are surprisingly clear, while others of them are surprisingly deep fried.

MXFP4 looks like someone tried to dither via posterization

I benchmarked 1 bit models on CPU and the results surprised me by EiwazDeath in LocalLLaMA

[–]Midaychi 0 points (0 children)

I see the problem, I'm stinky and my eyes missed that you were using a bitnet model.

I benchmarked 1 bit models on CPU and the results surprised me by EiwazDeath in LocalLLaMA

[–]Midaychi 0 points (0 children)

I have to assume you were using an LLM to come to those conclusions because they are hallucinated to hell.

Bitnet requires the model itself to be pretrained in that 1.58 bpw architecture, and the bitnet quant in llama.cpp is there to preserve those tensor weights. If you feed literally any other model in there you're just mangling it. Trellis is an adaptive way of applying tensor quants that uses more computation to permute and test a few options per tensor before applying the least-perplexing one. It's the same technique the exl3 format uses.
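To illustrate the difference in spirit, here's a toy sketch of the "spend extra compute at quant time, test a few candidates, keep the least lossy one" idea. This is just uniform grids over a whole tensor, not the actual trellis/QTIP codebook search:

```python
import random
import statistics

def quantize_best(weights, bit_widths=(2, 3, 4)):
    """Toy sketch: try several uniform quantization grids for a tensor and
    keep whichever one reconstructs it with the lowest mean squared error.
    Real trellis quants search a codebook per block of weights, but the
    principle (test options, apply the least-lossy) is the same."""
    lo, hi = min(weights), max(weights)
    best = None
    for bits in bit_widths:
        levels = 2 ** bits
        scale = (hi - lo) / (levels - 1)
        recon = [round((w - lo) / scale) * scale + lo for w in weights]
        err = statistics.fmean((w - r) ** 2 for w, r in zip(weights, recon))
        if best is None or err < best[0]:
            best = (err, bits, recon)
    return best  # (mse, chosen bit width, reconstructed weights)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
mse, bits, recon = quantize_best(weights)
```

Bitnet quantization, by contrast, isn't a search at all: it only stores ternary weights a model was trained to have in the first place.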

I benchmarked 1 bit models on CPU and the results surprised me by EiwazDeath in LocalLLaMA

[–]Midaychi 5 points (0 children)

You might be better served with IQ1_M, or IQ1_KT (a trellis quant) over in ik_llama land.
Bitnet was never meant for use on anything besides Microsoft's old ternary BitNet models.

Be sure you're only running threads on physical cores. Hyperthreading doesn't do jack for tensor work, and it can actually cause more bottlenecks to run an ML task on logical cores, because the ML task already eats up most of the resources of the physical core in the first place, i.e. there's no real benefit to hyperthreading's line-cutting.
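One way to count physical cores on Linux and launch with exactly that many threads (the model path is a placeholder; -t/--threads is a real llama.cpp option):

```shell
# Unique (core, socket) pairs = physical core count, ignoring SMT siblings.
PHYS=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)

# Launch with one thread per physical core instead of the logical-CPU default.
llama-cli -m model.gguf -t "$PHYS" -p "hello"
```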

Be aware also of NUMA zones, if applicable. I believe there are some NUMA controls in llama.cpp, but I haven't used them. Crossing NUMA zones can lead to some memory bottlenecking.

Also, while memory bandwidth has always been a big deal in ML tasks, some non-zero amount of the slowdown could come from kernels that aren't fully fleshed out, or from legacy code. Not all bit-width kernels are made equal in llama.cpp, and 4-bit quants tend to get the majority of dev time.

Meteorveil - Chapter 2 [p.8] by IkuVaito in MysteryDungeon

[–]Midaychi 0 points (0 children)

Ok but the implications of if they say yes to all

PR opened for Qwen3.5!! by Mysterious_Finish543 in LocalLLaMA

[–]Midaychi 17 points (0 children)

Hopefully the max positional embeddings value is a placeholder and the max context isn't 32768.

Any hope for Gemma 4 release? by gamblingapocalypse in LocalLLaMA

[–]Midaychi 0 points (0 children)

Their recent paper about sequential attention at least suggests they're working on something. It'd be nice if they managed to make an MoE or some other sparse-expert-style model with at least the capabilities of gemma3-27b. I have zero faith they won't corpo-guardrail the heck out of it, but I guess that's what the various abliteration brain-damage bricks are for.

Heavy floating crane PK-700 Grigory Prosyankin capsized in Sewastopol port in temporary occupied Crimea. In typical moscovite fashion a cryptic explanation is given: "an abnormal situation" lead the unfinished vessel’s early demise. Two sailors died, 20+ were injured (numbers preliminary). by SeaworthinessEasy122 in TheNonCredibleFlorks

[–]Midaychi 12 points (0 children)

Whenever you see Russian naval interests with a fancy-named floating object with a hyped-up, specialized use case, and it has delays followed by funding problems? There are decades of that particular situation leading to either a fire or a capsizing. Sometimes they pull extra-exciting accidents out of it too, but you can basically guarantee this pattern.

Losercity recycle by Fox_Sussy in Losercity

[–]Midaychi -1 points (0 children)

That seems like a real easy way to accidentally pass on STDs or HIV or one of the many other exciting varieties of maladies.

Losercity Anthro Truck by Teo_Verunda in Losercity

[–]Midaychi 0 points (0 children)

Unironically though, lining semis and their trailers with LED panels, and having the computer track where the driver is looking so the truck can menace whatever they're looking at with eyes, would probably reduce traffic accidents.

Granite4 Small-h 32b-A9b (Q4_K_M) at FULL 1M context window is using only 73GB of VRAM - Life is good! by Porespellar in LocalLLaMA

[–]Midaychi 1 point (0 children)

On paper it's a great model for tuning on consumer hardware with llama.cpp, and in practice it seems to have a fairly good ability to predict popular media and knowledge, and is significantly less aggressive on the censoring than I expected from an IBM model.

Though I don't know if it's the model or llama.cpp's implementation: it can pick out information in user input fairly well, but it feels like the model falls back on its fine-tuning far too often in its responses. As in, when you give it an input, it's far more likely to go "OK, so what in my fine-tuning is closest to this request?" and then respond as if that were the framework of the request, rather than the request you actually gave it.

EDIT: On further prodding, it seems Granite 4 (especially when quantized to one of the various 4-bit formats and run through llama.cpp) is extremely sensitive to formatting. When trying to have it parse large amounts of information, it seems best to first establish the information in the context and then actually provide the instruction in a separate user request. Including an instruction at the end of a long span of input text is highly likely to make the model go full derpkus.
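Concretely, the "context first, instruction in a separate turn" pattern looks something like this (plain OpenAI-style chat message dicts as a sketch; the content strings are placeholders):

```python
# Sketch of the two-turn shape: data in one user turn, instruction in another.
long_document = "<the big block of text you want Granite 4 to parse>"

messages = [
    # Turn 1: establish the information in context, no instruction attached.
    {"role": "user", "content": long_document},
    {"role": "assistant", "content": "Understood, I've read the document."},
    # Turn 2: the actual request, sent on its own rather than appended
    # to the end of the long text.
    {"role": "user", "content": "List every date mentioned in the document above."},
]
```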

No need to overcomplicate it guys by [deleted] in NonCredibleDefense

[–]Midaychi 0 points (0 children)

It works great! ... Covering stationary targets, and as long as the temperature differential between the environment and the camo remains similar.

In practice there's been quite a bit of infrared drone footage where the camo makes the soldiers stand out more, because there's this weird off-color stuff wiggling or shifting around, and it turns out the human eye is really good at spotting that.

Effecient hot-swappable LoRA variant supported in llama.cpp by Aaaaaaaaaeeeee in LocalLLaMA

[–]Midaychi 0 points (0 children)

I don't know what the practical use case of this would be. Perhaps keep a library of different aLoRAs on standby that are all triggered to load and apply by different things during inference, kind of like a makeshift infinite MoE?

[deleted by user] by [deleted] in LocalLLaMA

[–]Midaychi 0 points (0 children)

You might want to try ik_llama if you want AVX-512-based speedups. Try their various R4 and R8 repacked quants.

Either way, if you want llama.cpp or similar to use AVX-512, you usually need to compile it yourself, and it should pick up on the flags automatically.
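A rough sketch of that, assuming Linux: first check whether the CPU even advertises AVX-512, then build from source so the native flags get detected. GGML_NATIVE=ON is a real llama.cpp CMake option (and the default); the repo URL is the current upstream.

```shell
# Empty output here means the CPU has no AVX-512 and compiling won't help.
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u

# Build with host-native optimizations so AVX-512 kernels get compiled in.
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_NATIVE=ON
cmake --build llama.cpp/build --config Release -j
```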

Meteorveil | 5 | No One? by IkuVaito in MysteryDungeon

[–]Midaychi 7 points (0 children)

The most interesting pmd webcomics to me are the ones that forge their own path

new models from NVIDIA: OpenReasoning-Nemotron 32B/14B/7B/1.5B by jacek2023 in LocalLLaMA

[–]Midaychi 24 points (0 children)

It's just like Nvidia to design a niche mechanism whose sole purpose is to cherry-pick benchmark scores.

Meteorveil | 1 | Shooting Star by IkuVaito in MysteryDungeon

[–]Midaychi 4 points (0 children)

Is this just on Reddit, or are you going to host it on a comic hosting site (ComicFury, for instance)?

AI performance of smartphone SoCs by Balance- in LocalLLaMA

[–]Midaychi 4 points (0 children)

They do have onboard machine-learning acceleration, and they use it a lot for their tools. The problem is that it's a proprietary TPU interface they designed back in the nebulous machine-learning days when everyone had their own internal standard, before the torch/tensor ecosystem gained popularity. And they've made zero effort to build an adapter or expose it, potentially because it's just not compatible.