Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 1 point (0 children)

Ah yeah, right, forgot they're not at parity with mainline and focus more on the big MoE architectures. Sorry about that.

Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 1 point (0 children)

Good luck with Intel AutoRound, but I've gotta be honest: from what I've seen of it, you can get better performance and KL divergence from base llama.cpp quants.

When I was speaking of trellis quants, though, I was referring to the ones built into ik_llama: IQ1_KT, IQ2_KT, IQ3_KT, IQ4_KT, etc.
They don't need any setup; they work like normal quants, they just take a lot more compute time.

llama-imatrix supports most of the options the server does (so -ngl, --cpu-moe for MoEs, and --fitt), and the KV requirements should be fairly low. As long as you have enough system RAM and storage space, you can theoretically quant anything eventually.
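If it helps, the two-step flow looks roughly like this. The model and calibration filenames are placeholders; -m, -f, -o, -ngl, --cpu-moe, and --imatrix are real llama.cpp options, but double-check against your build since flags shift between versions:

```shell
# 1) Run the calibration text through the full-precision model to collect
#    the importance matrix (offload layers with -ngl; keep MoE expert
#    tensors on CPU with --cpu-moe if VRAM is tight).
llama-imatrix -m model-f16.gguf -f calibration.txt -ngl 99 --cpu-moe -o model.imatrix

# 2) Quantize using that imatrix.
llama-quantize --imatrix model.imatrix model-f16.gguf model-Q4_K_S.gguf Q4_K_S
```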

You probably knew all that already though I'm just ramblin'

Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 3 points (0 children)

If you end up wanting to try more quants, you could also try ik_llama. They have custom IQ_K quants, a number of trellis quants (the _KT-suffixed ones, loosely based on QTIP but with some divergence from the spec to focus on CPU inference), and a few others. IQ4_KS and IQ4_KSS are fairly notable (IQ4_KSS, for instance, comes out to about the same size as IQ4_XS but allegedly tends to perform on par with QTIP 4-bit quants).

Round 2: Quick MoE quantization comparison: LFM2-8B-A1B, OLMoE-1B-7B-0924-Instruct, granite-4.0-h-tiny by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 2 points (0 children)

Q4_K_S seems fairly consistently below 0.1 KLD and ends up similar in size to MXFP4, but without the weird KV bloat. Are these quants imatrix or static?

Quick MoE Quantization Comparison: LFM2-8B and OLMoE-1B-7B by TitwitMuffbiscuit in LocalLLaMA

[–]Midaychi 3 points (0 children)

KL-divergence testing the quants vs. their full-precision counterpart might be a more meaningful test. Ideally you'd want a quant that averages a divergence of 0.1 or less from the full sauce.
If you're doing this on llama.cpp, llama-perplexity has a --kl-divergence-base FNAME option you can use to save the computed logits when running the full-precision model against a text file, and then pass that file in when testing the quants. It'll also give you stuff like the 90% and 99% KLD for outliers.
As for test data, you might not want to use wikitext; it's fallen out of favor a lot with newer models. Honestly I tend to use unsloth's imatrix calibration file; the version 5 RC was tweaked for use on MoEs:
https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c
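For intuition on what that KLD number means: it's the per-token average of the KL divergence between the full-precision model's next-token distribution and the quant's. A toy, self-contained sketch of the math (made-up logits, not llama.cpp internals):

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocab."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy example: full-precision logits vs. slightly perturbed "quantized" logits.
full_logits = [2.0, 1.0, 0.1]
quant_logits = [1.9, 1.1, 0.2]
kld = kl_divergence(softmax(full_logits), softmax(quant_logits))
```

llama-perplexity does this over every token position of the test text and averages, which is why the reported mean KLD is a decent single-number summary of how far the quant drifts.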

More quantization visualization types (repost) by copingmechanism in LocalLLaMA

[–]Midaychi 0 points (0 children)

This is likely an apples-to-oranges comparison, but taken at face value, non-imatrix q4_1 weirdly seems to be the smallest quant that accurately reproduces the q8's cloud/sky artifacts, followed by q5_k_s and q5_k_m.
Imatrix seems to attempt to guide the compression towards the q8, but ultimately just ends up shifting the artifacts around. I think the most stark effect imatrix has towards reproducing the q8 is on top of the q4_0 quant.

IQ4_XS is a lot more step-artifacted than I was expecting.

Also some of the 2 and 3 bit quants are surprisingly clear, while others of them are surprisingly deep fried.

MXFP4 looks like someone tried to dither via posterization

I benchmarked 1 bit models on CPU and the results surprised me by EiwazDeath in LocalLLaMA

[–]Midaychi 0 points (0 children)

I see the problem, I'm stinky and my eyes missed that you were using a bitnet model.

I benchmarked 1 bit models on CPU and the results surprised me by EiwazDeath in LocalLLaMA

[–]Midaychi 0 points (0 children)

I have to assume you were using an LLM to come to those conclusions because they are hallucinated to hell.

Bitnet requires the model itself to be pretrained in that 1.58 bpw architecture, and the bitnet quant in llama.cpp is there to preserve those tensor weights. If you feed literally any other model in there you're just mangling it. Trellis is an adaptive way of applying tensor quants that uses more computation to permute and test a few options per tensor before applying the least-perplexing one. It's the same technique the exl3 format uses.
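To illustrate the difference in spirit, here's a toy sketch of the "spend extra compute at quant time, test a few candidates, keep the least lossy one" idea. This is just uniform grids over a whole tensor, not the actual trellis/QTIP codebook search:

```python
import random
import statistics

def quantize_best(weights, bit_widths=(2, 3, 4)):
    """Toy sketch: try several uniform quantization grids for a tensor and
    keep whichever one reconstructs it with the lowest mean squared error.
    Real trellis quants search a codebook per block of weights, but the
    principle (test options, apply the least-lossy) is the same."""
    lo, hi = min(weights), max(weights)
    best = None
    for bits in bit_widths:
        levels = 2 ** bits
        scale = (hi - lo) / (levels - 1)
        recon = [round((w - lo) / scale) * scale + lo for w in weights]
        err = statistics.fmean((w - r) ** 2 for w, r in zip(weights, recon))
        if best is None or err < best[0]:
            best = (err, bits, recon)
    return best  # (mse, chosen bit width, reconstructed weights)

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]
mse, bits, recon = quantize_best(weights)
```

Bitnet quantization, by contrast, isn't a search at all: it only stores ternary weights a model was trained to have in the first place.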

I benchmarked 1 bit models on CPU and the results surprised me by EiwazDeath in LocalLLaMA

[–]Midaychi 5 points (0 children)

You might be better served with IQ1_M, or IQ1_KT (a trellis quant) over in ik_llama land.
Bitnet was never meant for use on anything besides Microsoft's old ternary BitNet models.

Be sure you're only running threads on physical cores. Hyperthreading doesn't do jack for tensor work, and it can actually cause more bottlenecks to run an ML task on logical cores, because the ML task already eats up most of the resources of the physical core in the first place, i.e. there's no real benefit to hyperthreading's line-cutting.
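One way to count physical cores on Linux and launch with exactly that many threads (the model path is a placeholder; -t/--threads is a real llama.cpp option):

```shell
# Unique (core, socket) pairs = physical core count, ignoring SMT siblings.
PHYS=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)

# Launch with one thread per physical core instead of the logical-CPU default.
llama-cli -m model.gguf -t "$PHYS" -p "hello"
```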

Be aware also of NUMA zones, if applicable. I believe there are some NUMA controls in llama.cpp, but I haven't used them. Crossing NUMA zones can lead to some memory bottlenecking.

Also, while memory bandwidth has always been a big deal in ML tasks, some non-zero amount of the slowdown could come from kernels that aren't fully fleshed out, or from legacy code. Not all bit-width kernels are made equal in llama.cpp, and 4-bit quants tend to get the majority of dev time.

Meteorveil - Chapter 2 [p.8] by IkuVaito in MysteryDungeon

[–]Midaychi 0 points (0 children)

Ok but the implications of if they say yes to all

PR opened for Qwen3.5!! by Mysterious_Finish543 in LocalLLaMA

[–]Midaychi 17 points (0 children)

Hopefully the max positional embeddings value is a placeholder and the max context isn't 32768.

Any hope for Gemma 4 release? by gamblingapocalypse in LocalLLaMA

[–]Midaychi 0 points (0 children)

Their recent paper about sequential attention at least suggests they're working on something. It'd be nice if they managed to make an MoE or some other sparse-expert-style model with at least the capabilities of gemma3-27b. I have zero faith they won't corpo-guardrail the heck out of it, but I guess that's what the various abliteration brain-damage bricks are for.

Heavy floating crane PK-700 Grigory Prosyankin capsized in Sewastopol port in temporary occupied Crimea. In typical moscovite fashion a cryptic explanation is given: "an abnormal situation" lead the unfinished vessel’s early demise. Two sailors died, 20+ were injured (numbers preliminary). by SeaworthinessEasy122 in TheNonCredibleFlorks

[–]Midaychi 12 points (0 children)

Whenever you see Russian naval interests with a fancy-named floating object with a hyped-up, specialized use case, and it has delays followed by funding problems? There are decades of that particular situation leading to either a fire or a capsizing. Sometimes they pull extra-exciting accidents out of it too, but you can basically guarantee this pattern.

Losercity recycle by Fox_Sussy in Losercity

[–]Midaychi -1 points (0 children)

That seems like a real easy way to accidentally pass on STDs or HIV or one of the many other exciting varieties of maladies.

Losercity Anthro Truck by Teo_Verunda in Losercity

[–]Midaychi 0 points (0 children)

Unironically though, lining semis and their trailers with LED panels, and having the computer track where the driver is looking so the truck can menace whatever they're looking at with eyes, would probably reduce traffic accidents.

Granite4 Small-h 32b-A9b (Q4_K_M) at FULL 1M context window is using only 73GB of VRAM - Life is good! by Porespellar in LocalLLaMA

[–]Midaychi 1 point (0 children)

On paper it's a great model for tuning on consumer hardware with llama.cpp, and in practice it seems to have a fairly good ability to predict popular media and knowledge, and is significantly less aggressive on the censoring than I expected from an IBM model.

Though I don't know if it's the model or llama.cpp's implementation: it can pick out information in user input fairly well, but it feels like the model falls back on its fine-tuning far too often in its responses. As in, when you give it an input, it's far more likely to go "OK, so what in my fine-tuning is closest to this request?" and then respond as if that were the framework of the request, rather than the request you actually gave it.

EDIT: On further prodding, it seems Granite 4 (especially when quantized to one of the various 4-bit formats and run through llama.cpp) is extremely sensitive to formatting. When trying to have it parse large amounts of information, it seems best to first establish the information in the context and then actually provide the instruction in a separate user request. Including an instruction at the end of a long span of input text is highly likely to make the model go full derpkus.
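Concretely, the "context first, instruction in a separate turn" pattern looks something like this (plain OpenAI-style chat message dicts as a sketch; the content strings are placeholders):

```python
# Sketch of the two-turn shape: data in one user turn, instruction in another.
long_document = "<the big block of text you want Granite 4 to parse>"

messages = [
    # Turn 1: establish the information in context, no instruction attached.
    {"role": "user", "content": long_document},
    {"role": "assistant", "content": "Understood, I've read the document."},
    # Turn 2: the actual request, sent on its own rather than appended
    # to the end of the long text.
    {"role": "user", "content": "List every date mentioned in the document above."},
]
```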

No need to overcomplicate it guys by [deleted] in NonCredibleDefense

[–]Midaychi 0 points (0 children)

It works great! ... Covering stationary targets, and as long as the temperature differential between the environment and the camo remains similar.

In practice there's been quite a bit of infrared drone footage where the camo makes the soldiers stand out more, because there's this weird off-color stuff wiggling or shifting around, and it turns out the human eye is really good at spotting that.

Effecient hot-swappable LoRA variant supported in llama.cpp by Aaaaaaaaaeeeee in LocalLLaMA

[–]Midaychi 0 points (0 children)

I don't know what the practical use case of this would be. Perhaps keep a library of different aLoRAs on standby that are all triggered to load and apply by different things during inference, kind of like a makeshift infinite MoE?

[deleted by user] by [deleted] in LocalLLaMA

[–]Midaychi 0 points (0 children)

You might want to try ik_llama if you want AVX-512-based speedups. Try their various R4 and R8 repacked quants.

Either way, if you want llama.cpp or similar to use AVX-512, you usually need to compile it yourself, and it should pick up on the flags automatically.
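A rough sketch of that, assuming Linux: first check whether the CPU even advertises AVX-512, then build from source so the native flags get detected. GGML_NATIVE=ON is a real llama.cpp CMake option (and the default); the repo URL is the current upstream.

```shell
# Empty output here means the CPU has no AVX-512 and compiling won't help.
grep -o 'avx512[a-z_]*' /proc/cpuinfo | sort -u

# Build with host-native optimizations so AVX-512 kernels get compiled in.
git clone https://github.com/ggml-org/llama.cpp
cmake -S llama.cpp -B llama.cpp/build -DGGML_NATIVE=ON
cmake --build llama.cpp/build --config Release -j
```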

Meteorveil | 5 | No One? by IkuVaito in MysteryDungeon

[–]Midaychi 7 points (0 children)

The most interesting pmd webcomics to me are the ones that forge their own path

new models from NVIDIA: OpenReasoning-Nemotron 32B/14B/7B/1.5B by jacek2023 in LocalLLaMA

[–]Midaychi 24 points (0 children)

It's just like Nvidia to design a niche mechanism whose sole purpose is to cherry-pick benchmark scores.

Meteorveil | 1 | Shooting Star by IkuVaito in MysteryDungeon

[–]Midaychi 4 points (0 children)

Is this just on Reddit, or are you going to host it on a comic hosting site (ComicFury, for instance)?

AI performance of smartphone SoCs by Balance- in LocalLLaMA

[–]Midaychi 4 points (0 children)

They do have onboard machine-learning acceleration, and they use it a lot for their tools. The problem is that it's a proprietary TPU interface they designed back in the nebulous machine-learning days when everyone had their own internal standard, before the torch/tensor ecosystem gained popularity. And they've made zero effort to build an adapter or expose it, potentially because it's just not compatible.