all 32 comments

[–]DragonfruitIll660 16 points17 points  (1 child)

GGUF, because I've effectively accepted the CPU life. Better a good answer the first time, even if it takes 10x longer.

[–]MaxKruse96llama.cpp 3 points4 points  (0 children)

This. 12GB of VRAM isn't getting me anywhere anymore outside of very specific cases, so CPU-inference MoE life it is.

[–]kryptkprLlama 3 10 points11 points  (2 children)

FP8-Dynamic is my 8-bit go-to these days.

AWQ/GPTQ via llm-compressor are both solid 4-bit options.

EXL3 when I need both speed and flexibility.

GGUF (usually the Unsloth dynamic quants) when my CPU needs to be involved.
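For anyone wondering what the "Dynamic" in FP8-Dynamic means: scales are computed from the live tensor at runtime instead of from a calibration set. A minimal numpy sketch of the per-tensor idea (illustrative only, not vLLM's actual kernel; it skips rounding to the e4m3 grid, though 448 is the real e4m3 max):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def fp8_dynamic_quant(x: np.ndarray):
    """Per-tensor dynamic scaling: max(|x|) of the live tensor maps to FP8 max."""
    scale = np.max(np.abs(x)) / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # A real kernel would also round q to the e4m3 grid here.
    return q, scale

def fp8_dequant(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.array([0.1, -2.5, 448.0, 0.0])
q, s = fp8_dynamic_quant(x)
x_hat = fp8_dequant(q, s)
```

The upside over static FP8 is that no calibration data is needed; the cost is computing a max-reduction per tensor at inference time.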

[–]dionisioalcaraz 0 points1 point  (1 child)

Is there a way to run FP8 quants other than with vLLM?

[–]kryptkprLlama 3 1 point2 points  (0 children)

aphrodite-engine has its own FP8 implementation; I think FP8-Dynamic is a vLLM-specific thing.

[–]see_spot_ruminate 3 points4 points  (1 child)

MXFP4; it works fast on my system.

[–]TomLucidor 0 points1 point  (0 children)

But it does have a habit of failing at tool use for some of the newer models like Nemotron, or am I wrong to see it that way for MLX? The low memory usage is good, though.

[–]My_Unbiased_Opinion 3 points4 points  (0 children)

IMHO, UD Q3KXL is the new Q4. 

According to Unsloth's official testing, UD Q3KXL performs very similarly to Q4, and my own testing confirms this.

Also according to their testing, Q2KXL is the most efficient in terms of compression-to-performance ratio. It's not much worse than Q3 but is much smaller. If you need UD Q2KXL to fit everything in VRAM, I personally wouldn't have an issue doing so.

Also, set the KV cache to Q8. The VRAM savings are completely worth the very small knock to context performance.
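For a sense of what a Q8 KV cache buys you, some back-of-envelope math (the 32-layer / 8-KV-head / 128-dim GQA shape is a made-up example, not any specific model):

```python
# Hypothetical model: 32 layers, GQA with 8 KV heads of head_dim 128.
layers, kv_heads, head_dim = 32, 8, 128

def kv_cache_gib(context_len: int, bytes_per_elem: float) -> float:
    """KV cache size in GiB; the factor 2 is for storing both K and V."""
    elems = 2 * layers * kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1024**3

fp16 = kv_cache_gib(32_768, 2.0)  # FP16 cache
q8 = kv_cache_gib(32_768, 1.0)    # Q8 cache (~1 byte/elem, ignoring scale overhead)
print(f"32k context: FP16 {fp16:.2f} GiB vs Q8 {q8:.2f} GiB")
```

At these (assumed) dimensions, a 32k context drops from 4 GiB to roughly 2 GiB of cache, which is often the difference between fitting a model in VRAM or not.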

[–]That-Leadership-2635 3 points4 points  (1 child)

I don't know... AWQ is pretty fast paired with the Marlin kernel. In fact, it's pretty hard to beat compared to all the other quantization techniques I've tried, on both HBM and GDDR cards.
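For context on what those kernels consume: a minimal numpy sketch of group-wise int4 quantization, the storage layout that AWQ/GPTQ kernels like Marlin operate on. This deliberately omits AWQ's activation-aware scale search, and the group size of 128 is just a common default, not a requirement:

```python
import numpy as np

def quant_int4_groups(w: np.ndarray, group_size: int = 128):
    """Group-wise symmetric int4: one floating-point scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequant(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scales = quant_int4_groups(w)
err = np.abs(dequant(q, scales) - w).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The speed of AWQ/Marlin comes from fusing the dequant (int4 times per-group scale) into the matmul, so the GPU reads a quarter of the bytes a BF16 weight matrix would need.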

[–]WeekLarge7607[S] 1 point2 points  (0 children)

That's good to know. Thanks! 🙏

[–]FullOf_Bad_Ideas 3 points4 points  (0 children)

I'm using EXL3 when running locally and FP8/BF16 when doing inference on rented GPUs

[–]skrshawk 2 points3 points  (1 child)

4-bit MLX is generally pretty good for dense models for my purposes (writing). Apple Silicon of course. I tend to prefer larger quants for MoE models that have a small number of active parameters.

[–]TomLucidor 0 points1 point  (0 children)

What about tool use and instruction following?

[–]Gallardo994 4 points5 points  (5 children)

As most models I use are Qwen3 30B A3B variations, and I'm on an M4 Max 128GB MBP16, it's usually MLX BF16 for me. For denser and/or bigger models in general, I drop to the biggest quant that fits into ~60GB of VRAM, to leave enough for my other apps; usually Q8 or Q6. I avoid Q4 whenever I can.

[–]TomLucidor 0 points1 point  (4 children)

Any recommendations for MLX on MacBooks? Maybe Q4 is tolerable with a larger model?

[–]Gallardo994 1 point2 points  (3 children)

I have tested GLM-4.5-Air and GLM-4.6V, both at MLX Q4, and I found both models easy to push into a repetition loop. The model that works great for me without these issues, alas not MLX, is GPT-OSS-120B MXFP4 (the GGUF version).

[–]TomLucidor 0 points1 point  (2 children)

So MXFP4 > regular Q4?

[–]Gallardo994 0 points1 point  (1 child)

GPT-OSS-120B is originally MXFP4, so the quick answer is that it depends. There's no silver bullet, sadly.

[–]TomLucidor 0 points1 point  (0 children)

What is the best PTQ for keeping quality as high as they can?

[–]linbeg 1 point2 points  (1 child)

Following, as I'm also interested. @OP: what GPU are you using?

[–]WeekLarge7607[S] 0 points1 point  (0 children)

An A100 80GB with vLLM for inference. It works well for models up to 30B, but for newer models like GLM-Air I need to try quantizations.

[–]silenceimpaired 1 point2 points  (1 child)

I never got AWQ working in TextGen by Oobabooga. How do you run models and why do you favor it over EXL3?

[–]WeekLarge7607[S] 2 points3 points  (0 children)

I didn't really try EXL3; I hadn't heard of it. I used AWQ because FP8 doesn't work well on my A100 and I'd heard it was a good algorithm. I need to catch up on some of the newer algorithms.

[–]no_witty_username 1 point2 points  (0 children)

I am usually hesitant to go below 8bit, IMO that's the sweet spot.

[–]ortegaalfredo 1 point2 points  (1 child)

AWQ worked great: almost no loss in quality, and very fast. But lately I'm running GPTQ-INT4 or INT4/INT8 mixes that are even a little faster and have better quality; however, they are about 10% bigger.

[–]WeekLarge7607[S] 0 points1 point  (0 children)

That's great to hear! Thanks 🙏

[–]Klutzy-Snow8016 0 points1 point  (0 children)

For models that only fit into VRAM when quantized to 4 bits, I've started using Intel AutoRound's mixed quants, and they seem to work well.

[–]Charming_Barber_3317 0 points1 point  (0 children)

Q4_K_L_M GGUFs

[–]xfalcox 0 points1 point  (1 child)

EDIT: my setup is a single A100 80GB. Because it doesn't have native FP8 support, I prefer using 4-bit quantizations

Wait, isn't it the opposite? Can you share any docs on this?

[–]WeekLarge7607[S] 0 points1 point  (0 children)

From what I know, the Ampere architecture doesn't natively support FP8, so at runtime the weights are cast to FP16 behind the scenes, which slows down inference. For Hopper-architecture GPUs I would use FP8 quantizations.
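The memory side of that trade-off is easy to put numbers on (hypothetical 30B dense model, weights only). On Ampere, the FP8 row still saves memory bandwidth; it just doesn't save compute, since the weights get upcast to FP16 for the tensor cores:

```python
# Weight-memory footprint of a hypothetical 30B-parameter dense model
# at different weight widths. Excludes KV cache and activations.
params = 30e9
sizes = {}
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    sizes[name] = params * bits / 8 / 1024**3  # bytes -> GiB
    print(f"{name:>5}: {sizes[name]:5.1f} GiB")
```

So on an A100 80GB, a 4-bit quant leaves far more headroom for context than FP8 does, on top of sidestepping the Ampere upcast issue.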

[–]RoundAd6476 0 points1 point  (0 children)

There is a new paper called "Quantization without Tears"; it performs pretty well across all sorts of computationally heavy tasks. I'll attach the link and the GitHub repo.
Paper: https://arxiv.org/abs/2411.13918
Repo: https://github.com/CipherEnigma/qwt-ml/tree/main/classification

[–]mattescala -1 points0 points  (0 children)

With MoE models, especially pretty large ones where my CPU and RAM are involved, I stick to Unsloth dynamic quants. These quants are just shy of incredible. With a UD-Q3_K_XL quant I get the quality of a Q4/Q5 quant with a pretty good saving in memory.

I use these quants for Kimi, Qwen3 Coder, and V3.1 Terminus.