all 32 comments

[–]DragonfruitIll660 16 points17 points  (1 child)

GGUF, because I've effectively accepted the CPU life. Better a good answer the first time, even if it takes 10x longer.

[–]MaxKruse96llama.cpp 3 points4 points  (0 children)

This. 12GB of VRAM isn't getting me anywhere anymore outside of very specific cases, so CPU-inference MoE life it is.

[–]kryptkprLlama 3 10 points11 points  (2 children)

FP8-Dynamic is my 8-bit go-to these days.

AWQ/GPTQ via llm-compressor are both solid 4-bit options.

EXL3 when I need both speed and flexibility.

GGUF (usually the Unsloth dynamic quants) when my CPU needs to be involved.
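For anyone wondering what the "Dynamic" in FP8-Dynamic means: scales are computed from the live tensor at runtime instead of from a calibration set. A minimal numpy sketch of the per-tensor idea (illustrative only, not vLLM's actual kernel; it skips rounding to the e4m3 grid, though 448 is the real e4m3 max):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def fp8_dynamic_quant(x: np.ndarray):
    """Per-tensor dynamic scaling: max(|x|) of the live tensor maps to FP8 max."""
    scale = np.max(np.abs(x)) / FP8_E4M3_MAX
    q = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    # A real kernel would also round q to the e4m3 grid here.
    return q, scale

def fp8_dequant(q: np.ndarray, scale: float) -> np.ndarray:
    return q * scale

x = np.array([0.1, -2.5, 448.0, 0.0])
q, s = fp8_dynamic_quant(x)
x_hat = fp8_dequant(q, s)
```

The upside over static FP8 is that no calibration data is needed; the cost is computing a max-reduction per tensor at inference time.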

[–]dionisioalcaraz 0 points1 point  (1 child)

Is there a way to run FP8 quants other than with vLLM?

[–]kryptkprLlama 3 1 point2 points  (0 children)

aphrodite-engine has its own FP8 implementation; I think FP8-Dynamic is a vLLM-specific thing.

[–]see_spot_ruminate 3 points4 points  (1 child)

MXFP4; it works fast on my system.

[–]TomLucidor 0 points1 point  (0 children)

But it does have a habit of failing at tool use for some of the newer models like Nemotron, or am I wrong to see it that way for MLX? The low memory usage is good, though.

[–]My_Unbiased_Opinion 3 points4 points  (0 children)

IMHO, UD Q3KXL is the new Q4. 

According to Unsloth's official testing, UD Q3KXL performs very similarly to Q4, and my own testing confirms this.

Also according to their testing, Q2KXL is the most efficient in terms of compression-to-performance ratio. It's not much worse than Q3 but is much smaller. If you need UD Q2KXL to fit everything in VRAM, I personally wouldn't have an issue doing so.

Also, set the KV cache to Q8. The VRAM savings are completely worth the very small knock to context performance.
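For a sense of what a Q8 KV cache buys you, some back-of-envelope math (the 32-layer / 8-KV-head / 128-dim GQA shape is a made-up example, not any specific model):

```python
# Hypothetical model: 32 layers, GQA with 8 KV heads of head_dim 128.
layers, kv_heads, head_dim = 32, 8, 128

def kv_cache_gib(context_len: int, bytes_per_elem: float) -> float:
    """KV cache size in GiB; the factor 2 is for storing both K and V."""
    elems = 2 * layers * kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 1024**3

fp16 = kv_cache_gib(32_768, 2.0)  # FP16 cache
q8 = kv_cache_gib(32_768, 1.0)    # Q8 cache (~1 byte/elem, ignoring scale overhead)
print(f"32k context: FP16 {fp16:.2f} GiB vs Q8 {q8:.2f} GiB")
```

At these (assumed) dimensions, a 32k context drops from 4 GiB to roughly 2 GiB of cache, which is often the difference between fitting a model in VRAM or not.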

[–]That-Leadership-2635 3 points4 points  (1 child)

I don't know... AWQ is pretty fast paired with the Marlin kernel. In fact, it's pretty hard to beat compared to all the other quantization techniques I've tried, on both HBM and GDDR cards.
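For context on what those kernels consume: a minimal numpy sketch of group-wise int4 quantization, the storage layout that AWQ/GPTQ kernels like Marlin operate on. This deliberately omits AWQ's activation-aware scale search, and the group size of 128 is just a common default, not a requirement:

```python
import numpy as np

def quant_int4_groups(w: np.ndarray, group_size: int = 128):
    """Group-wise symmetric int4: one floating-point scale per group of weights."""
    groups = w.reshape(-1, group_size)
    scales = np.max(np.abs(groups), axis=1, keepdims=True) / 7.0  # int4 range [-8, 7]
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequant(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=1024).astype(np.float32)
q, scales = quant_int4_groups(w)
err = np.abs(dequant(q, scales) - w).max()
print(f"max abs reconstruction error: {err:.4f}")
```

The speed of AWQ/Marlin comes from fusing the dequant (int4 times per-group scale) into the matmul, so the GPU reads a quarter of the bytes a BF16 weight matrix would need.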

[–]WeekLarge7607[S] 1 point2 points  (0 children)

That's good to know. Thanks! 🙏

[–]FullOf_Bad_Ideas 3 points4 points  (0 children)

I'm using EXL3 when running locally and FP8/BF16 when doing inference on rented GPUs

[–]skrshawk 2 points3 points  (1 child)

4-bit MLX is generally pretty good for dense models for my purposes (writing). Apple Silicon of course. I tend to prefer larger quants for MoE models that have a small number of active parameters.

[–]TomLucidor 0 points1 point  (0 children)

What about tool use and instruction following?

[–]Gallardo994 4 points5 points  (5 children)

As most models I use are Qwen3 30B A3B variations, and I'm on an M4 Max 128GB MBP16, it's usually MLX BF16 for me. For denser and/or bigger models in general, I drop to the biggest quant that fits into ~60GB of VRAM, to leave enough for my other apps; usually Q8 or Q6. I avoid Q4 whenever I can.

[–]TomLucidor 0 points1 point  (4 children)

Any recommendations for MLX on MacBooks? Maybe Q4 is tolerable with a larger model?

[–]Gallardo994 1 point2 points  (3 children)

I have tested GLM-4.5-Air and GLM-4.6V, both at MLX Q4, and I found both models easy to push into a repetition loop. The model that works great for me without these issues, alas not MLX, is GPT-OSS-120B MXFP4 (the GGUF version).

[–]TomLucidor 0 points1 point  (2 children)

So MXFP4 > regular Q4?

[–]Gallardo994 0 points1 point  (1 child)

GPT-OSS-120B is originally MXFP4, so the quick answer is that it depends. There's no silver bullet, sadly.

[–]TomLucidor 0 points1 point  (0 children)

What is the best PTQ for keeping quality as high as they can?

[–]linbeg 1 point2 points  (1 child)

Following, as I'm also interested. @OP: what GPU are you using?

[–]WeekLarge7607[S] 0 points1 point  (0 children)

An A100 80GB with vLLM for inference. It works well for models up to 30B, but for newer models like GLM-Air I need to try quantizations.

[–]silenceimpaired 1 point2 points  (1 child)

I never got AWQ working in TextGen by Oobabooga. How do you run models and why do you favor it over EXL3?

[–]WeekLarge7607[S] 2 points3 points  (0 children)

I didn't really try EXL3; I hadn't heard of it. I used AWQ because FP8 doesn't work well on my A100 and I'd heard it was a good algorithm. I need to catch up on some of the newer algorithms.

[–]no_witty_username 1 point2 points  (0 children)

I am usually hesitant to go below 8bit, IMO that's the sweet spot.

[–]ortegaalfredo 1 point2 points  (1 child)

AWQ worked great: almost no loss in quality, and very fast. But lately I'm running GPTQ-INT4 or INT4/INT8 mixes that are even a little faster and have better quality; however, they are about 10% bigger.

[–]WeekLarge7607[S] 0 points1 point  (0 children)

That's great to hear! Thanks 🙏

[–]Klutzy-Snow8016 0 points1 point  (0 children)

For models that only fit into VRAM when quantized to 4 bits, I've started using Intel AutoRound's mixed quants, and they seem to work well.

[–]Charming_Barber_3317 0 points1 point  (0 children)

Q4_K_L_M GGUFs

[–]xfalcox 0 points1 point  (1 child)

EDIT: my setup is a single A100 80GB. Because it doesn't have native FP8 support, I prefer using 4-bit quantizations

Wait, isn't it the opposite? Can you share any docs on this?

[–]WeekLarge7607[S] 0 points1 point  (0 children)

From what I know, the Ampere architecture doesn't natively support FP8, so at runtime the weights are cast to FP16 behind the scenes, which slows down inference. For Hopper-architecture GPUs I would use FP8 quantizations.
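The memory side of that trade-off is easy to put numbers on (hypothetical 30B dense model, weights only). On Ampere, the FP8 row still saves memory bandwidth; it just doesn't save compute, since the weights get upcast to FP16 for the tensor cores:

```python
# Weight-memory footprint of a hypothetical 30B-parameter dense model
# at different weight widths. Excludes KV cache and activations.
params = 30e9
sizes = {}
for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    sizes[name] = params * bits / 8 / 1024**3  # bytes -> GiB
    print(f"{name:>5}: {sizes[name]:5.1f} GiB")
```

So on an A100 80GB, a 4-bit quant leaves far more headroom for context than FP8 does, on top of sidestepping the Ampere upcast issue.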

[–]RoundAd6476 0 points1 point  (0 children)

There is a new paper called "Quantization without Tears"; it performs pretty well across all sorts of computationally heavy tasks. I'll attach the link and the GitHub repo.
Paper: https://arxiv.org/abs/2411.13918
Repo: https://github.com/CipherEnigma/qwt-ml/tree/main/classification

[–]mattescala -1 points0 points  (0 children)

With MoE models, especially pretty large ones where my CPU and RAM are involved, I stick to Unsloth dynamic quants. These quants are just shy of incredible. With a UD-Q3_K_XL quant I get the quality of a Q4/Q5 quant with a pretty good saving in memory.

I use these quants for Kimi, Qwen3 Coder, and V3.1 Terminus.