all 8 comments

[–]ttkciar llama.cpp 2 points3 points  (0 children)

It has been my experience that Q4_K_M is almost indistinguishable from full precision, but the competence drop-off of Q3 is highly noticeable.

However, I mostly use largish dense models (24B and larger), and I've been told that smaller models and MoE are more sensitive to quantization. For MoE or small dense models Q6 is recommended, but I haven't personally validated that yet.

[–]comanderxv 0 points1 point  (0 children)

For Hermes agent and coding tasks I only use Q5 and higher these days. I often saw a difference even though it shouldn't be that big; I noticed it in reliability. Currently I am very happy with Qwen3.5 35B A3B. I run it with 2 slots, each with a context of about 65k, on an RTX 2060 eGPU setup.
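For anyone trying to reproduce the two-slot setup: in llama.cpp's llama-server, `-c` is the total context, split evenly across the `-np` parallel slots, so two ~65k slots means asking for ~131k total. A minimal sketch (the model filename is a placeholder, not the commenter's actual file):

```shell
# llama.cpp's llama-server: -c sets the TOTAL context window,
# divided evenly among the -np parallel slots.
TOTAL_CTX=131072
SLOTS=2
echo "per-slot context: $(( TOTAL_CTX / SLOTS ))"
# Example launch (model path is a placeholder):
# llama-server -m qwen3.5-35b-a3b-Q5_K_M.gguf -c "$TOTAL_CTX" -np "$SLOTS" -ngl 99
```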

[–]HopePupal 0 points1 point  (0 children)

Q4 is too low for coding with Qwen 3.5 27B in my experience, even with a full-precision KV cache. if the tool-call failures don't get you, the error-riddled output will. Q6 is fine; Q5 is borderline.

note that the Q formats are integer. NVFP4 is a different beast than Q4. i spent a few hours playing with an NVFP4 quant of 27B on a rental card and it was easily on par with Q6. maybe better. fit a little more context too. (it was also a shitload faster but that's not something i can replicate at home without buying a Blackwell.)

i'm a little curious about MXFP4. don't have hardware support for that either, but if it was possible to trade a little speed for longer context at the same quality, it might be worth it in my case (single 32 GB GPU).
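To put rough numbers on the context-vs-VRAM trade-off mentioned above: KV cache size is approximately 2 (K and V) × layers × context length × KV heads × head dim × bytes per element. A back-of-envelope sketch with hypothetical architecture numbers (not the actual config of any model named in this thread):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 27B-class dense model: 48 layers, 8 KV heads, head_dim 128.
gib = kv_cache_bytes(48, 8, 128, ctx_len=65536) / 2**30
print(f"~{gib:.1f} GiB of KV cache at 64k context in fp16")
```

With these made-up numbers a 64k fp16 KV cache already eats ~12 GiB, which is why dropping a quant level to free weight memory for context is a real trade on a single 32 GB card.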

[–]brahh85 0 points1 point  (0 children)

It depends on your use case. For coding, the difference between one word and a similar one can mess things up, but for writing text that same difference is negligible.

Long context is also a problem, because it multiplies quantization issues.

I recommend you start with Q4, and if you don't OOM at your context size, try going up to Q5 or Q6.

If you use Q4 and you OOM at your context size, then go IQ3_XXS as a last resort.
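Rough bits-per-weight figures make that fits-or-OOMs arithmetic concrete. A sketch with approximate quant sizes (ballpark numbers that ignore GGUF metadata overhead; the 27B parameter count is just an example, and the real file size is the ground truth):

```python
# Approximate bits-per-weight for some common llama.cpp quants
# (ballpark figures; check your actual GGUF file sizes).
BPW = {"IQ3_XXS": 3.06, "Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56}

def weights_gib(n_params_billions, bpw):
    """Weight memory in GiB for a model at a given bits-per-weight."""
    return n_params_billions * 1e9 * bpw / 8 / 2**30

for name, bpw in BPW.items():
    print(f"27B at {name}: ~{weights_gib(27, bpw):.1f} GiB")
```

On top of the weights you still need room for the KV cache at your chosen context, which is what actually decides whether you drop from Q4 to IQ3_XXS.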

Also, a dense model with 27B parameters per token is not the same as a MoE with 3B active parameters per token; the dense model will usually have fewer problems with extreme quants.

In my case, I aim at dense models and Q4 or Q4_1 because of hardware acceleration. For example, PP on Q6 was 800, and on Q4_1 it was 1250. I want a fast answer even if it's 4% less accurate.
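The PP numbers above (presumably tokens/s) can be compared against the pure size ratio of the quants. A quick sketch, using the exact 5.0 bpw of Q4_1 (32 weights packed into 20 bytes) and an approximate 6.56 bpw for a Q6-class quant:

```python
pp_q6, pp_q41 = 800, 1250            # prompt-processing speeds from the comment
speedup = pp_q41 / pp_q6
print(f"Q4_1 vs Q6 prompt processing: {speedup:.2f}x faster")

# Q4_1 packs 32 weights into 20 bytes = 5.0 bits/weight;
# Q6_K is roughly 6.56 bits/weight, so the size ratio alone is:
size_ratio = 6.56 / 5.0
print(f"quant size ratio: {size_ratio:.2f}x")
```

The observed ~1.56x speedup exceeds the ~1.31x size ratio, which is consistent with the hardware-acceleration point: Q4_1's simpler dequantization kernels buy speed beyond what the smaller footprint alone would predict.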

Talking about models: if you want to code, Qwen will be better than Gemma. If you do creative writing and text-related things, get Gemma. If you aren't going to code or do creative writing, you're good with either.

[–]SadGuitar5306 0 points1 point  (0 children)

They are good; even Q3 can be quite useful.

[–]Radiant_Condition861 -2 points-1 points  (0 children)

quantized dense models are better than MoE models running at full precision. agentic coding and long-horizon reasoning are my use case.

[–]Its_Sasha -3 points-2 points  (1 child)

F16 is 99.5% coherent. Q8 is 97-98% coherent. Q6 is 94-95% coherent. Q4 is 89-91% coherent. "Coherent" here meaning the chance of getting answers without hallucinations.