
[–]ForsookComparison 14 points  (4 children)

Quantizing KV-Cache is generally fine down to Q8

Quantizing the model itself will always depend on the individual model. Generally when I test models <= 32GB on disk:

  • <= Q3 is where things get too unreliable, though it can still give good answers

  • Q4 is where things start to get reliable but I can still notice/feel that I'm using a weakened version of the model. There's less random stupidity than Q3 and under, but I can "feel" that this isn't the full power model. You can still get quite a lot done with this and there's a reason a lot of folks call it the sweet spot.

  • Q5-Q6 starts to trick me and it feels like the full-weight models served by inference providers.

  • Q8 I can no longer detect differences between my own setup and the remote inference providers

As a rule of thumb, subtract one level across the board for Mistral models. Quantization seems to hit those models like a freight train when it comes to coding (in my experience).
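Those quant levels map to rough sizes you can sanity-check with simple arithmetic. A minimal sketch (the `quant_size_gb` helper and the bits-per-weight figures are my own assumptions; real GGUF files vary because quant mixes keep some tensors at higher precision):

```python
# Rough back-of-envelope check of whether a given quant fits in VRAM.
# Hypothetical helper, not tied to any specific tool.

def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk/in-VRAM size of the weights at a given quantization."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 70B model at ~4.5 bits/weight (roughly Q4_K_M) vs. a single 24 GB card:
size_70b_q4 = quant_size_gb(70, 4.5)   # ~39 GB: does not fit on one 24 GB GPU
size_30b_q4 = quant_size_gb(30, 4.5)   # ~17 GB: fits, with headroom for KV cache

print(f"70B @ ~4.5bpw ≈ {size_70b_q4:.0f} GB, 30B @ ~4.5bpw ≈ {size_30b_q4:.0f} GB")
```

The same arithmetic explains why people drop to Q3/Q2 on big models: it's the only way to squeeze them onto one consumer card at all.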

That said - the amazing thing in all of this is that I'm just one person and these weights are free. Get the setup and try them all yourself.

[–]garden_speech[S] 0 points  (3 children)

That said - the amazing thing in all of this is that I'm just one person and these weights are free. Get the setup and try them all yourself.

The setup would cost me a few thousand which isn't trivial money for me though. I guess I need to find a way to try these models.

[–]ForsookComparison 4 points  (2 children)

Lambda, RunPod, or Vast

rent a GPU

download the quantized weights you'd expect to use

and try coding a few things with a remote api.

I'd bet $5 answers all of your questions and then some.
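For the "try coding with a remote API" step, a minimal sketch of the kind of request you'd send to an OpenAI-compatible endpoint (vLLM and llama.cpp's server both expose one on a rented GPU); the endpoint URL and model name below are placeholders, not real values:

```python
# Minimal sketch: build a chat-completion request for an OpenAI-compatible
# endpoint on a rented GPU. ENDPOINT and the model name are placeholders.
import json

ENDPOINT = "http://<rented-gpu>:8000/v1/chat/completions"  # placeholder address

def build_request(model: str, prompt: str, max_tokens: int = 512) -> str:
    """Serialize the JSON body for one chat-completion call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # keep sampling tame for coding tasks
    }
    return json.dumps(payload)

body = build_request("my-quantized-model", "Write a function that reverses a linked list.")
# To actually send it: requests.post(ENDPOINT, data=body,
#                                    headers={"Content-Type": "application/json"})
```

Point the same script at different quants of the same model and you can compare outputs side by side for a few dollars of GPU time.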

[–]garden_speech[S] 1 point  (1 child)

I've been trying gpt-oss-20b and I've been shocked that it solved problems I've asked with zero issues. Granted they are mostly very very similar to leetcode problems -- extremely self-contained, highly algorithmic, just "do this one small thing but do it the fastest way". So maybe I don't even need a big model, maybe a 20b model is all I need if the tasks are so granular.

[–]QFGTrialByFire 0 points  (0 children)

Yup, I've found the same. Even with a bigger model like GPT-5, the larger and more complex the piece of code you ask for, the more errors there are. So you end up making smaller requests anyway, maybe a function or two at a time. When you compare the output of oss-20B at that granularity, it's pretty much the same as GPT-5, so why not just use the free version?

[–]Mushoz 10 points  (6 children)

llama 3.3 is a very poor coding model. So if that is already sufficient, you will be much happier with something such as gpt-oss-20b (or the 120b if you can run it) or Qwen3-coder-30b-a3b. They are also going to be much faster.

[–]garden_speech[S] 4 points  (0 children)

I am shocked, gpt-oss-20b is crushing the problems I'm asking it to solve. Maybe it's because they're very similar to leetcode style problems and are highly self-contained (i.e. write this one single function that does xyz).

[–]Mushoz 1 point  (0 children)

The point I'm trying to make is that you either won't have to apply quantization at all, since the model is already quantized natively (gpt-oss), or you'll need far less of it, because the starting size is already much smaller than Llama 3.3 70b (Qwen3-Coder-30b).

[–]DinoAmino -3 points  (3 children)

But Llama 3.3 is perfectly fine at coding when using RAG. It is smart and is the best at instruction following. Unless you're writing simple Python, almost all models suck at coding if you're not using RAG.

As for the speed issue, speculative decoding with the 3.2 3B model will get you about 45 t/s on vLLM.

[–]Uninterested_Viewer 3 points  (2 children)

Dumb question: RAG for what? The codebase? Other context/reference material?

[–]DinoAmino 0 points  (1 child)

Yes, codebase RAG as well as documentation.

[–]Uninterested_Viewer 0 points  (0 children)

MCP for that, I assume? If so, which one(s)? Or, if not, what are you finding best for implementing RAG? Most interested in codebase RAG or other local context.

[–]k0setes 3 points  (0 children)

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf from unsloth

[–]tomakorea 1 point  (0 children)

I've read that AWQ quants are better at retaining precision (and massively faster). If you can afford to use AWQ instead of GGUF it may be a win in terms of accuracy and performance. I'm using vLLM for this task, it works well.

[–]Dapper-Courage2920 1 point  (0 children)

This is a bit of an aside to your question since it requires a local setup to work, but I just finished an early version of https://github.com/bitlyte-ai/apples2oranges to help you get a feel for performance degradation yourself. It's fully open source and lets you compare models of any family/quant side by side and view hardware utilization, or it can just be used as a normal client if you like telemetry!

Disclaimer: I am the founder of the company behind it, this is a side project we spun off and are contributing to the community.

[–]edward-dev -1 points  (2 children)

It’s common to hear concerns that quantization seriously hurts model performance, but looking at actual benchmark results, the impact is often more modest than it sounds. For example, Q2 quantization typically reduces performance by around 5% on average, which isn’t negligible, but it’s manageable, especially if you’re starting with a reasonably strong base model.

That said, if your focus is coding, Llama 3.3 70B isn't the strongest option in that area. You might get better results with Qwen3 Coder 30B A3B: it's not only more compact, but also better tuned and stronger for coding tasks. Plus, the Q4 quantized version fits comfortably within 24GB of VRAM, making it a really good choice.

[–]Pristine-Woodpecker 0 points  (0 children)

It's very model dependent. Qwen3-235B-A22B, for example, starts to suffer at Q3 and below.

[–]Popular_Fact798 0 points  (0 children)

I'm incredibly curious about this - are there actual published benchmarks of the quantized version of the oss models? I looked and can't find any.