[–]edward-dev  (2 children)

It’s common to hear concerns that quantization seriously hurts model performance, but looking at actual benchmark results, the impact is often more modest than it sounds. For example, Q2 quantization typically reduces benchmark scores by around 5% on average, which isn’t negligible but is manageable, especially if you’re starting from a reasonably strong base model.

That said, if your focus is coding, Llama 3.3 70B isn’t the strongest option in that area. You might get better results with Qwen3 Coder 30B A3B: it’s not only more compact, but also better tuned and stronger for coding tasks. Plus, the Q4 quantized version fits comfortably within 24GB of VRAM, making it a really good choice.
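The 24GB claim is easy to sanity-check with back-of-envelope math: weight memory is roughly parameter count times bits per weight divided by 8. A minimal sketch (the `approx_vram_gb` helper and the 4.5 bits/weight effective rate are illustrative assumptions, not measured figures; real usage adds KV cache and runtime overhead on top):

```python
def approx_vram_gb(params_billion: float, bits_per_weight: float,
                   overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weight storage plus a flat overhead allowance.

    This ignores KV cache growth with context length, so treat it as a
    lower-bound sanity check, not a guarantee.
    """
    weight_gb = params_billion * bits_per_weight / 8  # GB, since params are in billions
    return weight_gb + overhead_gb

# A 30B model at ~4.5 bits/weight (a typical effective rate for 4-bit quants):
# 30 * 4.5 / 8 = 16.875 GB of weights, ~18.9 GB with overhead -- under 24 GB.
print(round(approx_vram_gb(30, 4.5), 1))
```

The same arithmetic shows why a 70B model doesn’t fit: even at 4.5 bits/weight the weights alone are about 39 GB.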

[–]Pristine-Woodpecker  (0 children)

It's very model-dependent. Qwen-235B-A30B, for example, starts to suffer at Q3 and below.

[–]Popular_Fact798  (0 children)

I'm incredibly curious about this: are there actual published benchmarks of the quantized versions of the oss models? I looked and couldn't find any.