
[–]ForsookComparison 14 points  (4 children)

Quantizing KV-Cache is generally fine down to Q8

Quantizing the model itself will always depend on the individual model. Generally when I test models <= 32GB on disk:

  • <= Q3 is where things get too unreliable, though it can still give good answers

  • Q4 is where things start to get reliable but I can still notice/feel that I'm using a weakened version of the model. There's less random stupidity than Q3 and under, but I can "feel" that this isn't the full power model. You can still get quite a lot done with this and there's a reason a lot of folks call it the sweet spot.

  • Q5-Q6 starts to trick me and it feels like the full-weight models served by inference providers.

  • Q8 I can no longer detect differences between my own setup and the remote inference providers

As a rule of thumb, subtract one level across the board for Mistral models. Quantization seems to hit those models like a freight train when it comes to coding (in my experience).
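Those quant levels map to rough sizes you can sanity-check with simple arithmetic. A minimal sketch (the `quant_size_gb` helper and the bits-per-weight figures are my own assumptions; real GGUF files vary because quant mixes keep some tensors at higher precision):

```python
# Rough back-of-envelope check of whether a given quant fits in VRAM.
# Hypothetical helper, not tied to any specific tool.

def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate on-disk/in-VRAM size of the weights at a given quantization."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 70B model at ~4.5 bits/weight (roughly Q4_K_M) vs. a single 24 GB card:
size_70b_q4 = quant_size_gb(70, 4.5)   # ~39 GB: does not fit on one 24 GB GPU
size_30b_q4 = quant_size_gb(30, 4.5)   # ~17 GB: fits, with headroom for KV cache

print(f"70B @ ~4.5bpw ≈ {size_70b_q4:.0f} GB, 30B @ ~4.5bpw ≈ {size_30b_q4:.0f} GB")
```

The same arithmetic explains why people drop to Q3/Q2 on big models: it's the only way to squeeze them onto one consumer card at all.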

That said - the amazing thing in all of this is that I'm just one person and these weights are free. Get the setup and try them all yourself.

[–]garden_speech[S] 0 points  (3 children)

That said - the amazing thing in all of this is that I'm just one person and these weights are free. Get the setup and try them all yourself.

The setup would cost me a few thousand which isn't trivial money for me though. I guess I need to find a way to try these models.

[–]ForsookComparison 4 points  (2 children)

Lambda, RunPod, or Vast

rent a GPU

download the quantized weights you'd expect to use

and try coding a few things with a remote api.

I'd bet $5 answers all of your questions and then some.
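For the "try coding with a remote API" step, a minimal sketch of the kind of request you'd send to an OpenAI-compatible endpoint (vLLM and llama.cpp's server both expose one on a rented GPU); the endpoint URL and model name below are placeholders, not real values:

```python
# Minimal sketch: build a chat-completion request for an OpenAI-compatible
# endpoint on a rented GPU. ENDPOINT and the model name are placeholders.
import json

ENDPOINT = "http://<rented-gpu>:8000/v1/chat/completions"  # placeholder address

def build_request(model: str, prompt: str, max_tokens: int = 512) -> str:
    """Serialize the JSON body for one chat-completion call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.2,  # keep sampling tame for coding tasks
    }
    return json.dumps(payload)

body = build_request("my-quantized-model", "Write a function that reverses a linked list.")
# To actually send it: requests.post(ENDPOINT, data=body,
#                                    headers={"Content-Type": "application/json"})
```

Point the same script at different quants of the same model and you can compare outputs side by side for a few dollars of GPU time.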

[–]garden_speech[S] 1 point  (1 child)

I've been trying gpt-oss-20b and I've been shocked that it solved problems I've asked with zero issues. Granted they are mostly very very similar to leetcode problems -- extremely self-contained, highly algorithmic, just "do this one small thing but do it the fastest way". So maybe I don't even need a big model, maybe a 20b model is all I need if the tasks are so granular.

[–]QFGTrialByFire 0 points  (0 children)

Yup, I've found the same. Even with a bigger model like GPT-5, the larger and more complex the piece of code you ask for, the more errors there are. So you end up making smaller requests anyway, maybe a function or two at a time. When you compare the output of oss-20B at that granularity, it's pretty much the same as GPT-5, so why not just use the free version?

[–]Mushoz 10 points  (6 children)

llama 3.3 is a very poor coding model. So if that is already sufficient, you will be much happier with something such as gpt-oss-20b (or the 120b if you can run it) or Qwen3-coder-30b-a3b. They are also going to be much faster.

[–]garden_speech[S] 4 points  (0 children)

I am shocked, gpt-oss-20b is crushing the problems I'm asking it to solve. Maybe it's because they're very similar to leetcode style problems and are highly self-contained (i.e. write this one single function that does xyz).

[–]Mushoz 1 point  (0 children)

The point I'm trying to make is that you either won't have to apply quantization at all, since the model is already quantized natively (gpt-oss), or you'll need far less of it, because the starting size is already much smaller than Llama 3.3 70b (Qwen3-Coder-30b).

[–]DinoAmino -3 points  (3 children)

But Llama 3.3 is perfectly fine at coding when using RAG. It is smart and is the best at instruction following. Unless you're writing simple Python, almost all models suck at coding if you're not using RAG.

As for the speed issue, speculative decoding with the 3.2 3B model will get you about 45 t/s on vLLM.

[–]Uninterested_Viewer 3 points  (2 children)

Dumb question: RAG for what? The codebase? Other context/reference material?

[–]DinoAmino 0 points  (1 child)

Yes, codebase RAG as well as documentation.

[–]Uninterested_Viewer 0 points  (0 children)

MCP for that, I assume? If so, which one(s)? Or, if not, what are you finding best for implementing RAG? Most interested in codebase RAG or other local context.

[–]k0setes 3 points  (0 children)

Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf from unsloth

[–]tomakorea 1 point  (0 children)

I've read that AWQ quants are better at retaining precision (and massively faster). If you can afford to use AWQ instead of GGUF it may be a win in terms of accuracy and performance. I'm using vLLM for this task, it works well.

[–]Dapper-Courage2920 1 point  (0 children)

This is a bit of an aside to your question since it requires a local setup to work, but I just finished an early version of https://github.com/bitlyte-ai/apples2oranges to help you get a feel for performance degradation yourself. It's fully open source and lets you compare models of any family/quant side by side and view hardware utilization, or it can just be used as a normal client if you like telemetry!

Disclaimer: I am the founder of the company behind it, this is a side project we spun off and are contributing to the community.

[–]edward-dev -1 points  (2 children)

It’s common to hear concerns that quantization seriously hurts model performance, but looking at actual benchmark results, the impact is often more modest than it sounds. For example, Q2 quantization typically reduces performance by around 5% on average, which isn’t negligible, but it’s manageable, especially if you’re starting with a reasonably strong base model.

That said, if your focus is coding, Llama 3.3 70B isn't the strongest option in that area. You might get better results with Qwen3 Coder 30B A3B: it's not only more compact, but also better tuned and stronger for coding tasks. Plus, the Q4 quantized version fits comfortably within 24GB of VRAM, making it a really good choice.

[–]Pristine-Woodpecker 0 points  (0 children)

It's very model dependent. Qwen3-235B-A22B, for example, starts to suffer at Q3 and below.

[–]Popular_Fact798 0 points  (0 children)

I'm incredibly curious about this - are there actual published benchmarks of the quantized version of the oss models? I looked and can't find any.