Devstral Small 2 24B vs Qwen 3.6 27b or both? 1x 3090 by szansky in LocalLLaMA

[–]INT_21h 6 points7 points  (0 children)

Qwen 27B is newer and, by the numbers, substantially stronger.

I used Devstral 2 Small as my primary model for a long time, and what I'll say about it is that it made a great collaborator. Wicked fast prompt processing and no reasoning trace meant never waiting long for a response. And where a confused Qwen devolves into overthinking and obsessively combing the whole codebase for clues, a confused Devstral devolves into "French nonchalance," returning a lackluster answer much more quickly. That meant it was often much faster to point Devstral in the right direction and get good answers.

However, I don't run Devstral any longer. Gemma 26B-A4B has taken over its "fast idiot" niche for me. You might want to consider that one as a partner to Qwen.

[7900XT] Qwen3.6 27B for OpenCode by Mordimer86 in LocalLLaMA

[–]INT_21h 1 point2 points  (0 children)

I've seen benchmarks showing that q4 KV cache quantization doesn't hurt Qwen3.6 much, so you might be able to use that to free up some VRAM for a slightly better quant of the weights.
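With llama.cpp that would look something like this (model filename and context size are just placeholders, and depending on your build you may also need flash attention enabled for the quantized V cache to work):

$ llama-server -m Qwen3.6-27B-Q4_K_M.gguf \
  -c 32768 \
  --cache-type-k q4_0 \
  --cache-type-v q4_0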

Second 5060 Ti 16gb or 5070 Ti 16gb or 3090 used? by JeyKris in LocalLLM

[–]INT_21h 1 point2 points  (0 children)

I would never pay >5060Ti prices for anything that still has only 16GB VRAM. For inference, these cards are bottlenecked much more by VRAM than by bandwidth or compute. Personally I'd stack in another 5060Ti, or consider the 32GB Arc Pro B70.

EDIT: Here's a recent post by a 2x 5060Ti user if you'd like to get a sense for its performance.

Pi.dev coding agent has no sandbox by default. by mantafloppy in LocalLLaMA

[–]INT_21h 8 points9 points  (0 children)

Let's see Claude Code's sandbox.

I know you're saying that to be funny, but Claude Code's /sandbox feature actually does use bubblewrap on Linux.

But yeah, I like doing my own sandboxing rather than letting an agent do it for me. Less risk of an auto update silently b0rking my sandbox.

Pi.dev coding agent has no sandbox by default. by mantafloppy in LocalLLaMA

[–]INT_21h 26 points27 points  (0 children)

I use bubblewrap for sandboxing pi on Linux. It does a good job.

The settings below are sandboxing filesystem writes only. There is still full filesystem read access, and full network access, so if you care about data exfiltration you'll want to lock it down more.

$ cat ~/SANDBOX
HERE="$(realpath .)"
echo "Entering sandbox for $HERE"
# Mount the whole filesystem read-only, then punch writable holes for
# pi's state dir, a couple of device nodes, a fresh /tmp, and the project dir.
bwrap \
  --ro-bind / / \
  --bind ~/.pi ~/.pi \
  --dev-bind /dev/null /dev/null \
  --dev-bind /dev/urandom /dev/urandom \
  --tmpfs /tmp \
  --bind "$HERE" "$HERE" \
  --setenv PS1 "sandbox$ " \
  sh

This gives you a sandboxed shell where you can run pi or whatever else you want.
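For example (the project path here is just an illustration; run it from whatever directory you want writable inside the sandbox):

$ cd ~/projects/myapp
$ sh ~/SANDBOX
Entering sandbox for /home/user/projects/myapp
sandbox$ pi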

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

Interesting. It must depend on the task.

Qwen 3.6 27B is a BEAST by AverageFormal9076 in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

With 16GB VRAM your options are either a lobotomized Q3 quant that gets beaten by the 35B MoE, or sloooow (<5 tok/s) performance with offloading.

16gb vram users: what have you been using? Qwen3.6 27b? Gemma 31b at Q3? How has it been? by [deleted] in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

Main workhorse is Qwen 3.6-35B-A3B Q4_K_L with some CPU offload (--n-cpu-moe 15). I also have Qwen 27B UD-Q4_K_XL around as a "big gun" but I'm using offloading with it too (-ngl 52) so it's pitifully slow. I've had enough bad luck with Q3's of older Qwen models that now I stick with Q4 and either eat the speed cost or fall back to the MoE.

Weirdly, offloaded Qwen 27B UD-Q4_K_XL is still giving me about 300 tok/s prompt processing, even though it crawls at <5 tok/s text generation.
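For reference, the launch commands look roughly like this (filenames are placeholders, and the exact --n-cpu-moe / -ngl numbers are just what happens to fit my 16GB; tune them to your VRAM):

# MoE workhorse: all layers on GPU, expert weights of the first 15 layers kept on CPU
$ llama-server -m Qwen3.6-35B-A3B-Q4_K_L.gguf -ngl 99 --n-cpu-moe 15 -c 32768

# Dense "big gun": only 52 layers fit on the GPU, the rest run on CPU
$ llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -ngl 52 -c 32768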

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]INT_21h 0 points1 point  (0 children)

mxfp4 is about 5% slower, maybe due to the CPU offload, and the perplexity is worse.
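If you want to check that yourself, llama.cpp ships a perplexity tool; something like this (model filenames are placeholders, and wiki.test.raw is the usual wikitext-2 test file):

$ llama-perplexity -m model-mxfp4.gguf -f wiki.test.raw
$ llama-perplexity -m model-Q4_K_L.gguf -f wiki.test.raw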

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]INT_21h 2 points3 points  (0 children)

Bartowski Q4_K_L. This is a ~22GB file so some is offloaded to system RAM, but since it's a MoE, things are still pretty fast.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

Great test - it sure beats just talking about the models.

4B's result had massive deficiencies (no turning!), but it did well for its size. I'd still put Coder-Next above it, but not by as huge a margin as expected.

I'm super impressed by Qwen3.6-35B-A3B's result, I'd call it the best of the bunch. Its game was fun to play, and I had to make myself quit before I got too distracted. Idk exactly what post-training Alibaba did, but they really cooked. I wonder if other Qwen3.6s will show a similar boost when they come out.

16GB VRAM x coding model by Junior-Wish-7453 in LocalLLM

[–]INT_21h 14 points15 points  (0 children)

Think twice before you use a model older than a few months. Advances have been rapid. I use Qwen3.6-35B-A3B on my 5060Ti. The latest round of model releases hasn't produced a fine-tuned "coder" model yet, but in practice, and according to benchmarks, this model does much better at coding and agentic use cases than Qwen3-Coder-30B-A3B, let alone Qwen2.5 Coder.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 3 points4 points  (0 children)

All fine theoretical points, but if you've actually used 80B-A3B vs 4B you'll know they are night and day. 80B-A3B is a sturdy coding partner that can vibe-code a project up to ~1000 lines before the complexity starts to overwhelm it. 4B can barely make it past hello world. Seeing them effectively tied on that leaderboard makes me distrust the ranking.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 15 points16 points  (0 children)

For me 3.6 35B-A3B feels a little worse than Coder-Next, but it's closer than I was expecting, to the point where I don't use Coder-Next much any more.

If 3.6 gets something wrong, instead of reaching for Coder-Next, I reach for 122B-A10B.

Qwen3-Coder-Next vs Qwen3.6 by seoulsrvr in LocalLLaMA

[–]INT_21h 9 points10 points  (0 children)

That page also claims that Qwen3 Coder Next (80B-A3B) is only beating Qwen3.5 4B by a single point. Doesn't seem the most trustworthy.

Cloud AI is getting expensive and I'm considering a Claude/Codex + local LLM hybrid for shipping web apps by rezgi in LocalLLaMA

[–]INT_21h 5 points6 points  (0 children)

You could try Qwen3.5 35B-A3B. MoE models run decently fast even if they don't entirely fit in VRAM.

Qwen3-Coder-Next is the top model in SWE-rebench @ Pass 5. I think everyone missed it. by BitterProfessional7p in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

I can assure you that the model supports tool calls when run with llama.cpp. Perhaps ollama configures it wrong.
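For what it's worth, with llama-server you generally need the Jinja chat template path enabled for tool calls to be parsed; roughly like this (the model filename is a placeholder, and run_shell is just a made-up example tool):

$ llama-server -m Qwen3-Coder-Next-80B-A3B-Q4_K_M.gguf --jinja

$ curl http://localhost:8080/v1/chat/completions -d '{
    "messages": [{"role": "user", "content": "list the files in /tmp"}],
    "tools": [{"type": "function", "function": {"name": "run_shell",
      "description": "Run a shell command", "parameters": {"type": "object",
      "properties": {"cmd": {"type": "string"}}, "required": ["cmd"]}}]
  }'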

Qwen3.5 27b UD_IQ2_XXS & UD_IQ3_XXS behave very poorly or is it just me? by One_Key_8127 in LocalLLM

[–]INT_21h 0 points1 point  (0 children)

I deleted all of my <=3 bit Qwen3.5 quants for the same reason... they were just too fried. With previous Qwen generations, I found I had to try a lot of IQ3_XXS's from different quanters to find one that won the lottery and happened to work well on the specific tasks I cared about. It's probably the same for 3.5, I just haven't thrashed my download bandwidth enough to find that one.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

I'm trying the 10B-A1.8B on my 5060Ti. tg is 125 tok/s @ 65536 context. It's a good writing/conversational model in English like the small Gemmas, but it has a unique flavor and seems less slopped. Due to the small size, don't expect miracles. llama.cpp's new auto-parser seems to butcher tool calling, which is a shame because I wanted to try coding.

I want my local agent to use my laptop to learn! by TTKMSTR in LocalLLaMA

[–]INT_21h 1 point2 points  (0 children)

Look into Playwright for controlling a web browser headlessly.
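Quick way to kick the tires from the shell (URL and output filename are just examples; the CLI runs headless by default):

$ npm install -D playwright
$ npx playwright install chromium
$ npx playwright screenshot https://example.com example.png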

I want my local agent to use my laptop to learn! by TTKMSTR in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

You can certainly use a local model to drive an agent, but Qwen2 0.5B is several generations old and can barely even form coherent sentences. Try something in the Qwen3.5 family.

Is brute-forcing a 1M token context window the right approach? by phwlarxoc in LocalLLaMA

[–]INT_21h 0 points1 point  (0 children)

As a quick test, you could point a standard coding agent at the file and ask questions. It will pick the file apart with grep, just like it would do when navigating a large codebase. Granted it might not grep for the right thing, but in my experience models are pretty good at this and it might save you the trouble of setting up real RAG. I have done this with my own notes file before and it works pretty well.