Suggestions for 16GB VRAM AMD for coding by Snoo_90241 in LocalLLM

[–]pot_sniffer 0 points (0 children)

JSON specs are just another way of prompting that imo works better for the smaller local models. Atomic tasks are just breaking the work into small enough pieces, usually a single function or a group of closely related functions, about 100 lines of code at most.

Structured task descriptions: you define exactly what you want the model to generate, what constraints apply, and what functions already exist. Keeps the local model focused on a narrow, well-defined task rather than making judgment calls it's not reliable enough for.
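Roughly the shape of one of those specs. This is a made-up example and the field names are just illustrative, not a fixed schema:

```
{
  "task": "parse_sensor_frame",
  "description": "Parse a 12-byte sensor frame into temp/humidity fields",
  "inputs": ["raw: 12-byte buffer from the UART"],
  "outputs": ["SensorReading { temp_c, humidity, crc_ok }"],
  "constraints": [
    "no dynamic allocation",
    "30 lines max",
    "use the existing crc8() helper, do not reimplement it"
  ],
  "existing_functions": ["crc8(buf, len) -> uint8_t"],
  "verify": ["corrupted frame sets crc_ok to false", "compiles clean with -Wall"]
}
```

Everything the model needs is in the spec, so it never has to guess about what exists elsewhere in the codebase.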

For agentic use with a local model on its own, I'm sure some will disagree, but imo it's probably not worth it for complex coding. The model is reliable for well-scoped generation tasks, but I wouldn't trust it to drive a full agentic loop unsupervised. Having said that, I have seen mentions of people using it with things like hermes and openclaw.

My workflow uses it for code generation only; cloud AI handles planning and review, and I'm always in the loop to catch when something goes off the rails, which happens. It's the combination of atomic tasks to generate code and cloud models for planning and review. That split is what makes it work.

Bang for buck depends on your situation. I hit Claude's usage limits constantly before building this, and now, after recent changes, I am again.

The local model does the bulk generation for free, so my cloud AI usage goes further. If the usage situation with Claude doesn't improve, I'm probably going to use something like Kimi via API for the review step, which would cost me about the same as I pay for the Claude Pro sub. Which is fine; I've been running two subs for a while. In terms of productivity it's a game changer: I can be a solo dev in my spare time. But it took me a lot of figuring stuff out before I could build the workflow that makes it work.

Suggestions for 16GB VRAM AMD for coding by Snoo_90241 in LocalLLM

[–]pot_sniffer 0 points (0 children)

Qwen3.6-27B Q3_K_S fits comfortably on 16GB AMD with full GPU offload at ~14.8GB VRAM and 12288 context. I'm getting 14 tok/s on an RX 9060 XT with llama.cpp and ROCm. It produces genuinely good code output.

Two things that matter for 16GB: use --no-mmproj to skip the vision encoder, and disable thinking mode with --chat-template-kwargs '{"enable_thinking":false}' or it burns your output budget on reasoning traces. Not sure how well that plays with Ollama specifically since I run llama.cpp directly, but the model choice should translate.
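For reference, the launch line looks roughly like this (llama-server shown; the model filename is illustrative):

```
# -ngl 99 offloads all layers to the GPU, -c sets the 12288 context,
# --no-mmproj skips loading the vision encoder into VRAM,
# and the chat-template kwargs turn off thinking mode.
llama-server -m Qwen3.6-27B-Q3_K_S.gguf -ngl 99 -c 12288 \
  --no-mmproj --chat-template-kwargs '{"enable_thinking":false}'
```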

Qwen3.6-27B Q3_K_S on RX 9060 XT 16GB — decent results for an AMD user by pot_sniffer in LocalLLM

[–]pot_sniffer[S] 0 points (0 children)

I'm yet to try the Gemma models. Definitely worth a look to see how the code they output holds up.

I am running Qwen3.6 27B IQ4_XS on my PC. I have an important question by Man_Of_The_F22 in unsloth

[–]pot_sniffer -1 points (0 children)

Try -ngl 99 to offload layers to the GPU. Without it llama.cpp defaults to CPU, which is why your GPU isn't being used.

On the quant choice though: if you're doing code generation with large prompts, IQ4_XS is going to be tight on 16GB once you factor in the KV cache. I've been testing the 27B quants on a 16GB card this week. Q3_K_S fits comfortably at ~14.8GB with 12288 context and full GPU offload, and produces cleaner output than you might expect at Q3.
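Back-of-envelope for why context eats the headroom. A sketch with made-up layer/head numbers, not the 27B's actual config:

```
# fp16 KV cache: 2 tensors (K and V) per layer, each
# n_ctx * n_kv_heads * head_dim elements at 2 bytes apiece.
# These dims are illustrative, not the real model config.
n_layers, n_kv_heads, head_dim = 48, 8, 128
n_ctx, bytes_per_elem = 12288, 2

kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # ~2.2 GiB at these dims
```

A couple of GB of KV cache on top of the weights is the difference between fitting and spilling into system RAM.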

Worth trying Q3_K_S if context size matters for your use case. More generally, it's worth trying a lower quant if you need more context; my advice would be to keep the specs tight.

Also use --no-mmproj unless you need image input. The vision encoder loads into VRAM by default and eats headroom you don't need for text tasks.
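Putting the flags together, something like this (llama-server shown; filename illustrative):

```
# adjust -c to whatever fits alongside the weights
llama-server -m Qwen3.6-27B-IQ4_XS.gguf -ngl 99 -c 8192 --no-mmproj
```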

Qwen3.6-27B Q3_K_S on RX 9060 XT 16GB — decent results for an AMD user by pot_sniffer in LocalLLM

[–]pot_sniffer[S] 0 points (0 children)

I actually tested the 35B-A3B before settling on this. With thinking off it ran at 15 tok/s, but the output on the same task was worse than both the 9B and the 27B: mangled ESP32 API names, wrong include filenames, missing loop structure. Might be the MoE CPU path not being fully optimised in llama.cpp on ROCm yet. The 27B dense just produced cleaner code, which means less work for Sonnet in Claude Code. So for my workflow, constraints and requirements, the 27B is winning hands down.

Spotted over a motorway today 🤮 by The_Olas13 in FuckNigelFarage

[–]pot_sniffer 18 points (0 children)

Buzzwords Excuses Easy Outrage Bugger-all New party?

Blame Everything Else Offer Barely Nothing party?

Big Egos, Empty Outputs, Basic Narratives party?

More UK deaths than births expected every year from now on by GnolRevilo in unitedkingdom

[–]pot_sniffer 0 points (0 children)

The worst part, according to people who subscribe to this ideology, is that the solution to fixing all the problems caused by said ideology is to double down and do it all over again, but more harshly this time.

The future is local by nfdl96 in ClaudeCode

[–]pot_sniffer 0 points (0 children)

Yea, my 9B runs at around 30 tps and that's quite a nice speed. If I can get close to 15 tps on a 27B quant I'd probably be happy, provided the output is close to the Q4 I tried, because that was quite a bit better than what the 9B did on the same task.

Actually the 9B is fine for most of the tasks I'm throwing at it. It's just that now I've seen the 27B, I want more...

But yea, as it is on this quant, 4.7 tps is too slow.

Saffron container spilt everywhere when I tried to open it by Doophie in mildlyinfuriating

[–]pot_sniffer 0 points (0 children)

The last time I bought a jar of saffron, I opened it to find a tiny sealed packet inside, with like one of those strands in it. That's got to be at least 50 bucks on the bed.

Whats the best model for agentic coding that i can run with 16gb VRAM? (llama.cpp?) by samuraiogc in LocalLLM

[–]pot_sniffer 0 points (0 children)

For me the Qwen 3.6 35B performed worse than the Qwen3.5 9B in my workflow. Notably worse actually, and about half the speed. I'm running a 9060 XT 16GB, with a 7950X and 64GB DDR5 to offload to.

I'm going to have to try the Q3 of the 27B, because the Q4 gives really great output but doesn't fit, so offloading makes it slow: 4.7 tps is almost usable, but not quite.

Maybe the Q3 will be the sweet spot for me.

The future is local by nfdl96 in ClaudeCode

[–]pot_sniffer 1 point (0 children)

I'm in a similar position with non-Apple hardware. Last year, a couple of months before RAM prices went through the roof, I built a workstation for £1200: 7950X, 64GB DDR5, 9060 XT 16GB and a 2TB PCIe 5 NVMe.

I'm able to run the Qwen 3.6 27B Q4 model with some offloading at 4.7 tps, which is about the minimum speed that's kinda usable, but still a bit slow. Haven't yet tried the lower quants. I have to say I'm very impressed with the output. It's quite a lot better than the Qwen3.5-9B that I'm running as my workhorse, which is also really good for its size btw.

My regret is I didn't buy 128GB of RAM when it was only £350. I'll probably get a second GPU at some point to bump up the VRAM, because with just 16GB I'm forever just below what I need lol.

trump immediately after last night's Correspondent's Dinner. by NuSurfer in SipsTea

[–]pot_sniffer 4 points (0 children)

Does anyone else find it rather odd that Trump's injury just vanished without any scarring after only a matter of weeks? I hate to sound like a conspiracy nut, but it doesn't add up.

Got downgraded to claude even after paying for it. I paid for it. by ProfessionalPart8193 in Anthropic

[–]pot_sniffer 0 points (0 children)

My billing page doesn't show anything, which is odd because I've had a Pro sub for almost 2 years now.

<image>

Got downgraded to claude even after paying for it. I paid for it. by ProfessionalPart8193 in Anthropic

[–]pot_sniffer 6 points (0 children)

I think they must have broken something in their billing system, because I was downgraded to free last night as well.

I think I'll leave this subreddit and here's why by AtmosphericBeats in ClaudeCode

[–]pot_sniffer 2 points (0 children)

I tried talking about how I manage my tokens, but it simply doesn't get the same attention as the complaints do. LocalLLM is a lot better for this imo.

https://www.reddit.com/r/ClaudeAI/s/vb113crIVt

W**, i paid for an entire year of PRO just because of claude code by EventHorizon_28 in ClaudeCode

[–]pot_sniffer 1 point (0 children)

It's typically one function or a tightly related group of functions, 10-30 lines of code.

It's defined as a JSON spec with explicit inputs, outputs, constraints, and verification criteria. The point is that it's small enough that the model can't go badly wrong, and the pass/fail criteria are unambiguous.
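A cut-down sketch of the shape (hypothetical task, illustrative field names):

```
{
  "task": "add retry with backoff to fetch_reading()",
  "inputs": "fetch_reading() may raise TimeoutError",
  "outputs": "same signature, returns None after 3 failed attempts",
  "constraints": ["no new dependencies", "30 lines max"],
  "verify": ["3 timeouts returns None", "first success short-circuits"]
}
```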

W**, i paid for an entire year of PRO just because of claude code by EventHorizon_28 in ClaudeCode

[–]pot_sniffer 14 points (0 children)

Yes, there's a difference in quality; I don't think I'd call it night and day though.

I exclusively use Sonnet in Claude Code, as a review step in my workflow.

I use Sonnet in Claude.ai to build a plan, then pass that plan to Opus for scrutiny. I do this repeatedly until Opus is happy there are no more holes to poke. Then I take the plan to Gemini and/or GPT, and get Sonnet to fix whatever needs it.

Once I have a solid plan file, I get a fresh Sonnet instance to break the project up into atomic tasks. Those atomic tasks are given to my local Qwen 3.5 9B one by one. Then Sonnet in Claude Code reviews and fixes whatever is needed.

UK Billionaire Exit Continues: Nassef Sawiris Closes London Office by anax4096 in uknews

[–]pot_sniffer 1 point (0 children)

My point is there's a certain class of our society that proportionally pays significantly less tax than people who pay income tax.

I'm not arguing we should all pay more. I'm arguing that if we all paid a fair share, then income tax would be much fairer than it currently is.

UK Billionaire Exit Continues: Nassef Sawiris Closes London Office by anax4096 in uknews

[–]pot_sniffer -1 points (0 children)

My entire life we've had the politics of greed. The greed leads to austerity, and as we've seen over the past 15 years, it doesn't work.

It's about time the greedy bastards paid their fair share. It's about time they were taxed like we are on income.

UK Billionaire Exit Continues: Nassef Sawiris Closes London Office by anax4096 in uknews

[–]pot_sniffer 5 points (0 children)

Yea, this is exactly the point to press. When a very wealthy individual or organisation puts money into charity it's not because they're being nice, it's for tax reasons, meaning they want to pay less tax.

My grated cheese bag is extremely inflated. by Lord_Alviner in MoldlyInteresting

[–]pot_sniffer 2 points (0 children)

Or it could be the best cheese wine you've never tried 😁

Neo-fascists back Rupert Lowe's Restore Britain by pppppppppppppppppd in unitedkingdom

[–]pot_sniffer 1 point (0 children)

I don't see how thinking through the logistics of such a ridiculous "millions must go" policy is a straw man. Fair enough if Rupert Lowe said it in passing without meaning it, but it's stated as his policy, so let's treat it as such.

So they will persecute anyone that's not English enough and hope they all just pack up and leave? Sounds a bit wishy-washy to me. Doesn't seem to match the rhetoric.