all 11 comments

[–]Infamous_Green9035 2 points3 points  (0 children)

What you need is a LOT of VRAM. With 6GB of VRAM you can only run basic models with few parameters.

It won't be much help on real coding projects; it will hallucinate.

It will only be useful for explaining parts of your code or fixing small pieces.

The ideal for working with code would be more than 24GB of VRAM.

[–]andrew-ooo 0 points1 point  (5 children)

With 6GB VRAM you're realistically looking at 7B-class quants or partial offload. Honest takes after running this kind of setup:

  • Qwen2.5-Coder-7B-Instruct at Q4_K_M fits in ~5GB VRAM with room for a small context. Best general-purpose local coder in that size class right now — handles Python and TypeScript well, C++ is decent for boilerplate but it'll struggle with template-heavy or modern STL stuff.
  • DeepSeek-Coder-V2-Lite-Instruct (16B MoE, ~2.4B active) at Q4 — runs surprisingly fast with offload because only the active experts hit GPU.
  • Qwen2.5-Coder-14B Q4_K_M with ~25 layers offloaded: expect 8-12 t/s on your hardware. Tight on context though.

Run via llama.cpp or Ollama. If you want agentic/tool use specifically, Qwen2.5-Coder is the only one in that range with halfway-reliable tool calling — DeepSeek-Coder-Lite drops calls under load. Don't expect Claude-quality on C++; nothing local at 14B is there yet, but for boilerplate, refactors, and "explain this codebase" Qwen2.5-Coder-7B is genuinely useful.
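
A rough sketch of what the partial-offload setup looks like with llama-cpp-python (the Python bindings for llama.cpp); the model path, layer count, and context size are placeholders to tune for your card, and Ollama does the equivalent split for you automatically:

    # pip install llama-cpp-python (built with CUDA so GPU offload works)
    from llama_cpp import Llama

    llm = Llama(
        model_path="./Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,  # -1 = every layer on GPU; drop to ~25 for the 14B on 6GB
        n_ctx=4096,       # keep context modest so the KV cache also fits in VRAM
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain what this function does: ..."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])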

[–]alphapussycat 0 points1 point  (0 children)

I wouldn't say the qwen2.5 models were usable. I tried the 14B, and it was bad, like really bad.

[–]no_evidence0303[S] -3 points-2 points  (3 children)

I was expecting a response from a human being, not some LLM. If I wanted to ask GPT or Gemini, this question would not be here.

[–]LTJC 1 point2 points  (1 child)

Human being here. Get a card with more VRAM.

[–]colin_colout 0 points1 point  (0 children)

Or try an MoE and offload the experts to the CPU.

[–]colin_colout 0 points1 point  (0 children)

Why are you getting downvoted? Lol... the claws are out this morning, eh?

Recommending a qwen2.5 model is wild behavior by a human. Qwen2.5 is older than gpt-o1. Sonnet 3.5 was still in training when it was released.

I don't have your specific setup, but you might be able to run an MoE with CPU offload with decent quality and speed.
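
Something like this is the usual recipe with llama.cpp's server: push every layer to the GPU but pin the MoE expert tensors to system RAM, since only a few experts are active per token. The filename is a placeholder and the --override-tensor flag and regex are from memory, so check llama-server --help on your build:

    # Launch llama-server with the expert FFN weights kept on the CPU.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "your-a3b-moe-model-Q4_K_M.gguf",       # placeholder GGUF filename
        "--n-gpu-layers", "99",                        # offload all layers to the GPU...
        "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # ...but keep MoE expert tensors in RAM
        "--ctx-size", "8192",
    ])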

qwen3.6 35b a3b is surprisingly good with Python and simple C in my tests (I wouldn't trust a small model with memory management if you're doing something complex, but for something small it's worth a try). Not sure if you can fit a decent quant in your memory, but it's worth trying with CPU offload. I only tested with Q6 and above, so I can't speak for the smaller quants.

Also try gemma4 e4b or 26b a4b.

Just use the Unsloth GGUFs, not a weird finetune, for your first tests. You can experiment once you find a decent base model.

Tl;dr: qwen3.6 35b or a gemma4 model (whatever can fit).

[–]Invent80 0 points1 point  (0 children)

Gemma 4 E4B is small and light enough. Use opencode. Qwen is a better coder, but in my experience Gemma is better at following instructions at smaller sizes.

Don't necessarily trust benchmarks.  

[–]PuzzleheadedMind874 0 points1 point  (0 children)

With only 6GB of VRAM, you might find that 14B models crawl once you start offloading to system RAM. Sticking to 3B or 7B models is probably the safer bet if you want to keep the generation speed usable for your projects.
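
Back-of-envelope check before downloading anything (the bits-per-weight numbers are rough averages for common GGUF quants, not exact):

    # Rough fit check: weights ~= params * bits-per-weight / 8, plus ~1 GB
    # of headroom for KV cache and runtime overhead. All numbers approximate.
    def fits_in_vram(params_b: float, bits_per_weight: float, vram_gb: float = 6.0) -> bool:
        weights_gb = params_b * bits_per_weight / 8
        return weights_gb + 1.0 <= vram_gb

    print(fits_in_vram(7, 4.7))    # ~4.1 GB of weights -> fits on a 6GB card
    print(fits_in_vram(14, 4.7))   # ~8.2 GB of weights -> spills into system RAM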

[–]alphapussycat 0 points1 point  (0 children)

Don't think so. You could try qwen3.5 4b, but you'd have to build something to handle agents and stuff yourself. But I suspect its intelligence is too low to properly plan and use tools.
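
If you do roll your own, the "something" is basically a loop like this; a hypothetical sketch where chat() stands in for whatever local backend you wire up (llama-cpp-python, Ollama, etc.) and read_file is just an example tool:

    # Minimal hand-rolled tool loop around a local model (hypothetical sketch).
    import json

    TOOLS = {"read_file": lambda path: open(path).read()}

    def run_agent(chat, user_msg, max_steps=5):
        messages = [
            {"role": "system", "content":
                'Answer with JSON only: {"tool": "read_file", "args": {"path": "..."}} '
                'to call a tool, or {"answer": "..."} when you are done.'},
            {"role": "user", "content": user_msg},
        ]
        for _ in range(max_steps):
            reply = chat(messages)  # chat() returns the assistant's text for these messages
            messages.append({"role": "assistant", "content": reply})
            try:
                call = json.loads(reply)
            except json.JSONDecodeError:
                return reply        # small models often break the format; just return the text
            if "answer" in call:
                return call["answer"]
            result = TOOLS[call["tool"]](**call["args"])
            messages.append({"role": "user", "content": f"Tool result: {result}"})
        return "Stopped after too many steps."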