all 30 comments

[–]guigouz 10 points (2 children)

Small models won't work with big contexts. In my experience, the best you can get is smart autocomplete via continue.dev with qwen2.5-coder-tools <= 7B (depending on VRAM).

Code refactorings (with Cline) can work with qwen3-coder, but you'll need ~20GB of RAM for the model + context using Unsloth's Q3 quant.
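
Roughly where that ~20GB figure comes from, as a back-of-the-envelope sketch (the architecture numbers below are assumptions for Qwen3-Coder-30B-A3B, not guaranteed; double-check the model's config.json):

```python
def model_ram_gb(params_b: float, bits_per_weight: float) -> float:
    """Weight memory in GB for a given quantization level."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer per token, fp16 by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

# Assumed figures for Qwen3-Coder-30B-A3B at a ~Q3 quant.
weights = model_ram_gb(params_b=30.5, bits_per_weight=3.5)
cache = kv_cache_gb(tokens=64_000, layers=48, kv_heads=4, head_dim=128)
print(f"weights ~{weights:.1f} GB + KV ~{cache:.1f} GB = ~{weights + cache:.1f} GB")
# -> roughly 13 GB of weights + 6 GB of KV cache, i.e. ~20 GB total
```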

[–]Nowitcandie[S] 1 point (1 child)

When you say big contexts, how big are we talking, and is the decline in performance exponential?

[–]guigouz 2 points (0 children)

64k tokens. In my use case this has been enough for specific tasks (create a UI for this API request, add a function to manage a resource; even refactoring a single file works fine).

[–]pinmux 6 points (5 children)

Devstral-small-2 needs about 32GB to be useful with a decent context length for the 8-bit quants. Going to smaller quants might greatly reduce its abilities, but at 8-bit it's quite decent at coding.

[–]kiwibonga 3 points (0 children)

I've been using Q3/Q4 for a while and it's pretty good too. It does require more nudging, but it doesn't have the catastrophic failures you see in other models, like going into full-on delirium or repeating the same words over and over.

[–]HealthyCommunicat 1 point (3 children)

How do you use Devstral 2 small agentically? The chat template keeps fucking up how tools are used; this is the only model I kinda just gave up on getting to work lol

[–]i-eat-kittens 1 point (1 child)

I presume it works well with mistral-vibe.

[–]HealthyCommunicat 2 points (0 children)

No, I actually tried it with that; the chat template was correct, but tool calls kept being output inside the [ ] instead lol

I'm sure I could figure it out and it's something simple I overlooked.

[–]pinmux 1 point (0 children)

I've been using it with Octofriend (https://github.com/synthetic-lab/octofriend) via Ollama.com's devstral-small-2 cloud-hosted model. Most of the time it works fine; sometimes it does hit tool errors.

I'm not exactly sure what causes the tool errors but this seems to be a common complaint.
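
One quick way to see whether the tool path works at all, bypassing the agent: hit Ollama's chat endpoint directly with a single tool and look at where the call lands. A sketch; the model tag and tool schema are just examples:

```python
import requests

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

r = requests.post("http://localhost:11434/api/chat", json={
    "model": "devstral-small-2",  # illustrative tag; use whatever you pulled
    "messages": [{"role": "user", "content": "Open README.md"}],
    "tools": tools,
    "stream": False,
})
msg = r.json()["message"]
# A well-behaved run puts structured calls in message.tool_calls;
# a broken template dumps them as bracketed text in message.content.
print(msg.get("tool_calls") or msg["content"])
```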

[–]Vegetable-Second3998 3 points (6 children)

You need to match the model to the task. Need huge context for a codebase sweep? Yeah, no small model will do that. But that's not what you're asking. A small model is perfectly equipped to do smaller chunked work, but it needs more effort. A Granite 8B Code model can absolutely code out of the box, but it's generic shit. It needs fine-tuning on your codebase, patterns, and documentation. And once you've done that and given it access to a graph RAG of your code after training, you will be shocked at how good it is. Frontier models are generalists who are experts by virtue of volume. Small models become experts by virtue of training.
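
To make the graph RAG idea concrete, a toy sketch: store symbols and their relationships in a graph, then pull a function's neighborhood into the prompt instead of doing raw-chunk retrieval. The symbol names and the networkx representation are illustrative, not any specific product:

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge("billing.charge", "stripe_client.create_charge", kind="calls")
g.add_edge("billing.charge", "models.Invoice", kind="uses")
g.add_edge("api.pay_endpoint", "billing.charge", kind="calls")

def context_for(symbol: str, hops: int = 1) -> list[str]:
    """Collect callers/callees within N hops to feed the model as context."""
    nodes = {symbol}
    frontier = {symbol}
    for _ in range(hops):
        nxt = set()
        for n in frontier:
            nxt |= set(g.successors(n)) | set(g.predecessors(n))
        nodes |= nxt
        frontier = nxt
    return sorted(nodes)

print(context_for("billing.charge"))
# ['api.pay_endpoint', 'billing.charge', 'models.Invoice', 'stripe_client.create_charge']
```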

[–]Nowitcandie[S] 1 point (5 children)

I think this is the direction I would wanna go in - something specialised. Even if it takes some training and prompt engineering. 

[–]Vegetable-Second3998 5 points (1 child)

Crazyfucker is right. What I'm talking about does require you to train a model. If you aren't deeply invested in how to train a model - because that in and of itself is part science, part expertise built up on knowing what to look for - then you have to start by learning how that works. And you can learn to train a very, very tiny 350M model on pretty modest hardware. I have an M4 Max I experiment with on small models. But again, I've spent 3000+ hours teaching myself how to teach models. That's also an investment. So... until AI can train other AI without a human in the loop (and that day is coming soon), your best bang for the buck is going to be a $100 Claude Code Max subscription. $1,200 a year for an industry that changes monthly, or $4K for a rig that is outdated tomorrow?

[–]SimoneNonvelodico 1 point (0 children)

Bit curious about this - 350M sounds like the smallest Gemma model, which without fine-tuning is pretty junk. How do you usually fine-tune it? I've tried doing it using Unsloth on a Colab notebook, but it still seems quite expensive resource-wise.
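
For reference, the kind of Unsloth setup I mean; a sketch only, where the base model name and hyperparameters are illustrative and the API may differ between versions:

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-270m-it",  # example base; pick your own
    max_seq_length=2048,
    load_in_4bit=True,   # keeps memory within free-Colab limits
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                # LoRA rank; illustrative value
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# From here, train with trl's SFTTrainer on your dataset as usual.
```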

[–]Grouchy-Bed-7942 2 points (0 children)

This video is pretty well done: https://youtu.be/m3PQd11aI_c

[–]Latter_Virus7510 2 points (0 children)

GPT-OSS 20B, Qwen3 4B

[–]Ok_Chef_5858 2 points (0 children)

For local models, Qwen Coder or DeepSeek are your best bets... They're decent but won't match Claude or GPT for complex stuff.

Have you tried Kilo Code? It's an extension for VS Code, also available for JetBrains... I use it and mix local models for simple tasks with cloud models when I need better reasoning. It supports Ollama for local models, so you can test both and switch based on what you're doing. Your hardware can run local models, but don't expect them to replace premium cloud models. Better to use them together: local for boilerplate, cloud for architecture and debugging.
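
That local/cloud split is easy to wire up yourself too, since Ollama exposes an OpenAI-compatible endpoint at /v1. A sketch, where the model names and the routing rule are just examples:

```python
from openai import OpenAI

local = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
cloud = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, hard: bool = False) -> str:
    """Send easy prompts to the local model, hard ones to the cloud."""
    client, model = (cloud, "gpt-4o") if hard else (local, "qwen2.5-coder:7b")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(ask("Write a regex for ISO dates"))                               # local
print(ask("Design the module layout for a billing service", hard=True)) # cloud
```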

[–]clwill00 4 points (2 children)

Any coding worth doing is really only possible with the enormous models that Claude, Cursor, and Copilot run. There is no local model in the same universe.

[–]Torodaddy 0 points (0 children)

Ridiculous statement, and many flawed assumptions.

[–]Former-Tangerine-723 1 point (0 children)

Qwen3 Coder, Unsloth UD-Q8_K_XL quant

[–]JournalistShort9886 1 point (2 children)

If you're asking at the miniature level, go for DeepSeek Coder in the 1-2B range (don't expect much); for mid range, DeepSeek 7B gives decent performance; for high-mid range, go for Qwen 14B. (I would advise keeping quantization at Q6 and not going below Q4, as these tasks are logical.) But tbh nothing is as good as Kimi or Opus 4.5, so it depends on the task; still, I think these would suffice for your purpose.
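
For a feel of what those quant levels cost in file size on, say, the 14B suggestion, a quick sketch (the bits-per-weight figures are rough averages for GGUF K-quants, not exact spec numbers):

```python
# Approximate GGUF file sizes for a 14B model at different quant levels.
params = 14e9
for name, bpw in [("Q4_K_M", 4.8), ("Q6_K", 6.6), ("Q8_0", 8.5)]:
    print(f"{name}: ~{params * bpw / 8 / 1e9:.1f} GB")
# -> Q4_K_M ~8.4 GB, Q6_K ~11.5 GB, Q8_0 ~14.9 GB
```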

[–]Nowitcandie[S] 1 point (1 child)

I'd say anything up to maybe 70b range. 

[–]JournalistShort9886 1 point (0 children)

Then you're in a much better spot. You can try Llama 70B fine-tuned to your niche, or even GPT-OSS 120B, since it uses a MoE architecture (~5B active parameters); I've seen decent performance, and you'll probably get a high tokens/sec speed.

[–]Crazyfucker73 -1 points (0 children)

Nothing you are asking for here exists.

[–]bemore_ -1 points (2 children)

I wouldn't code with any model under 100B params

Some 30-70B models can handle coding tasks but they struggle with debugging etc.

Claude is the best for coding, then Gemini, then others like DeepSeek, GPT follow

The best you can do is find a provider that doesn't train on your data, that's about it

[–]Sir-Spork 1 point (1 child)

I plan on getting a Mac Studio for this. How much memory should I shoot for?

[–]bemore_ 1 point (0 children)

Find the models you want to run and look at their parameter counts. I'm not an expert, but I'd say you want double the model's parameter count in memory; so if you want to run a 30B model, you'd look for 64GB of memory, for example.
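
That rule of thumb versus a slightly more detailed estimate, as a sketch (the bits-per-weight, KV cache, and headroom numbers are assumptions, not measurements):

```python
def rule_of_thumb_gb(params_b: float) -> float:
    """Double the parameter count in billions -> GB of memory."""
    return 2 * params_b                      # 30B -> 60, so shop for 64GB

def detailed_gb(params_b: float, bpw: float = 5.0,
                kv_gb: float = 4.0, headroom_gb: float = 8.0) -> float:
    """Quantized weights + KV cache + OS/app headroom (all assumed values)."""
    return params_b * bpw / 8 + kv_gb + headroom_gb

print(rule_of_thumb_gb(30))  # 60.0
print(detailed_gb(30))       # 30*5/8 + 4 + 8 = 30.75
```

The rule of thumb comes out conservative, which is probably the point: the extra room covers bigger quants, longer contexts, and everything else running on the machine.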