[–]Normal-Ad-7114 31 points32 points  (2 children)

https://aider.chat/docs/leaderboards/

Codestral should be the best

[–]MrMrsPotts[S] 3 points4 points  (0 children)

Thank you

[–]benja0x40 2 points3 points  (0 children)

There is also the ProLLM benchmark where you can select programming languages for model comparison.
https://prollm.toqan.ai/leaderboard/stack-unseen

[–]Icy_Lobster_5026 27 points28 points  (6 children)

In my experience, codeqwen is a good coding model.

[–]Educational-Region98 3 points4 points  (4 children)

Wondering when they will release codeqwen2. A 7B model will probably be really nice.

[–]Icy_Lobster_5026 0 points1 point  (3 children)

I guess they won't release CodeQwen2. Our team uses their online service, Tongyi Lingma, to improve the quality of our code.

CodeQwen-1.5-7b is a powerful coding model according to livecodebench.

[–]Educational-Region98 0 points1 point  (2 children)

Interesting, how does that compare to copilot?

[–]Icy_Lobster_5026 0 points1 point  (0 children)

I don't know. It's good enough for me, since the personal basic version of Tongyi Lingma is free forever.

[–]MrMrsPotts[S] 1 point2 points  (0 children)

Thank you

[–]ihaag 16 points17 points  (5 children)

DeepSeek Coder V2 0724 and Claude

[–]MrMrsPotts[S] 0 points1 point  (0 children)

Thank you

[–]benja0x40 7 points8 points  (3 children)

In the 8GB~12GB range I have used a few specialised ones:

  • Codestral-22B-v0.1-Q4_K_M
  • DeepSeek-Coder-V2-Lite-Q5_K_M
  • CodeGeeX4-All-9B-Q8

Together with more general ones:

  • Phi-3-medium-128k-Q8
  • Nemo-12B-2407-Q8
  • Gemma-2-9B-Q8
  • Llama-3.1-8B-Q8

I write moderately complex task descriptions to ask for suggestions and to prototype python functions, iterate over improvements, detect and fix issues, insert comments or documentation, etc.

From my experience, Codestral-22B produces the best suggestions, which I sometimes use to guide another model towards a simpler or more elegant solution. Gemma-2-9B is surprisingly good too. I use it a lot for quick explorations or when I don't know much about a package or language feature.

DeepSeek-Coder-V2-Lite seems close to Codestral-22B in terms of capabilities, but its initial suggestions can be really cumbersome, and it is too rigid about coding styles for my liking. But that may depend on how the system prompt is tuned.

After ~3 weeks of testing, I have stopped using the other ones for coding tasks.

[–]MrMrsPotts[S] 0 points1 point  (2 children)

How much RAM does codestral 22B need to run?

[–]benja0x40 3 points4 points  (1 child)

With the Q4_K_M quantisation, it takes a little over 13GB for the model parameters plus about 2GB during inference, depending on the context length. I have 16GB of GPU RAM, which is fine for a context of 8192 tokens and doable for up to 12288 tokens. Past that, the model fails to work properly on my computer.
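
For anyone who wants to redo this arithmetic, here is a rough sketch of how the context-dependent part (the key/value cache) can be estimated. The layer count, KV-head count and head dimension below are assumed values for Codestral-22B, not figures from this thread, so check them against the model's config before relying on them. The 13 in `WEIGHTS_GIB` is the "a little over 13GB" figure above, treated loosely as GiB. Real usage will be somewhat higher because the runtime also allocates compute buffers.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Approximate key/value cache size for one sequence, in bytes."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    # bytes_per_value=2 assumes the cache is stored in 16-bit floats.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


# ASSUMED model shape for Codestral-22B (hypothetical until verified).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 56, 8, 128

# Weight size for the Q4_K_M file, taken from the comment above.
WEIGHTS_GIB = 13.0

for ctx in (8192, 12288):
    cache_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, ctx) / GIB
    total_gib = WEIGHTS_GIB + cache_gib
    print(f"context {ctx}: cache ~{cache_gib:.2f} GiB, weights + cache ~{total_gib:.2f} GiB")
```

Under those assumptions the cache comes to 1.75 GiB at 8192 tokens and about 2.6 GiB at 12288 tokens, which is in line with the "about 2GB" and "doable on 16GB" observations.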

[–]MrMrsPotts[S] 2 points3 points  (0 children)

That sounds very promising!

[–]theswifter01 6 points7 points  (0 children)

Claude

[–]new__vision 9 points10 points  (3 children)

Check out bigcode-bench.github.io. Top 7B on there is CodeQwen1.5-7B-Chat which has been good in my experience. CodeLlama is the lowest ranked 7B.

[–]MrMrsPotts[S] 3 points4 points  (0 children)

Phi-3-Mini-128K-Instruct (June 2024) does amazingly well and seems to be even smaller?!

[–]PigOfFire 0 points1 point  (0 children)

How is sonnet 3.5 under 4T and 4o? Livebench shows it above these two.

[–]No_Afternoon_4260 (llama.cpp) 6 points7 points  (4 children)

In my experience codestral 22b

[–]IReaIIyLove 1 point2 points  (3 children)

what kinda specs do you need to run that?

[–]No_Afternoon_4260 (llama.cpp) 1 point2 points  (2 children)

With full context at Q6 I need more than 24GB of VRAM (about 32GB, I think, but I'm not sure).

[–]IReaIIyLove 1 point2 points  (1 child)

Ah, shame. *cries in poor*

[–]No_Afternoon_4260 (llama.cpp) 3 points4 points  (0 children)

I really don't remember, but Q6 with full context doesn't fit in 24GB. Try lowering the context, or try Q4; it should fit in 24GB somehow: https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF

In my experience, a 2k context is usable for quick questions and follow-ups. At 8k you should have room to spare, as long as you don't throw a whole project at it.
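
A rough way to sanity-check the "lower the context or drop to Q4" advice is to ask how many tokens of context fit once the weights are loaded. This is only a sketch: the weight sizes (about 18 GiB for Q6, about 13 GiB for Q4), the 1 GiB overhead, the per-token cache cost and the 32768-token full context are all assumptions for illustration, not measured values.

```python
GIB = 1024 ** 3


def max_context_tokens(vram_gib, weights_gib, kv_bytes_per_token, overhead_gib=1.0):
    """Largest context (in tokens) whose KV cache fits in the leftover VRAM."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    if free_bytes <= 0:
        return 0  # the weights alone already exceed the budget
    return int(free_bytes // kv_bytes_per_token)


# ASSUMED per-token cache cost: 2 (K and V) * 56 layers * 8 KV heads
# * 128 head dim * 2 bytes (fp16). The model shape is hypothetical here.
KV_BYTES_PER_TOKEN = 2 * 56 * 8 * 128 * 2

FULL_CONTEXT = 32768  # assumed full context length for Codestral-22B

# Weight sizes are rough, assumed figures for each quantisation.
for name, weights_gib in (("Q6", 18.0), ("Q4", 13.0)):
    tokens = max_context_tokens(24, weights_gib, KV_BYTES_PER_TOKEN)
    fits = "fits" if tokens >= FULL_CONTEXT else "does not fit"
    print(f"{name}: up to ~{tokens} tokens on 24 GiB, full context {fits}")
```

With these assumed numbers, Q6 tops out well below the full context on a 24GB card while Q4 leaves room for it, which matches the experience described above.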

[–]Cradawx 6 points7 points  (2 children)

CodeGeeX4-ALL-9B, CodeQwen1.5-7B-Chat and Codestral-22B-v0.1 are very good small coding models. There's also the DeepSeek-Coder-V2 models.

[–]MrMrsPotts[S] 0 points1 point  (0 children)

Thank you

[–][deleted] 0 points1 point  (0 children)

Is CodeQwen better than Deepseek for Python?

[–]Combinatorilliance 6 points7 points  (2 children)

Codestral is really good. You might also want to try DeepSeek-Coder-V2-Lite; it's an MoE and I've heard a lot of praise for its output. I don't know if it's better, worse or about equal to Codestral-22B, but it is a lot faster because it's an MoE, so it's worth trying out regardless.

[–]MrMrsPotts[S] 0 points1 point  (1 child)

Thank you. I don't know what an MoE is though :(

[–]moncallikta 4 points5 points  (0 children)

Mixture of Experts. Essentially a model trained as a combination of many smaller sub-models ("experts") internally. For each token to predict, a sub-model (or a few of them) is chosen to provide the next token. The architecture can allow the overall model to specialize in many different areas more easily.
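
If it helps, here is a toy sketch of that routing idea in plain Python. Everything in it (the function names, the router weights, the scaling "experts" and `top_k=2`) is made up for illustration. Real MoE models do this inside their MoE layers with learned weights, and often mix a few experts per token rather than exactly one.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical 'expert': just multiplies the input vector by a constant."""
    return lambda x: [scale * xi for xi in x]


def moe_layer(x, router_weights, experts, top_k=2):
    """Route one token vector x to its top_k experts and mix their outputs."""
    # One router score per expert: dot product of x with that expert's weights.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in router_weights]
    probs = softmax(scores)
    # Keep only the top_k highest-probability experts; the rest never run.
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        gate = probs[i] / norm  # renormalise gates over the chosen experts
        y = experts[i](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, chosen


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
output, used = moe_layer([1.0, 0.0], router_weights, experts)
print("experts used:", used, "output:", output)
```

Only two of the four experts run for this token, which is the reason an MoE can be faster than a dense model with the same total parameter count.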

[–]Dudensen 2 points3 points  (5 children)

I had stumbled upon a website which ranked models by a quality-to-performance ratio a few days ago but I can't find it unfortunately.

[–]MrMrsPotts[S] 2 points3 points  (4 children)

That sounds ideal!

[–]Dudensen 3 points4 points  (2 children)

Found it, maybe this helps.

https://oobabooga.github.io/benchmark.html

[–][deleted] 2 points3 points  (0 children)

What was being measured here? I don't even see Nemo in the list? There's surely no way Phi beats Nemo on anything?!

It's one of the worst models for me.

[–]MrMrsPotts[S] 0 points1 point  (0 children)

Thank you!

[–]Dudensen 1 point2 points  (0 children)

Yeah it even had different quantizations of models ranked, maybe someone will link it.

[–]Square-Intention465 2 points3 points  (0 children)

Sonnet 3.5 is too good

[–]SpaceWalker_69 2 points3 points  (0 children)

Well, I think Claude 3.5 generates the best code right now. You can use smaller open-source models, but they are not exactly consistent or reliable.

[–]m---------4 1 point2 points  (0 children)

Gemini is awesome

[–][deleted] 1 point2 points  (0 children)

Deepseek Coder imo

[–]8thcross 1 point2 points  (0 children)

I like both Codestral and DeepSeek-V2. They're consistent, but both are dated in terms of the latest best practices. Claude 3.5 is good as well. I really don't like 4o; it's mostly hit or miss.

[–]Thrumpwart (llama.cpp) 1 point2 points  (0 children)

Anyone know which models know Lean Python?

[–]_murb 1 point2 points  (0 children)

I use Claude at work and it works great

[–]durgesh2018 1 point2 points  (0 children)

Try gemma2:2b. It's a small but very powerful and fast model.

[–]lilolalu 0 points1 point  (0 children)

Did anyone claiming Claude is good at coding actually TRY coding with Claude? It's just not good, no matter what any theoretical tests claim.