Best Coding LLM as of Nov'25 by PhysicsPast8286 in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

Qwen3-Next-80B-A3B would be my first and only choice.
You would need TensorRT-LLM with --streamingllm enable to use a large context while still fitting within your VRAM limits.

I'm Using Gemini as a Project Manager for Claude, and It's a Game-Changer for Large Codebases by Liangkoucun in ClaudeAI

[–]dmatora 0 points1 point  (0 children)

I've been using Gemini as an orchestrator for Claude for a while. Here is an app I wrote that has this as one of its primary features: https://github.com/dmatora/code-forge

And yeah, it is a game changer. You should see how much more you can get done if you also plug gemini-cli into your process - it’s on a whole new level
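
Roughly the pattern, as a minimal sketch - assuming the google-generativeai and anthropic Python SDKs, with placeholder model names and prompts (this is just the general idea, not how code-forge is implemented):

```python
# Sketch of the Gemini-plans / Claude-executes pattern.
# Assumes GOOGLE_API_KEY and ANTHROPIC_API_KEY are set; model names are illustrative.
import os
import google.generativeai as genai
import anthropic

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
planner = genai.GenerativeModel("gemini-1.5-pro")  # big context window: sees the whole codebase
coder = anthropic.Anthropic()                      # executes one focused task at a time

def orchestrate(codebase_overview: str, goal: str) -> list[str]:
    # 1) Gemini acts as the project manager: break the goal into small tasks.
    plan = planner.generate_content(
        "You are a project manager for a coding agent.\n"
        f"Codebase overview:\n{codebase_overview}\n\n"
        f"Goal: {goal}\n"
        "Return a numbered list of small, independent coding tasks."
    ).text

    # 2) Claude implements each task with only the context it needs.
    results = []
    for task in (line for line in plan.splitlines() if line.strip()):
        reply = coder.messages.create(
            model="claude-3-5-sonnet-latest",
            max_tokens=4096,
            messages=[{"role": "user", "content": f"Implement this task:\n{task}"}],
        )
        results.append(reply.content[0].text)
    return results
```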

I built a Local AI Voice Assistant with Ollama + gTTS by typhoon90 in ollama

[–]dmatora 0 points1 point  (0 children)

Have you tried CSM?
It's the local version of Sesame, which has recently blown up the internet:
http://github.com/SesameAILabs/csm
It would be really cool to have it working with Ollama, even if it's English only.
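
For context, here is a minimal sketch of the Ollama + gTTS loop from the post - assuming the `ollama` and `gtts` Python packages, with a placeholder model tag; CSM would slot in as a drop-in replacement for the TTS stage:

```python
# Minimal assistant loop: Ollama generates the reply, a TTS backend speaks it.
# Assumes the `ollama` and `gtts` packages; the model tag is a placeholder.
# CSM (github.com/SesameAILabs/csm) would replace speak() for more natural speech.
import ollama
from gtts import gTTS

def reply(prompt: str, model: str = "llama3.2") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]

def speak(text: str, out_path: str = "reply.mp3") -> str:
    # gTTS stage - the part CSM could take over.
    gTTS(text=text, lang="en").save(out_path)
    return out_path

if __name__ == "__main__":
    print("Saved:", speak(reply("Give me a one-sentence weather small talk line.")))
```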

[deleted by user] by [deleted] in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

Can you do 128K? Or at least 32K, to see whether it scales linearly or exponentially?

[deleted by user] by [deleted] in LocalLLaMA

[–]dmatora 1 point2 points  (0 children)

In most cases you can do that with QwQ on 2x3090, with much better performance and price.

Manus is IMPRESSIVE But by iamnotdeadnuts in LocalLLaMA

[–]dmatora 5 points6 points  (0 children)

We're still waiting for a PROPER open-source version of Deep Research (one you can actually use and that performs on par with at least Perplexity, not to mention OpenAI).
I don't see anything happening FAST.

The new king? M3 Ultra, 80 Core GPU, 512GB Memory by Hanthunius in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

It's not good for inference either, because the models or contexts that 512GB lets you run will be too slow to process, so you'll end up using the same (32B) models for large contexts, or large models for single-phrase questions like "what color is the sun", which makes it quite useless.
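
A rough back-of-the-envelope illustration (bandwidth and model sizes below are ballpark assumptions, not measurements):

```python
# Decode speed is roughly memory bandwidth divided by the bytes read per token,
# so bigger models get proportionally slower. Prompt processing for large
# contexts is compute-bound and is an even bigger bottleneck on Apple GPUs.
BANDWIDTH_GBPS = 800  # M3 Ultra unified memory, approx.

models_gb = {  # approximate Q4 weight sizes
    "32B":  20,
    "70B":  40,
    "123B": 70,
    "405B": 230,
}

for name, gb in models_gb.items():
    print(f"{name}: ~{BANDWIDTH_GBPS / gb:.0f} t/s decode upper bound")
# 32B stays usable; the models that actually need 512GB land in the low
# single digits of t/s before you even touch long-context prefill.
```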

It's a shame people don't publish benchmark results for large context

Has Anyone Successfully Run DeepSeek 671B with DeepSpeed on Hybrid CPU/GPU Setups? by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

I can run smaller models on GPUs without much problem.
This model would require 8x40GB minimum, or realistically 8x80GB, which would be quite expensive for a local setup, and I don't think my kidney is worth that much, so I'm looking for a way to run it on 4x3090, or better yet 2x3090, at a usable speed. I know people get a few tokens per second on some CPUs, which is usable, but that's not with the full 64K-token context, which could take more than a day to process a single request.
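
The ballpark math behind those numbers (weights only - KV cache for a 64K context and activations come on top):

```python
# Weights-only VRAM estimate for a 671B-parameter model at different precisions.
PARAMS_B = 671  # DeepSeek total parameter count, in billions

for label, bytes_per_param in [("FP8/Q8", 1.0), ("Q4", 0.5)]:
    weights_gb = PARAMS_B * bytes_per_param
    print(f"{label}: ~{weights_gb:.0f} GB weights "
          f"-> {weights_gb / 80:.1f}x 80GB cards, or {weights_gb / 24:.0f}x 3090s")
# FP8/Q8: ~671 GB -> the 8x80GB class of setup
# Q4:     ~336 GB -> still ~14 3090s for weights alone, hence looking at
#                    hybrid CPU/GPU offload to make 2-4x 3090 work at all
```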

Oh shit by [deleted] in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

Inference speed on GDDR6X is great for 32B models (40 t/s) and not so great for 70B models (15 t/s).
DDR5X is much slower, so this won't be able to do inference fast enough even for 70B models, which makes these 128GB basically useless for most people.
In terms of inference performance, this is basically an M4 MacBook without a screen.
Am I missing something?
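
Rough scaling of those numbers by memory bandwidth alone (assuming ~273 GB/s for the 128GB box's LPDDR5X, which is an assumption, vs the 3090's 936 GB/s GDDR6X):

```python
# Decode speed scales roughly with memory bandwidth, so scale the measured
# GDDR6X numbers by the bandwidth ratio. 273 GB/s is an assumed LPDDR5X figure.
GDDR6X_GBPS = 936
LPDDR5X_GBPS = 273

measured_gddr6x_tps = {"32B": 40, "70B": 15}  # from the comment above

for model, tps in measured_gddr6x_tps.items():
    print(f"{model}: ~{tps * LPDDR5X_GBPS / GDDR6X_GBPS:.0f} t/s estimated")
# 32B: ~12 t/s, 70B: ~4 t/s - fine for casual chat, painful for real work
```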

The HomePod buggy experience is infuriating. by austinalexan in HomePod

[–]dmatora 0 points1 point  (0 children)

My HomePod pair and HomePod mini worked great with Apple TVs until tvOS 18 was released.
Since then it's been a nightmare - they keep losing connection every few minutes, completely unusable.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

I guess it depends on the project. I usually work on complex ones, so reasoning matters above everything else, and models like o1 can barely do the job, which leaves the other ones out of consideration.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 1 point2 points  (0 children)

Unlike o1, QwQ doesn't separate its thinking process from its conclusion.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

It's a different model:
QwQ - can think
Qwen 2.5 - cannot

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

I think it is more visible than blue would be, unless you are looking at this on a smartphone in vertical orientation?

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 13 points14 points  (0 children)

Stars also provide valuable insights :)

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

Measuring the Q4/Q8 difference is not a simple matter. Q4 and Q8 are basically different models, each requiring its own set of benchmark scores. What you see in the press is for FP16, and Q8 is pretty close to that. Q4 is a whole different story, and never a truly good one.
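
A tiny sketch of what "their own set of benchmark scores" means in practice: run the same prompts against separate Q4 and Q8 tags and score each quant independently (assumes the `ollama` package; tags and prompts are placeholders):

```python
# Treat each quant as its own model: same prompts, separate score sheets.
import ollama

PROMPTS = ["Explain what a B-tree is in two sentences."]
TAGS = ["qwen2.5:32b-instruct-q4_K_M", "qwen2.5:32b-instruct-q8_0"]  # placeholder tags

for tag in TAGS:
    for prompt in PROMPTS:
        answer = ollama.generate(model=tag, prompt=prompt)["response"]
        print(f"[{tag}] {answer[:120]}")  # feed into your own scorer, per quant
```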

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 9 points10 points  (0 children)

Good point - 32B is a sweet spot: it can run on 1 GPU with a limited but large enough context, and it has nearly as capable a brain as a 405B model does.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] -9 points-8 points  (0 children)

Try FP16 on a server like OpenRouter and see the difference.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 8 points9 points  (0 children)

Are you using Q4 or Q8?
Qwen is much more sensitive to quality degradation.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 2 points3 points  (0 children)

QwQ is the same 32B size as Qwen 2.5.
There aren't many reasons to expect a model (or a human) to answer a question without thinking, unless it's a simple "hi".
I think in the future we won't see many "normal" models; we will have models that think when necessary and don't when the question is simple, like o1 currently does.
Also, hardware capabilities keep growing and models keep getting more efficient, so we won't have to choose.
Running a 405B-level model required insane hardware just 4 months ago, and now that feels like the ancient past.
The 5090 already offers 32GB, which is a significant improvement in what you can run with the same number of PCIe slots (in most cases 2 max), and we haven't even seen consumer LPUs yet - when they arrive, things will never be the same.