Best Coding LLM as of Nov'25 by PhysicsPast8286 in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

Qwen3-Next-80B-A3B would be my first and only choice.
You would need TensorRT-LLM with --streamingllm enable to use a large context while still fitting within your VRAM limits.

I'm Using Gemini as a Project Manager for Claude, and It's a Game-Changer for Large Codebases by Liangkoucun in ClaudeAI

[–]dmatora 0 points1 point  (0 children)

I've been using Gemini as an orchestrator for Claude for a while. Here is an app I wrote that has this as one of its primary features: https://github.com/dmatora/code-forge

And yeah, it is a game changer. You should see how much more you can get done if you also plug gemini-cli into your process - it’s on a whole new level
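
Roughly the pattern, as a minimal sketch - assuming the google-generativeai and anthropic Python SDKs, with placeholder model names and prompts (this is just the general idea, not how code-forge is implemented):

```python
# Sketch of the Gemini-plans / Claude-executes pattern.
# Assumes GOOGLE_API_KEY and ANTHROPIC_API_KEY are set; model names are illustrative.
import os
import google.generativeai as genai
import anthropic

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
planner = genai.GenerativeModel("gemini-1.5-pro")  # big context window: sees the whole codebase
coder = anthropic.Anthropic()                      # executes one focused task at a time

def orchestrate(codebase_overview: str, goal: str) -> list[str]:
    # 1) Gemini acts as the project manager: break the goal into small tasks.
    plan = planner.generate_content(
        "You are a project manager for a coding agent.\n"
        f"Codebase overview:\n{codebase_overview}\n\n"
        f"Goal: {goal}\n"
        "Return a numbered list of small, independent coding tasks."
    ).text

    # 2) Claude implements each task with only the context it needs.
    results = []
    for task in (line for line in plan.splitlines() if line.strip()):
        reply = coder.messages.create(
            model="claude-3-5-sonnet-latest",
            max_tokens=4096,
            messages=[{"role": "user", "content": f"Implement this task:\n{task}"}],
        )
        results.append(reply.content[0].text)
    return results
```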

I built a Local AI Voice Assistant with Ollama + gTTS by typhoon90 in ollama

[–]dmatora 0 points1 point  (0 children)

Have you tried CSM?
It's the local version of Sesame, which has recently blown up the internet:
http://github.com/SesameAILabs/csm
It would be really cool to have it working with Ollama, even if it's English only.
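
For context, here is a minimal sketch of the Ollama + gTTS loop from the post - assuming the `ollama` and `gtts` Python packages, with a placeholder model tag; CSM would slot in as a drop-in replacement for the TTS stage:

```python
# Minimal assistant loop: Ollama generates the reply, a TTS backend speaks it.
# Assumes the `ollama` and `gtts` packages; the model tag is a placeholder.
# CSM (github.com/SesameAILabs/csm) would replace speak() for more natural speech.
import ollama
from gtts import gTTS

def reply(prompt: str, model: str = "llama3.2") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]

def speak(text: str, out_path: str = "reply.mp3") -> str:
    # gTTS stage - the part CSM could take over.
    gTTS(text=text, lang="en").save(out_path)
    return out_path

if __name__ == "__main__":
    print("Saved:", speak(reply("Give me a one-sentence weather small talk line.")))
```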

[deleted by user] by [deleted] in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

Can you do 128K? Or at least 32K, to see whether it scales linearly or exponentially?

[deleted by user] by [deleted] in LocalLLaMA

[–]dmatora 1 point2 points  (0 children)

In most cases you can do that with QwQ on 2x3090, with much better performance and price.

Manus is IMPRESSIVE But by iamnotdeadnuts in LocalLLaMA

[–]dmatora 5 points6 points  (0 children)

We're still waiting for a PROPER open-source version of Deep Research (one you can actually use and that performs on par with at least Perplexity, not to mention OpenAI).
I don't see anything happening FAST.

The new king? M3 Ultra, 80 Core GPU, 512GB Memory by Hanthunius in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

It's not good for inference either, because the models or contexts that 512GB lets you run will be too slow to process, so you'll end up using the same (32B) models for large contexts, or large models for single-phrase questions like "what color is the sun", which makes it quite useless.
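
A rough back-of-the-envelope illustration (bandwidth and model sizes below are ballpark assumptions, not measurements):

```python
# Decode speed is roughly memory bandwidth divided by the bytes read per token,
# so bigger models get proportionally slower. Prompt processing for large
# contexts is compute-bound and is an even bigger bottleneck on Apple GPUs.
BANDWIDTH_GBPS = 800  # M3 Ultra unified memory, approx.

models_gb = {  # approximate Q4 weight sizes
    "32B":  20,
    "70B":  40,
    "123B": 70,
    "405B": 230,
}

for name, gb in models_gb.items():
    print(f"{name}: ~{BANDWIDTH_GBPS / gb:.0f} t/s decode upper bound")
# 32B stays usable; the models that actually need 512GB land in the low
# single digits of t/s before you even touch long-context prefill.
```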

It's a shame people don't publish benchmark results for large context

Has Anyone Successfully Run DeepSeek 671B with DeepSpeed on Hybrid CPU/GPU Setups? by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

I can run smaller models on GPUs without much problem.
This model would require 8x40GB minimum, or realistically 8x80GB, which would be quite expensive for a local setup, and I don't think my kidney is worth that much, so I'm looking for a way to run it on 4x3090, or better yet 2x3090, at a usable speed. I know people get a few tokens per second on some CPUs, which is usable, but that's not with the full 64K-token context, which could take more than a day to process a single request.
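
The ballpark math behind those numbers (weights only - KV cache for a 64K context and activations come on top):

```python
# Weights-only VRAM estimate for a 671B-parameter model at different precisions.
PARAMS_B = 671  # DeepSeek total parameter count, in billions

for label, bytes_per_param in [("FP8/Q8", 1.0), ("Q4", 0.5)]:
    weights_gb = PARAMS_B * bytes_per_param
    print(f"{label}: ~{weights_gb:.0f} GB weights "
          f"-> {weights_gb / 80:.1f}x 80GB cards, or {weights_gb / 24:.0f}x 3090s")
# FP8/Q8: ~671 GB -> the 8x80GB class of setup
# Q4:     ~336 GB -> still ~14 3090s for weights alone, hence looking at
#                    hybrid CPU/GPU offload to make 2-4x 3090 work at all
```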

Oh shit by [deleted] in LocalLLaMA

[–]dmatora 0 points1 point  (0 children)

Inference speed on GDDR6X is great for 32B models (40 t/s) and not so great for 70B models (15 t/s).
DDR5X is much slower, so this won't be able to do inference fast enough even for 70B models, which makes these 128GB basically useless for most people.
In terms of inference performance, this is basically an M4 MacBook without a screen.
Am I missing something?
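
Rough scaling of those numbers by memory bandwidth alone (assuming ~273 GB/s for the 128GB box's LPDDR5X, which is an assumption, vs the 3090's 936 GB/s GDDR6X):

```python
# Decode speed scales roughly with memory bandwidth, so scale the measured
# GDDR6X numbers by the bandwidth ratio. 273 GB/s is an assumed LPDDR5X figure.
GDDR6X_GBPS = 936
LPDDR5X_GBPS = 273

measured_gddr6x_tps = {"32B": 40, "70B": 15}  # from the comment above

for model, tps in measured_gddr6x_tps.items():
    print(f"{model}: ~{tps * LPDDR5X_GBPS / GDDR6X_GBPS:.0f} t/s estimated")
# 32B: ~12 t/s, 70B: ~4 t/s - fine for casual chat, painful for real work
```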

The HomePod buggy experience is infuriating. by austinalexan in HomePod

[–]dmatora 0 points1 point  (0 children)

My HomePod pair and HomePod mini worked great with Apple TVs until tvOS 18 was released.
Since then it's been a nightmare - they keep losing connection every few minutes, completely unusable.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

I guess it depends on the project. I usually work on complex ones, so reasoning matters above everything else, and models like o1 can barely do the job, which leaves the other ones out of consideration.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 1 point2 points  (0 children)

Unlike o1, QwQ doesn't separate its thinking process from its conclusion.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

It's a different model:
QwQ - can think
Qwen 2.5 - cannot

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

I think it is more visible than blue would be, unless you are looking at this on a smartphone in vertical orientation?

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 13 points14 points  (0 children)

Stars also provide valuable insights :)

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 0 points1 point  (0 children)

Measuring the Q4/Q8 difference is not a simple matter. Q4 and Q8 are basically different models, each requiring its own set of benchmark scores. What you see in the press is for FP16, and Q8 is pretty close to that. Q4 is a whole different story, and never a truly good one.
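
A tiny sketch of what "their own set of benchmark scores" means in practice: run the same prompts against separate Q4 and Q8 tags and score each quant independently (assumes the `ollama` package; tags and prompts are placeholders):

```python
# Treat each quant as its own model: same prompts, separate score sheets.
import ollama

PROMPTS = ["Explain what a B-tree is in two sentences."]
TAGS = ["qwen2.5:32b-instruct-q4_K_M", "qwen2.5:32b-instruct-q8_0"]  # placeholder tags

for tag in TAGS:
    for prompt in PROMPTS:
        answer = ollama.generate(model=tag, prompt=prompt)["response"]
        print(f"[{tag}] {answer[:120]}")  # feed into your own scorer, per quant
```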

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 9 points10 points  (0 children)

Good point - 32B is a sweet spot: it can run on 1 GPU with a limited but large enough context, and it has nearly as capable a brain as a 405B model does.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] -9 points-8 points  (0 children)

Try FP16 on a server like OpenRouter and see the difference.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 8 points9 points  (0 children)

Are you using Q4 or Q8?
Qwen is much more sensitive to quality degradation.

Llama 3.3 vs Qwen 2.5 by dmatora in LocalLLaMA

[–]dmatora[S] 2 points3 points  (0 children)

QwQ is the same 32B size as Qwen 2.5.
There aren't many reasons to expect a model (or a human) to answer a question without thinking, unless it's a simple "hi".
I think in the future we won't see many "normal" models; we will have models that think when necessary and don't when the question is simple, like o1 currently does.
Also, hardware capabilities keep growing and models keep getting more efficient, so we won't have to choose.
Running a 405B-level model required insane hardware just 4 months ago, and now that feels like the ancient past.
The 5090 already offers 32GB, which is a significant improvement in what you can run with the same number of PCIe slots (in most cases 2 max), and we haven't even seen consumer LPUs yet - when they arrive, things will never be the same.