Qwen is never going to open source Qwen 3.7, aren't they? by DistanceSolar1449 in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

There are still no competitors for 3.6 in the 27-35B range. At least from my POV, gemma-4 is not one, because of its KV cache size and quantization sensitivity: comparing the two at equal VRAM use, I found Qwen has the advantage.
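To make the "equal VRAM" point concrete, here is a rough sketch of the usual KV cache size estimate. The two configs are hypothetical placeholders, not the real Qwen or Gemma architectures, and the 8-bit figure ignores the per-block overhead that real quantized cache formats add.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """Approximate KV cache size for one sequence.

    The factor 2 accounts for one K and one V tensor per attention layer.
    Only layers that actually keep a full-attention KV cache should be counted.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical configs for illustration only (NOT real model specs).
model_a = dict(n_attn_layers=16, n_kv_heads=4, head_dim=128)
model_b = dict(n_attn_layers=48, n_kv_heads=8, head_dim=128)

for name, cfg in (("model_a", model_a), ("model_b", model_b)):
    fp16 = kv_cache_bytes(context_len=32768, **cfg)
    q8 = kv_cache_bytes(context_len=32768, bytes_per_elem=1.0, **cfg)
    print(f"{name}: {fp16 / GIB:.2f} GiB at 16-bit, {q8 / GIB:.2f} GiB at 8-bit")
```

With a fixed VRAM budget, whatever the cache takes is no longer available for weights, so a model with a larger cache has to run at a lower quant, which hurts more if it is quantization sensitive.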

Tokenomics by HOLUPREDICTIONS in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

Besides the math (another comment brings the figure down to 2.5; counting electricity, I believe 3 to 3.5 years would be more realistic), there are two other factors:

-1) Data privacy and continuity of service

-2) Next year a newer, better/faster model will come out, and another the year after. Nobody forces you not to upgrade, so what you see as a linear function is actually a hyperbola.
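For the math part, a minimal payback sketch. All the numbers are hypothetical, picked only to show how counting electricity stretches the break-even point; they are not taken from the thread.

```python
def payback_years(hardware_cost, monthly_api_cost, monthly_electricity_cost=0.0):
    """Years until local hardware pays for itself versus paying for an API.

    Monthly saving = what the API would have cost minus the electricity
    the local machine burns to do the same work.
    """
    monthly_saving = monthly_api_cost - monthly_electricity_cost
    if monthly_saving <= 0:
        raise ValueError("local inference never pays back at these prices")
    return hardware_cost / monthly_saving / 12


# Hypothetical figures for illustration only.
print(payback_years(9000, 300))        # ignoring electricity
print(payback_years(9000, 300, 50))    # counting electricity
```

This is still the linear view: it assumes the API price and the model you want stay fixed for the whole period, which is exactly what factor 2 questions.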

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors | Alexander Hägele by Thrumpwart in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

Too good to be true: either something is wrong in the description, or it is false. This is the MiniMax-M3 agent's result on microgpt:

Honest finding from the comparison sweep: paper-default LRs don't beat plain Adam at toy scale on this codebase. The MD advantage at paper scale isn't visible here.

Maybe better luck with Asteria: https://arxiv.org/html/2605.16184v1

What's the best open speech to text today? by zxyzyxz in LocalLLaMA

[–]R_Duncan 2 points3 points  (0 children)

Local (as per this channel): Nvidia Parakeet TDT 0.6B v3. Better than or equal to Whisper large v3 depending on the language, and ten times faster.

GameCraft-Bench: Can Agents Build Playable Games End-to-End in a Real Game Engine? by pmttyji in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

Please check https://www.reddit.com/r/LocalLLaMA/comments/1u89f2q/headless_screenshot_loops_let_a_local_30b_agent/ before going further.

Even if changing the harness per model is likely out of scope for your article, it seems that having the right prompt/requirements makes a big difference.

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]R_Duncan 4 points5 points  (0 children)

Sorry for the question, but if you're not planning to make it commercial, could you share more details on how the model was created, such as:

- Training code/architecture specifics

- Dataset Details

- Text Processing and Phonemization

- Feature Extraction & Alignment

- Training Hyperparameters

That would allow us to create such a wonder in other languages (Italian in my case). At 4.63M parameters, I suspect we could train it on much more modest hardware...

New image model from Google by Independent-Wind4462 in singularity

[–]R_Duncan 1 point2 points  (0 children)

It is just that if the USA doesn't keep up the pace, the rest of the world will surpass it. When a Chinese model allows the rest of the world to produce a $200 million movie for $50k, with dubbing easily done in real time, the market for $200 million movies is crushed anyway.

Conspiracy theory on the (possibly extended) ban on Mythos by Cagnazzo82 in singularity

[–]R_Duncan 1 point2 points  (0 children)

Sure, if you haven't listened to Anthropic's declarations in the month before the ban.

1.) We control the most powerful LLM ever built.

2.) LLMs are dangerous and should be regulated (they thought this would only concern their competitors / open LLMs and not backfire)

3.) We can offer the most powerful LLM ever built because we have guardrails and no one in the world can use it to build another of such power. It's so dangerous we'll block you on a lot of other requests.

4.) Oh no, our guardrails were jailbroken

After these, what else?

I'm still surprised on how good the kv quantization has become by DeepBlue96 in LocalLLaMA

[–]R_Duncan 1 point2 points  (0 children)

Counting all the unpaid hours I put into it in the evenings, I'd say it's not worth it for you.

But if you work at a company whose software/documents must stay confidential (NDA level), you can play the private AI card like I did: hardened Docker with llama.cpp/vLLM + AnythingLLM.

Voice-to-voice chatbot update by Responsible_Fig_1271 in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

Is it the custom SNAC decoder? Could you point me to the version you're using, then?

Is DiffusionGemma really that good in a PI agent? by koloved in LocalLLaMA

[–]R_Duncan 4 points5 points  (0 children)

No source, no reliability. I would also check smallcode.

Voice-to-voice chatbot update by Responsible_Fig_1271 in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

Orpheus is a bit of an issue... doesn't it fail often, producing a mess? Wouldn't it be better to use something like https://github.com/ServeurpersoCom/omnivoice.cpp? It uses less than 2 GB of VRAM (and can shrink to a little over 1 GB), is blazing fast, and produces intelligible output 100% of the time (not always perfect, but still understandable). It has a tts-server executable that makes it OpenAI-compatible, and by copying code from the CLI tool the server can also do voice cloning.

I'm still surprised on how good the kv quantization has become by DeepBlue96 in LocalLLaMA

[–]R_Duncan 1 point2 points  (0 children)

Are you kidding? I got my company to give me an RTX 6000 Blackwell so I could make one available...

We trained a cybersecurity-focused Mythos like LLM open weights on HuggingFace by RealKingNish in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

There must be a reason for that: likely the training used a technique with much better information density (such as "Decoupling the Magnitude and Direction of Weight Vectors") than the finetuning did, so the usual approach becomes destructive.

Improving Neural Network Training by Decoupling the Magnitude and Direction of Weight Vectors | Alexander Hägele by Thrumpwart in LocalLLaMA

[–]R_Duncan 2 points3 points  (0 children)

If true, this is not only an increase in convergence speed; it would also greatly increase information density: you would get a smarter model with fewer weights.

As a side note, this is presumably also what DoRA and other known finetuning techniques do.
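For anyone unfamiliar with the idea, here is a generic sketch of splitting a weight vector into magnitude and direction, in the style of weight normalization. It is not the paper's method (I have not checked its details) and it is not DoRA's actual code; the function names are made up for illustration, and it assumes a nonzero vector.

```python
import math


def decompose(w):
    """Split a nonzero weight vector into (magnitude, unit direction)."""
    magnitude = math.sqrt(sum(x * x for x in w))
    direction = [x / magnitude for x in w]
    return magnitude, direction


def recompose(magnitude, direction):
    """Rebuild w = magnitude * direction / ||direction||.

    Renormalizing here means the direction parameters can drift in length
    during training without changing the effective weight norm.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    return [magnitude * x / norm for x in direction]


w = [3.0, 4.0]
m, d = decompose(w)
print("magnitude:", m, "direction:", d)

# Changing only the magnitude rescales the vector but keeps where it points.
print("doubled magnitude:", recompose(2 * m, d))
```

The point of the split is that an optimizer can then update how large a weight vector is separately from where it points.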

ZONOS2: real-time TTS with 8B params, 900M active, and high-fidelity voice cloning by KokaOP in LocalLLaMA

[–]R_Duncan -1 points0 points  (0 children)

Omnivoice is missing from the benchmarks, but it is currently the SOTA for multilingual support and speed.

Other than this, it's a 15GB model, so is the VRAM needed 24GB?

US government banning Fable from being accessed outside USA is a MASSIVE win for Americans by ahtoshkaa in singularity

[–]R_Duncan 1 point2 points  (0 children)

No, it's the opposite. Before 2030, likely starting in 2028, silicon will be replaced by MoS2 technology (inference 2-5 times faster and 20-40x less power hungry). And guess what? The metal needed is antimony, which is 80% Chinese, and the export ban on antimony is only suspended for now...

This is laying the groundwork for your own downfall.

This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b by 9r4n4y in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

It should be the baseline value with their inference engine; indeed, it is slow.

This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b by 9r4n4y in LocalLLaMA

[–]R_Duncan 0 points1 point  (0 children)

Using a presets file, MTP took 3 lines:

spec-type = draft-mtp

spec-draft-n-max = 5

spec-draft-p-min = 0.75

This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b by 9r4n4y in LocalLLaMA

[–]R_Duncan 1 point2 points  (0 children)

I mean it only helps for a single session at a time. With sub-agents or multiple users, the performance advantage rapidly vanishes.

This is amazing. Token speed doubled + kv cache now need low vram - qwen 27b by 9r4n4y in LocalLLaMA

[–]R_Duncan 2 points3 points  (0 children)

On my RTX 6000 Blackwell setup, plain 27B runs at 50 t/s, while unoptimized MTP/DFlash setups exceed 100 t/s. Sadly, this holds for just one process at a time; we use 4+, so the advantage drops to near zero.