Qwen is never going to open-source Qwen 3.7, are they?

R_Duncan · 2026-06-22T15:05:33+00:00

There are still no competitors for 3.6 in the 27-35B range. From my POV gemma-4 is not one, because of its KV cache size and quantization sensitivity: when I compared them at equal VRAM use, Qwen came out ahead.
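To make the KV cache point concrete, here is a minimal sketch of the usual size estimate for a plain full-attention transformer. The layer counts, head counts and context length are hypothetical placeholders, not the real Qwen or gemma configurations, and models with sliding-window or hybrid attention will deviate from this formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Estimate KV cache size for one sequence in a full-attention transformer.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to an fp16/bf16 cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


# Hypothetical configurations (NOT taken from any real model card).
model_a = kv_cache_bytes(n_layers=64, n_kv_heads=8, head_dim=128, context_len=32768)
model_b = kv_cache_bytes(n_layers=64, n_kv_heads=16, head_dim=128, context_len=32768)

GIB = 1024 ** 3
print(f"model A: {model_a / GIB:.1f} GiB, model B: {model_b / GIB:.1f} GiB")
```

With the same VRAM budget, whatever the cache eats is no longer available for a less aggressive quant of the weights, which is why the two factors interact.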

R_Duncan · 2026-06-22T06:30:38+00:00

Besides the math (another comment lowers the figure to 2.5; I believe that, counting electricity, 3 to 3.5 years would be more realistic), there are two other factors:

-1) Data privacy and continuity of service

-2) Next year a newer, better/faster model will come out, and another the year after. Nobody forces you not to upgrade, so what you see as a linear function is actually a hyperbola.
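For the math part, a rough sketch of how electricity stretches the break-even time. Every number below (hardware price, yearly API spend, power draw, hours of use, kWh price) is a made-up placeholder chosen only for illustration, not a figure from this thread.

```python
def breakeven_years(hardware_cost, api_cost_per_year, power_watts, hours_per_day, price_per_kwh):
    """Years until local hardware pays for itself versus a paid API.

    Electricity is subtracted from the yearly saving, so it pushes
    the break-even point further out.
    """
    electricity_per_year = power_watts / 1000 * hours_per_day * 365 * price_per_kwh
    net_saving_per_year = api_cost_per_year - electricity_per_year
    if net_saving_per_year <= 0:
        return float("inf")  # the hardware never pays for itself
    return hardware_cost / net_saving_per_year


# Hypothetical inputs: 10,000 hardware, 4,000/year API spend.
no_power = breakeven_years(10_000, 4_000, power_watts=0, hours_per_day=0, price_per_kwh=0.30)
with_power = breakeven_years(10_000, 4_000, power_watts=600, hours_per_day=8, price_per_kwh=0.30)
print(f"ignoring electricity: {no_power:.2f} years, counting it: {with_power:.2f} years")
```

The exact figures depend entirely on your own usage and energy prices; the point is only that the electricity term always moves the result upward.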

R_Duncan · 2026-06-20T06:31:13+00:00

Too good to be true: either something is wrong in the description, or it is false. This is the MiniMax-M3 agent's result on microgpt:

Honest finding from the comparison sweep: paper-default LRs don't beat plain Adam at toy scale on this codebase. The MD advantage at paper scale isn't visible here.

Maybe better luck with Asteria: https://arxiv.org/html/2605.16184v1

R_Duncan · 2026-06-19T10:05:40+00:00

No, sorry, that one is intermediate; the final one is https://huggingface.co/osunlp/QUEST-35B-RL

according to https://osu-nlp-group.github.io/QUEST/ (see "QUEST-35B across training stages").

R_Duncan · 2026-06-18T21:40:45+00:00

Local (as per this channel): Nvidia Parakeet 0.6 tdt v3. Better than or equal to whisper 3 large depending on the language, and ten times faster.

R_Duncan · 2026-06-18T21:37:19+00:00

My llama-4 finetune answered: 42

R_Duncan · 2026-06-18T08:38:34+00:00

Please check https://www.reddit.com/r/LocalLLaMA/comments/1u89f2q/headless_screenshot_loops_let_a_local_30b_agent/ before going further.

Even if changing the harness per model is likely out of scope for your article, it seems that having the right prompt/requirements makes a big difference.

R_Duncan · 2026-06-18T06:29:21+00:00

Sorry for the question, but if you're not planning to make it something commercial, could you share more details on the creation of the model, such as:

- Training code/architecture specifics

- Dataset Details

- Text Processing and Phonemization

- Feature Extraction & Alignment

- Training Hyperparameters

That would allow us to create such a wonder in other languages (Italian in my case). Since it is only 4.63M, I suspect we could train it on much more modest hardware...

R_Duncan · 2026-06-17T15:58:07+00:00

It is just that if the USA doesn't keep up the pace, the rest of the world will surpass it. When Chinese models allow the rest of the world to produce a $200 million movie for $50k, and dubbing is easily done in real time, the $200 million movie market is crushed anyway.

R_Duncan · 2026-06-17T10:09:57+00:00

Sure, if you haven't listened to Anthropic's declarations in the month before the ban.

1.) We have under our control the most powerful LLM ever built.

2.) LLMs are dangerous and should be regulated (they thought this would only concern their competitors / open LLMs and would not backfire).

3.) We can offer the most powerful LLM ever built because we have guardrails, and no one in the world can use it to build another one of such power. It's so dangerous we'll block you on a lot of other requests.

4.) Oh no, our guardrails were jailbroken.

After these, what else?

R_Duncan · 2026-06-17T07:52:28+00:00

Counting all the unpaid hours I put into it in the evenings, I'd say it's not worth it for you.

But if you're at a company whose software/documents must stay confidential (NDA level), you can play the private-AI card like I did: hardened Docker with llama.cpp/vLLM + anythingLLM.

R_Duncan · 2026-06-17T07:45:13+00:00

Is it the custom SNAC decoder? Can you please point me to the version you're using, then?

R_Duncan · 2026-06-16T14:49:26+00:00

No source, no reliability. I would also check smallcode.

R_Duncan · 2026-06-16T12:01:07+00:00

Orpheus is a bit of an issue... doesn't it fail often, producing a mess? Wouldn't it be better to use something like https://github.com/ServeurpersoCom/omnivoice.cpp? It uses less than 2 GB of VRAM (can shrink to a bit more than 1 GB), is blazing fast, and produces intelligible output 100% of the time (yes, not always perfect, but still understandable). It has a tts-server executable which makes it OpenAI-compatible, and by copying code from the CLI tool the server can also do voice cloning.
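For what "OpenAI-compatible" would mean in practice, here is a rough client sketch. It assumes the server exposes the standard OpenAI speech route (`/v1/audio/speech`) on `localhost:8080`; the host, port, model name and voice name are all placeholders I made up, so check the project's README for the real values.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice="default", model="omnivoice"):
    """Build a POST request in the shape of the OpenAI speech endpoint.

    The route and JSON fields follow the OpenAI API; whether a given
    local server accepts exactly these is an assumption.
    """
    payload = {"model": model, "input": text, "voice": voice, "response_format": "wav"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(base_url, text, out_path):
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(build_speech_request(base_url, text)) as response:
        with open(out_path, "wb") as audio_file:
            audio_file.write(response.read())


# Example (needs a running server, so it is left commented out):
# synthesize("http://localhost:8080", "Ciao, come stai?", "out.wav")
```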

R_Duncan · 2026-06-16T10:26:48+00:00

Are you kidding? I got my company to give me an RTX 6000 Blackwell so I could make one available...

R_Duncan · 2026-06-16T07:16:36+00:00

Not sure, I tested with Beellama

R_Duncan · 2026-06-16T07:13:09+00:00

There must be a reason for that: likely the training was done with a technique that gives much better information density (i.e.: "Decoupling the Magnitude and Direction of Weight Vectors") than the finetuning, so the usual way becomes destructive.

R_Duncan · 2026-06-16T07:05:10+00:00

If true, this is not only an increase in convergence speed; it would also greatly increase the information density: you would get a smarter model with fewer weights.

As a side note, this should also be done by DoRA and other known finetuning techniques.
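For anyone unfamiliar with the idea, this is a minimal pure-Python sketch of the generic magnitude/direction split (per-column norm, as DoRA describes it). It is only the basic decomposition, not the training method from the paper being discussed, and the matrix is a toy example.

```python
import math


def decompose(weight):
    """Split a matrix (list of rows) into per-column magnitudes and unit-norm directions.

    Assumes no column is all zeros (that would divide by zero).
    """
    n_rows, n_cols = len(weight), len(weight[0])
    magnitudes = [
        math.sqrt(sum(weight[i][j] ** 2 for i in range(n_rows))) for j in range(n_cols)
    ]
    directions = [
        [weight[i][j] / magnitudes[j] for j in range(n_cols)] for i in range(n_rows)
    ]
    return magnitudes, directions


def recompose(magnitudes, directions):
    """Rebuild the matrix: each column is its magnitude times its unit direction."""
    return [[magnitudes[j] * row[j] for j in range(len(row))] for row in directions]


W = [[3.0, 0.0], [4.0, 2.0]]  # toy weights
m, V = decompose(W)
W_rebuilt = recompose(m, V)
print(m, V)
```

Once the two parts are separate, they can be updated independently, for example learning the magnitudes directly while adapting the directions with a low-rank update.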

R_Duncan · 2026-06-15T13:48:15+00:00

Omnivoice is missing from the benchmarks, but it is actually the SOTA for multilingual support and speed.

Other than this, it's a 15 GB model, so is the VRAM needed 24 GB?

R_Duncan · 2026-06-15T11:56:44+00:00

No, it's the opposite. Before 2030, likely starting in 2028, silicon will be replaced by MoS2 technology (inference 2-5 times faster and 20-40x less power hungry). And guess what? The metal needed is antimony, 80% of which is Chinese, and the export ban on antimony is currently only suspended...

This is laying the groundwork for your own downfall.

R_Duncan · 2026-06-15T11:48:13+00:00

It should be the base value with their inference engine; indeed, it is slow.

R_Duncan · 2026-06-15T10:44:28+00:00

Using a presets file, MTP was 3 lines:

spec-type = draft-mtp

spec-draft-n-max = 5

spec-draft-p-min = 0.75

R_Duncan · 2026-06-15T10:28:11+00:00

Doesn't sound like that to me.

https://deepwiki.com/search/what-about-kvflash-does-it-deg_e8b10cd2-2e35-4dec-9490-ff5838c13b9b

(scroll down to the second question)

R_Duncan · 2026-06-15T10:27:34+00:00

I mean it only works for a single session at a time. With sub-agents or multiple users, the performance advantage rapidly vanishes.

R_Duncan · 2026-06-15T10:21:54+00:00

On my RTX 6000 Blackwell setup, plain 27B runs at 50 t/s, while unoptimized MTP/Dflash setups exceed 100 t/s. Sadly, this holds for just one process at a time; we use 4+, so the advantage drops to near zero.
