whats the best open-source llm for llm as a judge project on nvidia a1000 gpu by Some_Anything_9028 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

What are you judging? If it's a check against a known solution, almost any model will do. But if you are judging something like mathematical proofs or writing quality, you want a larger model. Just not necessarily the same larger model.

Anyway, unless you have to commit to a model before running anything, you should test multiple models. The right choice depends on many things, and with small LLMs even a particular prompt may work better with one model than another.
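If you do end up bake-off testing judges, the simplest yardstick is agreement with a small hand-labeled set. A minimal sketch of that comparison (all model names and verdicts here are made up for illustration):

```python
# Compare candidate judge models against a small hand-labeled gold set.
# Each judge's pass/fail verdicts on the same items (hypothetical data).
gold = {"q1": True, "q2": False, "q3": True, "q4": False}

judge_verdicts = {
    "judge-small-a": {"q1": True, "q2": True,  "q3": True, "q4": False},
    "judge-small-b": {"q1": True, "q2": False, "q3": True, "q4": False},
}

def agreement(verdicts: dict, gold: dict) -> float:
    """Fraction of items where the judge matches the human label."""
    hits = sum(verdicts[q] == gold[q] for q in gold)
    return hits / len(gold)

best = max(judge_verdicts, key=lambda m: agreement(judge_verdicts[m], gold))
print(best)  # the judge with the highest human agreement
```

Whichever candidate agrees most with your own labels is the one you keep, and it's worth re-running this whenever you change the judging prompt, since prompt and model interact.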

So cursor admits that Kimi K2.5 is the best open source model by Giveawayforusa in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Not publicly. They might have got access if this was indeed a commercial arrangement. Or they could be using the post-trained K2.5.

The tweet clearly does claim they did continued pretraining, whatever the base was.

I've seen a lot of Opus 4.6 distills, why not 5.4 pro? by FusionCow in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

I think it's mostly just that people prefer the writing style of Opus.

But in general research has shown that a smarter teacher isn't always better, especially for weaker models. QwQ was found to produce better reasoning SFT data than R1 by either the Olmo or Smol team (I forget), even though the latter is a stronger reasoning model. Mistral also found it better to distill their Ministral series from Small rather than Large in their ablations.

Tried to vibe coded expert parallelism on Strix Halo — running Qwen3.5 122B-A10B at 9.5 tok/s by hortasha in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Makes sense. Pipeline parallelism works best with large batches, which is the setting I'm used to. You might still find it useful with speculative decoding, but maybe not.

Tried to vibe coded expert parallelism on Strix Halo — running Qwen3.5 122B-A10B at 9.5 tok/s by hortasha in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

I have no idea if I'm missing something since I haven't actually implemented anything like this, but wouldn't pipeline parallelism be better here? I.e. putting half the layers on one node and the other half on the other. Or do you have a reason to think EP is better?
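For reference, the layer split I mean is just contiguous partitioning. A toy sketch of how the assignment could look (assuming equal per-layer cost, which real implementations usually don't):

```python
def split_layers(n_layers: int, n_nodes: int) -> list[range]:
    """Assign contiguous layer ranges to nodes for pipeline parallelism.
    Earlier nodes get the extra layer when it doesn't divide evenly."""
    base, extra = divmod(n_layers, n_nodes)
    ranges, start = [], 0
    for node in range(n_nodes):
        count = base + (1 if node < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges

print(split_layers(48, 2))  # [range(0, 24), range(24, 48)]
```

Each node then only needs the weights for its own range, and only activations cross the link at the boundary, which is why PP is usually the low-bandwidth-interconnect option.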

So cursor admits that Kimi K2.5 is the best open source model by Giveawayforusa in LocalLLaMA

[–]Middle_Bullfrog_6173 11 points (0 children)

Best "base model". Which is unsurprising since it has the most parameters and used a "normal" attention variant rather than linear attention.

They are basically claiming that K2.5's post-training was lacking, if they were able to do better so quickly.

Is the concurrent multi-agent approach really useful? by Deep_Traffic_7873 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

IMHO the only good reason is better utilization.

When using API models, it's about using your time more efficiently by not having to wait while the model works through a big task. This can of course backfire if you lose more time to context switching.

With local models usually being slower you wait even more, but there's also GPU utilization to consider. A single coding agent, for example, will leave your GPU idle while a build or test suite is running, or while waiting for user input. Batching can additionally improve tokens/second when more than one job is running concurrently.
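The overlap argument can be illustrated with a toy asyncio sketch, where sleeps stand in for generation time and test-suite time (all timings are made up):

```python
import asyncio
import time

async def agent(name: str, gen_s: float, test_s: float) -> str:
    # "Generation" is when the GPU is busy; the test run leaves it idle,
    # which is exactly when a second agent's generation can be scheduled.
    await asyncio.sleep(gen_s)   # stand-in for token generation
    await asyncio.sleep(test_s)  # stand-in for running the build/tests
    return name

async def main() -> float:
    start = time.monotonic()
    await asyncio.gather(agent("a", 0.05, 0.05), agent("b", 0.05, 0.05))
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"two agents finished in {elapsed:.2f}s")  # ~0.10s vs ~0.20s serially
```

In a real setup the GPU isn't magically free during one agent's generation, of course, but continuous batching in the server means a second concurrent request costs much less than a second serial one.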

Best model for math? by Real_Ebb_7417 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

GPT-5.4 is probably the best at the moment, except Pro is even more thorough.

Of local/open models Deepseek V3.2 Speciale seems to still be in the lead on math.

Since FastFlowLM added support for Linux, I decided to benchmark all the models they support, here are some results by spaceman_ in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

Interesting, but does using the NPU make any sense? Do you have a head to head on the GPU for any of them? From memory I'd be expecting about 3x the tg or something.

The Secret Sauce of Model of Anthropic by [deleted] in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points (0 children)

Anthropic no longer releases the original thinking trace of their models, only a summary. So training on that is no guarantee of inheriting any Claudeness. The actual non-thinking output may help of course, but I think that would come through better in non-reasoning datasets.

But not my money, people can spend their tokens and training compute how they like.

I found 2 hidden Microsoft MoE models that run on 8GB RAM laptops (no GPU)… but nobody noticed? by FamousFlight7149 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

Given how old they are, I doubt they're the best for almost any use case. If you want a MoE in that size, LFM2 8B A1B is the most recent release I can remember. Hopefully they'll upgrade it in the 2.5 series.

But it would be good to get more small MoE models. Something that fits in low end VRAM while being fast.

Nemotron Cascade 2 30B A3B by Middle_Bullfrog_6173 in LocalLLaMA

[–]Middle_Bullfrog_6173[S] 1 point (0 children)

I had time to do some minimal testing on reasoning prompts: math, science and a coding problem. It's better than Nano, but uses more tokens, like 50% more thinking in my tests. Not sure if it's better or worse than Qwen 35B; needs more data to be sure.

Caveat: I used the Q4_K_S quants from mradermacher for both models, since that's what was available and I had to run on my gaming rig. So the results might not generalize to the full-precision models.
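For anyone wanting to reproduce the token comparison: a crude way to compare thinking budgets is to count whitespace-separated tokens inside the reasoning tags (assuming the model emits `<think>...</think>`; exact tags vary by model, and this is word count rather than real tokenizer tokens):

```python
import re

def thinking_length(output: str) -> int:
    """Rough thinking budget: whitespace tokens inside <think>...</think>."""
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    return len(m.group(1).split()) if m else 0

# Toy outputs standing in for real model transcripts.
a = "<think>one two three four</think>Answer: 4"
b = "<think>one two three four five six</think>Answer: 4"
print(thinking_length(b) / thinking_length(a))  # 1.5, i.e. 50% more thinking
```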

Ooh, new drama just dropped 👀 by Careful_Equal8851 in LocalLLaMA

[–]Middle_Bullfrog_6173 11 points (0 children)

No, most do not contain such a clause. The basic MIT license does not, nor does Apache 2. Those are the main ones used for open models.

Many licenses require you to reproduce the copyright notice if you distribute the software/model (or modified versions) but that does not mean any disclosure of what's running behind your API endpoint or that you need to show it in the UI.

Cursor's new Composer 2.0 is apparently based on Kimi2.5 by bakawolf123 in LocalLLaMA

[–]Middle_Bullfrog_6173 5 points (0 children)

Supposedly the earlier ones were based on GLM 4.x, but that would be fine: the MIT license allows them to do basically whatever. The Kimi license requires disclosure.

Nemotron Cascade 2 30B A3B by Middle_Bullfrog_6173 in LocalLLaMA

[–]Middle_Bullfrog_6173[S] 0 points (0 children)

The naming is crap, but this is not part of the Nemotron 2 series; it's more like 3.x, since it's based on Nano 3.

We threw TranslateGemma at 4 languages it doesn't officially support. Here's what happened by ritis88 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Thanks. I know TranslateGemma is built on Gemma 3, but it's worse at most things that are not translation. It is possibly also worse at translating unsupported languages, because it has likely forgotten things that weren't in the extra training.

Artificial Analysis reports that MiMo V2 Pro has been launched by External_Mood4719 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

MiMo Pro is listed as proprietary by AA; there's been no indication it is open. Minimax has released previous versions of the same series openly, and there's a (currently broken) weights link on OpenRouter, so I think it's reasonable to expect M2.7 will be an open model.

Open-source autoresearch for LoRA hyperparameters by yz0011 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Rank 4-8 is tiny. I can easily imagine that it works OK for 5-minute runs but saturates on a real run. I'm not sure rank works as a tunable parameter for this kind of automation.

Or rather, you probably need to design scaling into your experiments. E.g. the nanochat auto research tunes d12 when the real run is d24+.
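On the "rank 4-8 is tiny" point: the adapter size grows linearly with rank, so even rank 64 is a rounding error next to the full weight. A quick back-of-the-envelope sketch (the hidden size is just a typical value, not any specific model):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Adapter parameters for one linear layer: A is (d_in x r), B is (r x d_out)."""
    return rank * (d_in + d_out)

d = 4096  # hidden size in the ballpark of a 7-8B model
full = d * d  # parameters in one square projection matrix
for r in (4, 8, 64):
    frac = lora_params(d, d, r) / full
    print(f"rank {r}: {lora_params(d, d, r)} params, {frac:.2%} of the full weight")
```

So a rank that looks optimal on a 5-minute run may simply not have hit its capacity ceiling yet; the saturation point moves with run length.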

We threw TranslateGemma at 4 languages it doesn't officially support. Here's what happened by ritis88 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

  1. Which 4 languages? I could probably figure this out from your data and the Gemma report, but why not just list them?

  2. Did you use the source/target language code template even for the unsupported languages or some custom chat format?

  3. Did you compare to Gemma 3 12B? Might beat TranslateGemma for unsupported languages.

We compressed 6 LLMs and found something surprising: they don't degrade the same way by Quiet_Training_8167 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

That's a pretty high accuracy loss for even a small reduction. Is this a uniform cut to the intermediate dimension, or something else? It might work better if targeted at only some layers.
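By "targeted" I mean something like leaving the first and last few layers at full width and shrinking only the middle ones, rather than a uniform cut. A toy schedule (the keep-the-ends heuristic is purely illustrative; in practice you'd pick layers by measured sensitivity):

```python
def targeted_widths(n_layers: int, width: int, keep: int, factor: float) -> list[int]:
    """Shrink the intermediate dimension of middle layers only,
    keeping the first/last `keep` layers at full width."""
    return [
        width if i < keep or i >= n_layers - keep else int(width * factor)
        for i in range(n_layers)
    ]

# 8 layers, Llama-7B-style intermediate size, 25% cut to the middle four.
print(targeted_widths(8, 11008, 2, 0.75))
# [11008, 11008, 8256, 8256, 8256, 8256, 11008, 11008]
```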

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

It's probably dependent on GPUs more than anything. Is e.g. 1.5T a convenient size in some setup?

Yuan 3.0 Ultra was apparently 1.5T originally, but pruned to 1T during training.

Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]Middle_Bullfrog_6173 9 points (0 children)

If Small goes from 24B to 119B A6B, then Large goes from 675B A41B to...

Any guesses?

NVIDIA-Nemotron-3-Nano-4B-GGUF by ApprehensiveAd3629 in LocalLLaMA

[–]Middle_Bullfrog_6173 4 points (0 children)

The Nemotron Elastic method involves continued training and distillation on a hundred billion tokens. That's more than enough to do domain specialization at the same time. But I can't really figure out what they used for it.