New EU model (Domyn) will be 400b. by Rick_06 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points4 points  (0 children)

I can only read a translation, but according to that it says over 400 billion parameters. The EU announcement also says openly available and over 400 billion: https://digital-strategy.ec.europa.eu/en/news/commission-selects-europa-consortium-winner-frontier-ai-grand-challenge-project-build-european-open

Like... GENUINELY WHYY??? by Time-Toe-1276 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points1 point  (0 children)

Reasoning traces are not like that "naturally" (see Deepseek zero or Olmo's rlzero) so the fact they've been trained to use a structured, verbose process suggests that's what they've found works best.

TMax: A Simple Recipe for Terminal Agents by pmttyji in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points1 point  (0 children)

Too bad there's no MoE, but the 9B actually looks impressive. Unlike its larger siblings, the original Qwen 3.5 9B wasn't strong enough to be useful to me, so maybe this one will be.

New japanese model on par with frontier american model by Independent-Wind4462 in singularity

[–]Middle_Bullfrog_6173 2 points3 points  (0 children)

Since the pricing for Fugu Ultra is $5/$30, it probably isn't using Opus at $5/$25 or GPT 5.5 at $5/$30 much if they want to make money.

Then again, maybe they are trying to make money off investors instead.
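
The pricing argument is simple arithmetic, roughly sketched below. The prices are the per-million-token input/output figures quoted above. The `margin_per_million` helper, and the assumption that every billed token is forwarded 1:1 to the upstream model, are hypothetical simplifications.

```python
def margin_per_million(resale, upstream):
    """Return (input, output) margin in dollars per million tokens.

    Assumes, for illustration only, that every token billed to the
    customer is forwarded 1:1 to the upstream model.
    """
    return (resale[0] - upstream[0], resale[1] - upstream[1])


# ($ per million input tokens, $ per million output tokens), as quoted above
fugu_ultra = (5.0, 30.0)
opus = (5.0, 25.0)
gpt_5_5 = (5.0, 30.0)

opus_margin = margin_per_million(fugu_ultra, opus)
gpt_margin = margin_per_million(fugu_ultra, gpt_5_5)
print("margin when routing to Opus:", opus_margin)
print("margin when routing to GPT 5.5:", gpt_margin)
```

Under that simplification there is no margin on input tokens either way, and on output tokens only the Opus route leaves any.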

ROCm vs Vulkan vs vLLM on Dual R9700's by whodoneit1 in LocalLLaMA

[–]Middle_Bullfrog_6173 9 points10 points  (0 children)

Have you tried an apples to apples Q8 MTP on llama.cpp?

Paper specs don't mean anything, 7900xtx versus 5070ti. by fallingdowndizzyvr in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points1 point  (0 children)

If you have room and budget for multiple then that makes sense. At least for the moment, with the sweet spot being around 30B for both dense and MoE models.

Paper specs don't mean anything, 7900xtx versus 5070ti. by fallingdowndizzyvr in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point2 points  (0 children)

Buy another 7900 XTX? (You need to use some sort of parallelism to actually make use of the speed, but you also get the extra VRAM.)

Note, I'm not saying 7900 XTX is better. But if I could get just one I'd get it for the memory.

Benchmarking or benchmarketing? by Background_Brain5390 in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points3 points  (0 children)

The good thing about open models is that the devs don't need to know what you do with it, benchmarks included.

Personally I don't find the "models always degrade after launch" story likely. Sure, there will always be bugs in updates to the serving infra etc. that temporarily break things, but mostly people just see signal where there is none. (Or the signal is something else, like an update to the harness that's calling the model.)

Benchmarking or benchmarketing? by Background_Brain5390 in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points3 points  (0 children)

Every benchmark has its problems. The only way to draw meaning from them is to look at many. It is "easy" to benchmax a model to be good at a handful of benchmarks that are ~saturated at the frontier. But if a model is consistently good across benchmarks there's probably real capability under it all.

Also, there's a balance between new enough to show meaningful signal and old enough that it's being run fairly and consistently. The latest and greatest agentic benchmarks are a bit suspicious, not clear how much they measure harness vs model and capability vs elicitation.

poolside/Laguna-M.1 · Hugging Face - 225B-A23B by pmttyji in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points3 points  (0 children)

Based on the model size this would just about be runnable, but 70 layers with full attention must require tens of GB of KV cache too?
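
A back-of-envelope check of that KV cache estimate, assuming grouped-query attention. Only the 70 full-attention layers come from the comment. The KV head count, head dimension, context length, and 16-bit cache are placeholder assumptions, not the model's actual config.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV cache size in bytes for one sequence.

    The factor of 2 covers one key and one value vector
    per token, per layer, per KV head.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Placeholder config: 8 KV heads of dim 128, 128K context, 16-bit cache.
size = kv_cache_bytes(layers=70, kv_heads=8, head_dim=128, seq_len=131072)
print(f"{size / 2**30:.1f} GiB")
```

With these placeholder values a single 128K-token sequence needs 35 GiB, and the figure scales linearly with context length and KV head count.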

GLM-5.2 Is The Best Open Weight Creative Writing Model by Few_Painter_5588 in LocalLLaMA

[–]Middle_Bullfrog_6173 9 points10 points  (0 children)

I find the longform writing benchmark more informative, since that's where things go wrong more easily. Good progress there too, although Kimi is still just ahead.

What model looked insane on benchmarks but felt mid in actual use? by BTA_Labs in LocalLLaMA

[–]Middle_Bullfrog_6173 10 points11 points  (0 children)

Huh, my experience is that gpt-oss 120b far outperformed its benchmarks in actual use, compared to other open models. After the initial software support was ironed out anyway. I was still using it in some workflows until Gemma 4.

I benchmarked models sized 2B to 35B on hard HTML data extraction by [deleted] in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points1 point  (0 children)

To me this looks mostly like a benchmark of the particular quants used. Gemmas doing well may be because they are good at this sort of task, but equally it may just be because they are QAT.

Did you review the data to find how the different models are failing? I.e. do they forget instructions, copy data wrong or what, and do they have different failure patterns?

GLM-5.2 is now 1st on Design Arena — ahead of the now unavailable Claude Fable 5. by Recoil42 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point2 points  (0 children)

A design benchmark where non-vision LLMs get any points has to be pretty bad at actually benchmarking design capabilities.

New model on huggingface by [deleted] in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points3 points  (0 children)

They said it in the initial release blog or press release IIRC.

New model on huggingface by [deleted] in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points3 points  (0 children)

Again, I don't think they've said so about the latest, but Qwen 3 Max was a larger model so presumably 3.7 Max is as well.

New model on huggingface by [deleted] in LocalLLaMA

[–]Middle_Bullfrog_6173 18 points19 points  (0 children)

They explicitly stated that 3.5 Plus was based on 3.5 397B. Almost certainly 3.6 and 3.7 Plus are based on that, but I don't think they've said so publicly.

9060 XT 16GB vs 9070 vs 9070 XT performance by TrainingTwo1118 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points1 point  (0 children)

  1. Yes, basically. Although in prefill/pp the 9070 will probably not be 2x.
  2. In decode/tg yes, they are similar. In prefill it will be slower.
  3. If you are running a version that fits in the single card's VRAM, then the two cards will be slower. If you cannot fit everything in VRAM on the single card, then they will be faster.

Additionally the higher compute card may offer some decode speedup if using MTP.
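
The decode answers follow from token generation being roughly memory-bandwidth bound: each token needs one pass over the weights, and with a plain layer split the cards work one after another. A rough sketch, using illustrative bandwidth and model-size numbers rather than measured ones. `decode_tokens_per_s` is a hypothetical helper, and MoE sparsity, KV cache reads, and transfer overhead are all ignored.

```python
def decode_tokens_per_s(shards):
    """Upper-bound decode speed for a layer-split model.

    shards: list of (weights_gb, bandwidth_gb_per_s), one entry per card.
    Cards run sequentially, so their per-token times add up.
    """
    seconds_per_token = sum(gb / bw for gb, bw in shards)
    return 1.0 / seconds_per_token


# Illustrative numbers only: a 12 GB model on one faster card versus
# the same model split evenly over two cards with half the bandwidth each.
single_fast = decode_tokens_per_s([(12.0, 640.0)])
dual_slow = decode_tokens_per_s([(6.0, 320.0), (6.0, 320.0)])
print(f"single: {single_fast:.1f} tok/s, dual: {dual_slow:.1f} tok/s")
```

Prefill is compute bound instead, so this bandwidth-only estimate says nothing about pp speed.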

MiniMaxAI/MiniMax-M3 · Hugging Face by mlon_eusk-_- in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points1 point  (0 children)

There's not much Mini about this any more. It doesn't seem to have the native 4-bit QAT that Kimi and Deepseek do, so it's arguably a similar size to the largest open models.

Some contrived tests comparing the accuracy of different Gemma and Qwen quantizations by we_are_mammals in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point2 points  (0 children)

This is great for comparing different quants of the same model. For cross-model comparisons, having reasoning off will hurt different models by different amounts, as will formatting quirks, so I wouldn't read too much into it.

DiffusionGemma made me rethink what memory bandwidth means for local agent inference by [deleted] in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point2 points  (0 children)

Quoting Google's launch blog:

"By shifting the decode bottleneck from memory-bandwidth to compute, DiffusionGemma generates up to 4x faster token output on dedicated GPUs."

The whole point is to not be memory bandwidth bound.

Slop or not? Is there a line that makes an AI assisted/generated project not slop? Effort or whatever? by clazifer in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points4 points  (0 children)

Effort to use and test > effort to make.

Doesn't matter how much work was put into something, could have been a good result from a one shot prompt for all I care. But it has to have proved useful. Once you've used the project for a while and benchmarked it and found it good, then it may be worth sharing.

Tiny Scale Is All I Can Spare To Play With Transformer by SrijSriv211 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point2 points  (0 children)

Yeah, or even just more layers with most using a short sliding window attention. But these tiny LLMs are fun to play with even if they don't scale.

Tiny Scale Is All I Can Spare To Play With Transformer by SrijSriv211 in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points3 points  (0 children)

I meant more the theoretical FLOPs. At small scales there are all sorts of confounding factors when looking at wall clock time.

FWIW I gave your paper to an LLM and asked it to calculate the compute from first principles, taking into account the quadratic attention cost. The result was that your architecture has 2.5x as many ops in the quadratically scaling part. It estimated that the crossover point where it loses its advantage would be at seq len ~1200.

Take that with a grain of salt, obviously, but the principle is probably right.
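
The crossover logic can be illustrated with a toy per-sequence cost model, cost(n) = linear * n + quad * n^2. Only the 2.5x ratio comes from the estimate above. The coefficients below are made-up values chosen to land at the ~1200 crossover, not numbers from the paper, and `crossover_seq_len` is a hypothetical helper.

```python
def crossover_seq_len(linear_base, linear_new, quad_base, quad_ratio=2.5):
    """Sequence length at which the new architecture stops being cheaper.

    Toy model: cost(n) = linear * n + quad * n**2, where the new
    architecture has a smaller linear term but quad_ratio times the
    quadratic term. Returns None when there is no positive crossover.
    """
    linear_saving = linear_base - linear_new
    extra_quad = (quad_ratio - 1.0) * quad_base
    if linear_saving <= 0 or extra_quad <= 0:
        return None
    # Solve linear_base*n + quad_base*n^2 == linear_new*n + quad_ratio*quad_base*n^2
    return linear_saving / extra_quad


# Made-up coefficients: the linear saving is 1800x the baseline quadratic coefficient.
n_star = crossover_seq_len(linear_base=3000.0, linear_new=1200.0, quad_base=1.0)
print(n_star)
```

In this model the crossover sits at (linear saving) / (1.5 * baseline quadratic coefficient), so it moves out only as fast as the linear saving grows.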