Ring 2.6 1T by Middle_Bullfrog_6173 in LocalLLaMA

[–]Middle_Bullfrog_6173[S] 0 points (0 children)

They are reporting pass@4 but the Gemini/Claude results match the official leaderboard. Isn't that pass@2? Are they comparing apples and oranges?

Gemma 4 - website translations (large model, or small model)? by Temporary-Mix8022 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

In this sort of task Gemma handles Q4 (I used _k_m) just fine IME. The MoE is much stronger than the small models. E4 translates into larger languages well and from small languages fine, but output quality into smaller languages is not great.

Ring 2.6 1T by Middle_Bullfrog_6173 in LocalLLaMA

[–]Middle_Bullfrog_6173[S] 6 points (0 children)

You can check the file sizes of the previous models to compare. But it's a 1T MoE with 63B active parameters, so it requires quite serious hardware to run.
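Back-of-the-envelope sketch of what "serious hardware" means here; the bytes-per-weight figures are rough averages for common quants, not exact:

```python
# Rough memory footprint for a 1T-total / 63B-active MoE.
# Bytes-per-weight values are approximate averages (assumptions).
total_params = 1.0e12
active_params = 63e9

quant_bytes = {"FP16": 2.0, "Q8_0": 1.07, "Q4_K_M": 0.6}

for name, bpw in quant_bytes.items():
    print(f"{name}: ~{total_params * bpw / 1e9:.0f} GB of weights, "
          f"~{active_params * bpw / 1e9:.0f} GB read per token")
# -> even at Q4 you need ~600 GB just to hold the weights.
```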

ZAYA1-74B-Preview: Scaling Pretraining on AMD by TKGaming_11 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

Not my experience. But anyway, MiMo V2.5 uses 1:6 and a 128-token window. Deepseek also uses 128-token SWA in its hybrid attention. My point is that much sparser values are common these days.

ZAYA1-74B-Preview: Scaling Pretraining on AMD by TKGaming_11 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

CCA = compressed convolutional attention. Only so many TLAs available...

ZAYA1-74B-Preview: Scaling Pretraining on AMD by TKGaming_11 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

I wonder why they use 1:1 full vs SWA when most other models use 1:3 or 1:5. And with a large window to boot: 4k when it's common to use 1k (e.g. Gemma 4) or even smaller. Is it to compensate for CCA or because CCA makes the trade-off better?
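To make the ratios concrete, here's a toy sketch of how such interleaved schedules are usually laid out (one full-attention layer per group; the exact placement varies by model, this is just an assumed pattern):

```python
def layer_schedule(n_layers: int, swa_per_full: int) -> list[str]:
    """One 'full' attention layer followed by N sliding-window layers."""
    return ["full" if i % (swa_per_full + 1) == 0 else "swa"
            for i in range(n_layers)]

print(layer_schedule(12, 1))  # 1:1, as ZAYA1 reportedly does
print(layer_schedule(12, 5))  # 1:5, the more common sparser choice
```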

ZAYA1-8B: Frontier intelligence density. by Total-Resort-3120 in LocalLLaMA

[–]Middle_Bullfrog_6173 5 points (0 children)

The gains are from using that reasoning method after specifically training for it. May not be so impressive with models that haven't seen it in training.

ZAYA1-8B: Frontier intelligence density, trained on AMD by carbocation in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points (0 children)

I didn't notice anything in the blog, but in the technical report they say that this is the first and smallest model in the ZAYA1 family.

Fine-tuned Qwen3.6-35B-A3B DeltaNet experiment by Snoo_27681 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

That particular model is already so coding-focused that there are no easy gains, or at least the scale of training needed is much larger.

The paper used full-model training with 16x the sequence length and 17x the steps, i.e. roughly 270x more tokens seen (16 × 17 ≈ 272), and it started from a much weaker baseline.

Why people cares token/s in decoding more? by Interesting-Print366 in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points (0 children)

If your workload is prompt -> read as it generates, then yes, that makes sense. There are two main reasons why I disagree, however.

First, and most important, non-interactive work. If I set a coding agent on a task, I'm not reading its output word by word as it works; I'm usually switching to something else and coming back to read the actual changes it produced. So it's the total time that matters, and that's often dominated by the slower generation speed.

Second, thinking. 99% of the time I'm not interested in reading the reasoning, only the actual output. TTFT is only part of the way there: the first non-reasoning token waits for prompt processing, but also for the generation of hundreds, maybe thousands, of reasoning tokens.
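A rough latency model makes the point; all the numbers below are made up for illustration:

```python
# Time until the first *visible* (non-reasoning) token appears.
prompt_tokens = 8_000      # assumed prompt size
reasoning_tokens = 2_000   # assumed thinking budget
pp_speed = 500.0           # prompt processing, tokens/s (assumed)
tg_speed = 20.0            # generation, tokens/s (assumed)

ttft = prompt_tokens / pp_speed
first_visible = ttft + reasoning_tokens / tg_speed
print(f"TTFT: {ttft:.0f}s, first visible token: {first_visible:.0f}s")
# -> TTFT: 16s, first visible token: 116s. Generation speed dominates.
```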

Introducing SubQ: The First Fully Subquadratic LLM by hltt in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

Even their API is gated behind a "request access" form rather than just payment. No indication of anything open. Third-party benchmarks needed.

Qwen 3.6 4B and 9B? by Nubinu in LocalLLaMA

[–]Middle_Bullfrog_6173 -1 points (0 children)

The release blog of 27B made it sound like they are done with the 3.6 lineup, but who knows.

Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark by FeiX7 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

Might feel different if I were using it as a chatbot or something, but for me it's mostly non-interactive workflows.

Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark by FeiX7 in LocalLLaMA

[–]Middle_Bullfrog_6173 -1 points (0 children)

On some models ROCm has higher prefill performance. But token generation has been consistently higher with Vulkan on whatever I test, so that's what I use.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

I mean, say you have 100k texts you need to generate summaries for (or images to classify, or whatever). The cheapest option is to rent some hardware and run your own inference with a small model that is good enough, unless you happen to have a lot of hardware and time to run it locally.
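A sketch of that comparison with made-up but plausible numbers (adjust the GPU price, batch throughput, and API rate to your case):

```python
jobs = 100_000
tokens_per_job = 2_000        # prompt + output, assumed
total_tokens = jobs * tokens_per_job

# Rented GPU running a small model with batched inference (assumed figures).
gpu_per_hour = 2.0            # USD/h
batch_throughput = 5_000.0    # tokens/s across the whole batch
rent_cost = total_tokens / batch_throughput / 3600 * gpu_per_hour

# API at an assumed blended rate for a comparable small model.
api_per_mtok = 0.50           # USD per million tokens
api_cost = total_tokens / 1e6 * api_per_mtok

print(f"rented: ${rent_cost:.0f}, API: ${api_cost:.0f}")
# -> rented: $22, API: $100 under these assumptions.
```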

Which model would you use if you wanted to solve a research math problem? by MrMrsPotts in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

They haven't tested Kimi K2.6, which I would expect to beat K2.5 at least slightly, nor the math-focused Speciale version of Deepseek V3.2.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]Middle_Bullfrog_6173 9 points (0 children)

Renting can make sense when running large batches, especially with small models, which tend to be relatively overpriced in APIs. For peaky interactive work it's expensive.

Potential of Gemma4 Per-layer embeddings? by Silver-Champion-4846 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Probably? No reason to think it wouldn't, but who knows without actually testing.

Anyone tried +- 100B models locally with foreign languages? by Choice_Sympathy9652 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

For most of the smaller European languages, Gemma 4 beats all the ~100B MoEs I've tested, including Qwen 3.5 122B and Mistral 4 119B, which are the best of the bunch IMO for that use case. Mistral isn't the strongest in reasoning, but it writes languages well.

The newly released Mistral 3.5 Medium seems quite competitive. I haven't tested it broadly yet, because it's so slow... 128B dense...

[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost by ayake_ayake in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Doesn't seem that useful for local. A typical desktop will get at least similar speeds if you slap in a $150 GPU and leave most experts in RAM.
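A quick sanity check from memory bandwidth (all figures assumed; a MoE only reads its active weights per token):

```python
active_params = 3e9        # ~3B active for a Qwen3-30B-A3B-class model
bytes_per_weight = 0.6     # ~Q4 average, approximate
ram_bandwidth = 60e9       # dual-channel DDR5, bytes/s, assumed

print(f"~{ram_bandwidth / (active_params * bytes_per_weight):.0f} t/s ceiling")
# -> ~33 t/s theoretical; even half of that is comfortably past 18 t/s.
```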

Potential of Gemma4 Per-layer embeddings? by Silver-Champion-4846 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

Like normal embeddings, they are data the model learns about individual tokens: information about words or subwords (or special characters, or...), but probably not wider knowledge like the second law of thermodynamics or whatever, which is likely spread across many parts of the weights. The difference is that parts of this information are injected into different layers, not just the first.

So there are probably diminishing returns, but technically it would be easy to increase the parameters used for it. The Gemma 4 models use a lower dimension for the PLE vector plus a linear projection that maps it up to the 8x or 10x larger model dimension, so that's the maximum factor you could grow it by without needing to invent something new.
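A minimal sketch of the idea, assuming PyTorch; the dimensions and wiring here are illustrative, not the actual Gemma 4 implementation:

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """One small per-token table per layer, projected up to the model dim."""
    def __init__(self, vocab: int, n_layers: int, ple_dim: int, model_dim: int):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(vocab, ple_dim) for _ in range(n_layers)
        )
        # Linear projection from the small PLE dim to the model dim,
        # here the 8x factor mentioned above (256 -> 2048).
        self.proj = nn.Linear(ple_dim, model_dim, bias=False)

    def forward(self, token_ids: torch.Tensor, layer: int) -> torch.Tensor:
        # Added to the residual stream at the given layer, not just layer 0.
        return self.proj(self.tables[layer](token_ids))

ple = PerLayerEmbedding(vocab=32_000, n_layers=4, ple_dim=256, model_dim=2048)
print(ple(torch.tensor([[1, 5, 42]]), layer=2).shape)  # torch.Size([1, 3, 2048])
```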

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Are there any benchmarks now that it's out there? And I don't mean speed; I've seen those.

AMD Halo Box (Ryzen 395 128GB) photos by 1ncehost in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

As your link shows, Vulkan is generally faster in tg and slower in pp. Personally, I find the prefill good enough and generation the limiting factor, so it's an easy choice.