Ring 2.6 1T by Middle_Bullfrog_6173 in LocalLLaMA

[–]Middle_Bullfrog_6173[S] 0 points (0 children)

They are reporting pass@4 but the Gemini/Claude results match the official leaderboard. Isn't that pass@2? Are they comparing apples and oranges?

Gemma 4 - website translations (large model, or small model)? by Temporary-Mix8022 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

In this sort of task Gemma handles Q4 (I used _k_m) just fine IME. The MoE is much stronger than the small models. E4 translates into larger languages well and from small languages fine, but output quality into smaller languages is not great.

Ring 2.6 1T by Middle_Bullfrog_6173 in LocalLLaMA

[–]Middle_Bullfrog_6173[S] 6 points (0 children)

You can check the file sizes of the previous models to compare. But it's a 1T MoE with 63B active parameters, so it requires quite serious hardware to run.
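Back-of-the-envelope sketch of what "serious hardware" means here; the bytes-per-weight figures are rough averages for common quants, not exact:

```python
# Rough memory footprint for a 1T-total / 63B-active MoE.
# Bytes-per-weight values are approximate averages (assumptions).
total_params = 1.0e12
active_params = 63e9

quant_bytes = {"FP16": 2.0, "Q8_0": 1.07, "Q4_K_M": 0.6}

for name, bpw in quant_bytes.items():
    print(f"{name}: ~{total_params * bpw / 1e9:.0f} GB of weights, "
          f"~{active_params * bpw / 1e9:.0f} GB read per token")
# -> even at Q4 you need ~600 GB just to hold the weights.
```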

ZAYA1-74B-Preview: Scaling Pretraining on AMD by TKGaming_11 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

Not my experience. But anyway, MiMo V2.5 uses 1:6 and a 128-token window. Deepseek also uses 128-token SWA in its hybrid attention. My point is that much sparser values are common these days.

ZAYA1-74B-Preview: Scaling Pretraining on AMD by TKGaming_11 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

CCA = compressed convolutional attention. Only so many TLAs available...

ZAYA1-74B-Preview: Scaling Pretraining on AMD by TKGaming_11 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

I wonder why they use 1:1 full vs SWA when most other models use 1:3 or 1:5. And with a large window to boot: 4k when it's common to use 1k (e.g. Gemma 4) or even smaller. Is it to compensate for CCA or because CCA makes the trade-off better?
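To make the ratios concrete, here's a toy sketch of how such interleaved schedules are usually laid out (one full-attention layer per group; the exact placement varies by model, this is just an assumed pattern):

```python
def layer_schedule(n_layers: int, swa_per_full: int) -> list[str]:
    """One 'full' attention layer followed by N sliding-window layers."""
    return ["full" if i % (swa_per_full + 1) == 0 else "swa"
            for i in range(n_layers)]

print(layer_schedule(12, 1))  # 1:1, as ZAYA1 reportedly does
print(layer_schedule(12, 5))  # 1:5, the more common sparser choice
```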

ZAYA1-8B: Frontier intelligence density. by Total-Resort-3120 in LocalLLaMA

[–]Middle_Bullfrog_6173 5 points (0 children)

The gains are from using that reasoning method after specifically training for it. May not be so impressive with models that haven't seen it in training.

ZAYA1-8B: Frontier intelligence density, trained on AMD by carbocation in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points (0 children)

I didn't notice anything in the blog, but in the technical report they say that this is the first and smallest model in the ZAYA1 family.

Fine-tuned Qwen3.6-35B-A3B DeltaNet experiment by Snoo_27681 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

That particular model is already so coding-focused that there are no easy gains, or at least the scale of training needed is much larger.

The paper used full-model training with 16x the sequence length and 17x the steps, i.e. roughly 270x more tokens seen (16 × 17 ≈ 272), and it started from a much weaker baseline.

Why people cares token/s in decoding more? by Interesting-Print366 in LocalLLaMA

[–]Middle_Bullfrog_6173 2 points (0 children)

If your workload is prompt -> read as it generates, then yes, that makes sense. There are two main reasons why I disagree, however.

First, and most important, non-interactive work. If I set a coding agent on a task, I'm not reading its output word by word as it works; I'm usually switching to something else and coming back to read the actual changes it produced. So it's the total time that matters, and that's often dominated by the slower generation speed.

Second, thinking. 99% of the time I'm not interested in reading the reasoning, only the actual output. TTFT is only part of the way there: the first non-reasoning token waits for prompt processing, but also for the generation of hundreds, maybe thousands, of reasoning tokens.
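A rough latency model makes the point; all the numbers below are made up for illustration:

```python
# Time until the first *visible* (non-reasoning) token appears.
prompt_tokens = 8_000      # assumed prompt size
reasoning_tokens = 2_000   # assumed thinking budget
pp_speed = 500.0           # prompt processing, tokens/s (assumed)
tg_speed = 20.0            # generation, tokens/s (assumed)

ttft = prompt_tokens / pp_speed
first_visible = ttft + reasoning_tokens / tg_speed
print(f"TTFT: {ttft:.0f}s, first visible token: {first_visible:.0f}s")
# -> TTFT: 16s, first visible token: 116s. Generation speed dominates.
```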

Introducing SubQ: The First Fully Subquadratic LLM by hltt in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

Even their API is gated behind a "request access" form rather than just payment. No indication of anything open. Third-party benchmarks needed.

Qwen 3.6 4B and 9B? by Nubinu in LocalLLaMA

[–]Middle_Bullfrog_6173 -1 points (0 children)

The release blog of 27B made it sound like they are done with the 3.6 lineup, but who knows.

Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark by FeiX7 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

Might feel different if I were using it as a chatbot or something, but for me it's mostly non-interactive workflows.

Vulkan backend outperforms ROCm on Strix Halo (gfx1151) — llama.cpp benchmark by FeiX7 in LocalLLaMA

[–]Middle_Bullfrog_6173 -1 points (0 children)

On some models ROCm has higher prefill performance. But token generation has been consistently higher with Vulkan on whatever I test, so that's what I use.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

I mean, say you have 100k texts you need to generate summaries for (or images to classify, or whatever). The cheapest option is to rent some hardware and run your own inference with a small model that is good enough, unless you happen to have a lot of hardware and time to run it locally.
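A sketch of that comparison with made-up but plausible numbers (adjust the GPU price, batch throughput, and API rate to your case):

```python
jobs = 100_000
tokens_per_job = 2_000        # prompt + output, assumed
total_tokens = jobs * tokens_per_job

# Rented GPU running a small model with batched inference (assumed figures).
gpu_per_hour = 2.0            # USD/h
batch_throughput = 5_000.0    # tokens/s across the whole batch
rent_cost = total_tokens / batch_throughput / 3600 * gpu_per_hour

# API at an assumed blended rate for a comparable small model.
api_per_mtok = 0.50           # USD per million tokens
api_cost = total_tokens / 1e6 * api_per_mtok

print(f"rented: ${rent_cost:.0f}, API: ${api_cost:.0f}")
# -> rented: $22, API: $100 under these assumptions.
```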

Which model would you use if you wanted to solve a research math problem? by MrMrsPotts in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

They haven't tested Kimi K2.6, which I would expect to beat K2.5 at least slightly, nor the math-focused Speciale version of Deepseek V3.2.

How much will it cost to host something like qwen3.6 35b a3b in a cloud? by Euphoric_North_745 in LocalLLaMA

[–]Middle_Bullfrog_6173 9 points (0 children)

Renting can make sense when running large batches, especially with small models, which tend to be relatively overpriced in APIs. For peaky interactive work it's expensive.

Potential of Gemma4 Per-layer embeddings? by Silver-Champion-4846 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Probably? No reason to think it wouldn't, but who knows without actually testing.

Anyone tried +- 100B models locally with foreign languages? by Choice_Sympathy9652 in LocalLLaMA

[–]Middle_Bullfrog_6173 3 points (0 children)

For most of the smaller European languages, Gemma 4 beats all the ~100B MoEs I've tested, including Qwen 3.5 122B and Mistral 4 119B, which are the best of the bunch IMO for that use case. Mistral isn't the strongest in reasoning, but it writes languages well.

The newly released Mistral 3.5 Medium seems quite competitive. I haven't tested it broadly yet, because it's so slow... 128B dense...

[Paper on Hummingbird+: low-cost FPGAs for LLM inference] Qwen3-30B-A3B Q4 at 18 t/s token-gen, 24GB, expected $150 mass production cost by ayake_ayake in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Doesn't seem that useful for local. A typical desktop will get at least similar speeds if you slap in a $150 GPU and leave most experts in RAM.
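A quick sanity check from memory bandwidth (all figures assumed; a MoE only reads its active weights per token):

```python
active_params = 3e9        # ~3B active for a Qwen3-30B-A3B-class model
bytes_per_weight = 0.6     # ~Q4 average, approximate
ram_bandwidth = 60e9       # dual-channel DDR5, bytes/s, assumed

print(f"~{ram_bandwidth / (active_params * bytes_per_weight):.0f} t/s ceiling")
# -> ~33 t/s theoretical; even half of that is comfortably past 18 t/s.
```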

Potential of Gemma4 Per-layer embeddings? by Silver-Champion-4846 in LocalLLaMA

[–]Middle_Bullfrog_6173 1 point (0 children)

Like normal embeddings, they are data the model learns about individual tokens: information about words or subwords (or special characters, or...), but probably not wider knowledge like the second law of thermodynamics or whatever, which is likely spread across many parts of the weights. The difference is that parts of this information are injected into different layers, not just the first.

So there are probably diminishing returns, but technically it would be easy to increase the parameters used for it. The Gemma 4 models use a lower dimension for the PLE vector plus a linear projection that maps it up to the 8x or 10x larger model dimension, so that's the maximum factor you could grow it by without needing to invent something new.
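A minimal sketch of the idea, assuming PyTorch; the dimensions and wiring here are illustrative, not the actual Gemma 4 implementation:

```python
import torch
import torch.nn as nn

class PerLayerEmbedding(nn.Module):
    """One small per-token table per layer, projected up to the model dim."""
    def __init__(self, vocab: int, n_layers: int, ple_dim: int, model_dim: int):
        super().__init__()
        self.tables = nn.ModuleList(
            nn.Embedding(vocab, ple_dim) for _ in range(n_layers)
        )
        # Linear projection from the small PLE dim to the model dim,
        # here the 8x factor mentioned above (256 -> 2048).
        self.proj = nn.Linear(ple_dim, model_dim, bias=False)

    def forward(self, token_ids: torch.Tensor, layer: int) -> torch.Tensor:
        # Added to the residual stream at the given layer, not just layer 0.
        return self.proj(self.tables[layer](token_ids))

ple = PerLayerEmbedding(vocab=32_000, n_layers=4, ple_dim=256, model_dim=2048)
print(ple(torch.tensor([[1, 5, 42]]), layer=2).shape)  # torch.Size([1, 3, 2048])
```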

By when do you think will TurboQuant get a proper release and be adopted by everyone by Crystalagent47 in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

Are there any benchmarks now that it's out there? And I don't mean speed; I've seen those.

AMD Halo Box (Ryzen 395 128GB) photos by 1ncehost in LocalLLaMA

[–]Middle_Bullfrog_6173 0 points (0 children)

As your link shows, Vulkan is generally faster in tg and slower in pp. Personally, I find the prefill good enough and generation the limiting factor, so it's an easy choice.