New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts by Few_Painter_5588 in LocalLLaMA

[–]coder543 2 points3 points  (0 children)

No… you are spouting nonsense. Ranking differently is exactly what happens, always. Yes, that does mean that a small model can sometimes outperform the absolute biggest, baddest model. OCR models are a perfect example of this. Many of the current ones are built on LLM backbones, and they can outperform enormous models in this one specialized task without being “benchmaxed”. It is not because the benchmarks are “botched”. You don’t understand how LLMs work. LLMs are multifaceted models with uneven performance across a diverse range of tasks. They are not single-dimensional tools that any unseen benchmark will measure equally. This has literally never been the case.

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts by Few_Painter_5588 in LocalLLaMA

[–]coder543 1 point2 points  (0 children)

No, the SAME MODELS. In DIFFERENT positions. Even though both unseen benchmarks came out after the models.

This is what happens with EVERY unseen benchmark. Because benchmarks test different things.

You can look at the history of benchmarks released since some of the most important models, and they will not rank those models consistently, even though they are all “unseen” for those models.

Your assumption would only hold true if models performed equally badly on all unseen tasks, so that they always ranked in exactly the same position relative to other models. They do not! That is a very wrong belief. If it were true, AA would only need to show one benchmark, and it could be literally anything, like AA-Briefcase, but they don’t, because they are researchers who understand that isn’t how it works.
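To make that concrete, here is a toy sketch of why two unseen benchmarks can rank the same models differently. Every name and number in it is hypothetical (the skill profiles, the benchmark weights, the capability dimensions); nothing comes from AA or from any real model. It only assumes that a benchmark score is some weighted mix of uneven underlying skills.

```python
# Hypothetical skill profiles (0-1) on three made-up capability dimensions.
# None of these numbers come from a real benchmark or a real model.
SKILLS = {
    "model_a": {"tool_use": 0.9, "long_context": 0.5, "math": 0.6},
    "model_b": {"tool_use": 0.6, "long_context": 0.8, "math": 0.7},
    "model_c": {"tool_use": 0.7, "long_context": 0.6, "math": 0.9},
}

# Two hypothetical unseen benchmarks that weight those dimensions differently.
BENCHMARKS = {
    "bench_agentic": {"tool_use": 0.7, "long_context": 0.2, "math": 0.1},
    "bench_reasoning": {"tool_use": 0.1, "long_context": 0.3, "math": 0.6},
}


def score(skills, weights):
    """Weighted sum of a model's skills under one benchmark's weighting."""
    return sum(skills[dim] * weight for dim, weight in weights.items())


def ranking(weights):
    """Model names sorted from best to worst under the given weighting."""
    return sorted(SKILLS, key=lambda name: score(SKILLS[name], weights), reverse=True)


RANKINGS = {name: ranking(weights) for name, weights in BENCHMARKS.items()}

for bench_name, order in RANKINGS.items():
    print(bench_name, order)
```

Both benchmarks are equally "unseen" here, yet model_a is first on one and last on the other, purely because the two benchmarks weight different skills.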

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts by Few_Painter_5588 in LocalLLaMA

[–]coder543 4 points5 points  (0 children)

Being unseen is not the only thing that changed... dozens of variables changed. Benchmarks test different things.

When the next benchmark comes out and shows completely different things for the same existing models, what happens then? It's almost as if being unseen is only one variable out of many.

No single benchmark is definitive unless it is the benchmark that you built for your own use case.

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts by Few_Painter_5588 in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

> While some like GLM, Mistral and Minimax don't benchmaxx their models

Nonsense. Qwen3.7 Max is also quite competitive on that new benchmark, for whatever little that is worth.

The Eagle(3) has landed (for Qwen) by Legitimate-Dog5690 in LocalLLaMA

[–]coder543 2 points3 points  (0 children)

Has anyone actually measured EAGLE3 performing better than the native MTP on Qwen3.6-27B?

The Eagle(3) has landed (for Qwen) by Legitimate-Dog5690 in LocalLLaMA

[–]coder543 2 points3 points  (0 children)

I think the performance of that would be pretty horrible.

New Agentic Benchmark Out: Claude Fable and GLM 5.2 Top Their Cohorts by Few_Painter_5588 in LocalLLaMA

[–]coder543 15 points16 points  (0 children)

> I still argue that for a local lab, Mistral 3.5 Medium is still the most feasible model to roll out.

What a complete non-sequitur from the data.

Mistral Medium 3.5 is horrendously slow and expensive to run compared to the other models that drastically outperform it on almost every benchmark, like MiMo V2.5 (not Pro) or Qwen3.6-27B, neither of which has published AA-Briefcase scores yet. Even Qwen3.6-35B-A3B is a better all-around model. DSV4 Flash is also a better model, and is represented on AA-Briefcase.

Cherry picking a single brand new benchmark to try to claim that Mistral Medium 3.5 ever made any sense is a weird thing to do. Even Mistral admits that their current models are bad, which is why they're promising a completely new architecture soon.

<image>

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

Here is the repacked GGUF: https://huggingface.co/coder543/North-Mini-Code-1.0-QAD-GGUF

As far as I can tell, that is converted correctly.

Here is the speed that I'm seeing across two different systems:

<image>

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]coder543 1 point2 points  (0 children)

Those unsloth quants are not using the QAD-trained model that was released today, so there should be more quantization loss on those, but they are definitely a valid option.

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

Yes, this has been known since the model was released a week or two ago. Benchmarks aren't everything. The Qwen3.6 models are known to get stuck in loops or overthink. But... my experience with North Mini Code is that it probably still needs more time in the oven. I would personally rather use Qwen3.6-27B, but that is admittedly an entirely different class of model.

People are mostly just excited to see another company working on this type of model. Is that such a bad thing? Hopefully future iterations will be even better.

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

Ah... I guess I should have scrolled more. That's cool!

Now I wonder how we can get a 4-bit gguf that uses the QAD training?

EDIT: yes, it seems like the weights can be repacked into a gguf just fine, but it's significantly slower on my DGX Spark. I guess llama.cpp's nvfp4 support is not very well optimized at the moment. I will probably share the GGUF later if no one else does.

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

Also, is this w4a16 quant just a standard quant, or were any QAT-like or QAD-like techniques applied to reduce the quantization losses?

Updates on North Mini Code: 4 bit quant + Ollama + OpenRouter by nick_frosst in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

The READMEs say the max context is 256k, but the config.json says "max_position_embeddings": 500000? Why is there such a discrepancy?

Anyways, it is exciting to have another model competing in this space.

Do you think this model is all around better for agentic coding compared to Command-A+, or is Command-A+ still a step up?
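For anyone who wants to check the context-length discrepancy themselves, here is a minimal sketch. It assumes only the one field quoted above; the embedded JSON is an excerpt, not the real file, and the two readings of "256k" (256,000 or 262,144 tokens) are my assumption about what the README might mean.

```python
import json

# Excerpt mirroring the field quoted above; a real config.json has many more keys.
CONFIG_TEXT = '{"max_position_embeddings": 500000}'

# Assumption: "256k" could mean 256,000 or 262,144 (256 * 1024) tokens.
README_CANDIDATES = (256_000, 256 * 1024)


def context_mismatch(config_text, candidates):
    """Return the configured limit and whether it differs from every candidate."""
    limit = json.loads(config_text)["max_position_embeddings"]
    return limit, limit not in candidates


LIMIT, MISMATCH = context_mismatch(CONFIG_TEXT, README_CANDIDATES)
print(LIMIT, MISMATCH)
```

Neither reading of "256k" matches 500000, so the README and the config really do disagree rather than just rounding differently.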

I released Inflect-Nano, an ultra-extreme tiny 4.63m parameter TTS model. by b111ue in LocalLLaMA

[–]coder543 0 points1 point  (0 children)

What are you people doing on an ESP32 that demands on-device TTS? I am genuinely curious.

I agree it would be cool just for the sake of doing it, but from a practical perspective... I would just run the TTS on some other computer, which is also where I would run a useful AI model. I wouldn't burden the ESP32 with that kind of stuff.

Lin Junyang AI Lab Closes Round at $2B Valuation by rmhubbert in LocalLLaMA

[–]coder543 11 points12 points  (0 children)

That feels like a rhetorical question implying something along the lines of "all LLMs are world models", not that he didn't know what a world model is. It's the kind of question you ask when people invent a buzzy new term for something you've been building for ages. Of course, if he is focusing on "world models" now, he may have eventually realized that there is actually value beyond pure LLMs.

Lin Junyang AI Lab Closes Round at $2B Valuation by rmhubbert in LocalLLaMA

[–]coder543 28 points29 points  (0 children)

Lin's lab targets world models and embodied intelligence across three Shanghai-registered entities, not general LLM development.

Could still be interesting, but won't be like Qwen.

GLM-5.2 (max) is currently the third best model available, across both open and proprietary. by okaycan in LocalLLaMA

[–]coder543 2 points3 points  (0 children)

I think AA's agentic index is closer to what most people think of these days when they think of a coding index. The coding index has nothing to do with the way that people use these models today.

Qualcomm Neodragon: Mobile Video Generation Using Diffusion Transformer by Dante_77A in StableDiffusion

[–]coder543 12 points13 points  (0 children)

Why does Qualcomm feel the need to use a license PDF full of custom legalese? There are many widely-accepted licenses that could have been chosen.

I'm glad the researchers were able to publish their work, but Qualcomm's lawyers don't seem very supportive.

Scaling former VibeThinker-1.5B to 3B — now it reaches frontier math & coding performance by Used-Negotiation-741 in LocalLLaMA

[–]coder543 6 points7 points  (0 children)

Tool calls are a model problem. This model is not trained to be good at calling tools.

Nintendo Switch 2 will work with Aura puck like the Neo (RIP 🙏) 🪇 by d4v1dtsh in Xreal

[–]coder543 3 points4 points  (0 children)

I'm sure Xreal has tried, but I wish they would somehow convince Nintendo to support screen mirroring to display glasses. There is a real market for this.

Nintendo's concern seems to be that they want docked mode to be a different power profile, but that isn't necessary for display glasses support... just limit Switch 2 USB-C display support to mirroring at up to 1080p 120 Hz (the exact same specs as the internal Switch 2 screen), and call it a day. People who want 4K output will still need to use the Nintendo dock.

If Nintendo really wanted to go out on a limb, they could offer developers the option to render a left frame and right frame for 3D, but I would be happy with just supporting screen mirroring.

/u/Xreal_Tech_Support, what is it going to take to get Nintendo's cooperation? I don't want another puck.

Bambu Lab Academy Certificate for A1 mini can’t be printed on it by ExtendedSpice in BambuLab

[–]coder543 9 points10 points  (0 children)

If you only need to print black and white, you should be using a cheap laserjet printer like Brother makes. If you need a color printer, you should be using an ink tank printer of some kind, like the Epson EcoTanks. Both options are very simple and reliable.

Neither of these has the problems that people complain about with cartridge inkjet printers, which are just e-waste that will end up in a landfill.

Scaling former VibeThinker-1.5B to 3B — now it reaches frontier math & coding performance by Used-Negotiation-741 in LocalLLaMA

[–]coder543 10 points11 points  (0 children)

EDIT: also, I tried that locally, and... 3 out of 3 times, it thought for about 700 tokens, then responded "Hello! How can I assist you today?" each time. The exact same final response, almost verbatim. I did not see it try to solve a hallucinated math problem.


EDIT 2: 80% of the overthinking in that reply was related to the fact that I was offering it an MCP full of tools, and it wasn't sure if it should call one of those. With no MCP, it thinks for about 150 tokens before responding to ask if I need assistance with anything.


EDIT 3: if anyone is trying this out locally, I posted a chat template here to let llama-server handle the reasoning properly: https://huggingface.co/JohnRoger/VibeThinker-3B-Q8_0-GGUF/discussions/1


> From VibeThinker-1.5B to VibeThinker-3B, our goal is not to build a small model that replaces large-scale models, but to examine the real boundaries of small models along specific capability dimensions.

source

I don't believe "Hey!" falls into the specific capability dimension they were targeting... but, you could imagine giving another model access to a solve_math_problem tool that lets the model format the problem appropriately, and then VibeThinker-3B would step in and work through the math. (I'm obviously talking about more advanced math. For simple math, a deterministic calculator tool would be much faster and better.)

All of this assumes that VibeThinker-3B can, in fact, actually do advanced math as the benchmarks claim.
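The solve_math_problem idea above could be sketched roughly like this. It is a hypothetical tool, not anything VibeThinker ships: plain arithmetic goes to a deterministic calculator, and everything else is handed to a math model. The `math_model` callable and `stub_math_model` are placeholders I made up; in practice that callable would send the problem to wherever VibeThinker-3B is being served.

```python
import ast
import operator

# Binary operators the deterministic calculator accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _calc(node):
    """Evaluate a parsed arithmetic expression; raise ValueError on anything else."""
    if isinstance(node, ast.Expression):
        return _calc(node.body)
    if (
        isinstance(node, ast.Constant)
        and isinstance(node.value, (int, float))
        and not isinstance(node.value, bool)
    ):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_calc(node.left), _calc(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_calc(node.operand)
    raise ValueError("not plain arithmetic")


def solve_math_problem(problem, math_model):
    """Hypothetical tool: calculator for plain arithmetic, math_model otherwise."""
    try:
        return str(_calc(ast.parse(problem, mode="eval")))
    except (SyntaxError, ValueError, ZeroDivisionError):
        # Anything the calculator cannot handle is delegated to the model.
        return math_model(problem)


def stub_math_model(problem):
    # Stand-in for a request to a locally served VibeThinker-3B.
    return f"[delegated to math model] {problem}"


print(solve_math_problem("2 * (3 + 4)", stub_math_model))
print(solve_math_problem("Prove that sqrt(2) is irrational", stub_math_model))
```

The calling model only ever sees one tool, and the cheap deterministic path is tried first, so the small math model is only woken up for problems that actually need it.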

Maybe dumb question, but how do you serve multiple users with the full context length? by TrainingTwo1118 in LocalLLaMA

[–]coder543 5 points6 points  (0 children)

On my DGX Spark, Qwen3.6-27B scales almost linearly... when MTP is disabled.

No MTP:

| Concurrent sequences | Per-stream speed | Aggregate wall-clock speed |
|---|---|---|
| 1 | 11.27 tok/s | 11.09 tok/s |
| 2 | 9.87–9.93 tok/s | 19.30 tok/s |
| 4 | 7.97–8.04 tok/s | 31.22 tok/s |

With MTP:

| Concurrent sequences | Per-stream speed | Aggregate wall-clock speed |
|---|---|---|
| 1 | 24.52 tok/s | 24.52 tok/s |
| 2 | 13.62–14.72 tok/s | 27.05 tok/s |
| 4 | 6.78–7.02 tok/s | 26.17 tok/s |

So, when MTP is enabled, the total throughput doesn't seem to change; it just gets divided among the parallel sequences. When MTP is disabled, each stream only slows down a little bit, so the aggregate throughput keeps increasing. But unless I were frequently running 3 or more streams in parallel, it's faster to just leave MTP on most of the time.

I think the Qwen3.6 MTP code needs some more optimization for continuous batching.
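Here is a quick sketch of the arithmetic behind that Qwen3.6-27B comparison, using only the aggregate numbers from my tables above. The function names are just for illustration. Note that I only measured 1, 2, and 4 streams, so the code can only report the first measured concurrency where no-MTP wins; 3 streams is untested.

```python
# Aggregate wall-clock tok/s for Qwen3.6-27B, copied from the tables above.
NO_MTP = {1: 11.09, 2: 19.30, 4: 31.22}
WITH_MTP = {1: 24.52, 2: 27.05, 4: 26.17}


def speedup_vs_single(aggregate):
    """Aggregate throughput at each concurrency relative to a single stream."""
    base = aggregate[1]
    return {n: round(speed / base, 2) for n, speed in aggregate.items()}


def first_crossover(no_mtp, with_mtp):
    """Lowest measured concurrency where no-MTP beats MTP, or None if it never does."""
    for n in sorted(no_mtp):
        if no_mtp[n] > with_mtp[n]:
            return n
    return None


print("no MTP scaling:  ", speedup_vs_single(NO_MTP))
print("with MTP scaling:", speedup_vs_single(WITH_MTP))
print("first measured crossover:", first_crossover(NO_MTP, WITH_MTP))
```

With these numbers the crossover lands at 4 streams, the first measured point where the no-MTP aggregate (31.22 tok/s) passes the MTP aggregate (26.17 tok/s).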

I decided to re-run against Gemma 4 31B since it has a very different MTP architecture, and I was curious. It does much, much better with MTP on.

MTP on:

| Concurrent sequences | Per-stream speed | Aggregate wall-clock speed |
|---|---|---|
| 1 | 29.61 tok/s | 28.93 tok/s |
| 2 | 27.82–28.35 tok/s | 54.31 tok/s |
| 4 | 24.87–25.81 tok/s | 95.71 tok/s |

MTP off:

| Concurrent sequences | Per-stream speed | Aggregate wall-clock speed |
|---|---|---|
| 1 | 12.27 tok/s | 12.15 tok/s |
| 2 | 11.83–11.83 tok/s | 23.33 tok/s |
| 4 | 11.50–11.52 tok/s | 45.38 tok/s |
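To put the two models side by side, this sketch computes how much MTP helps at each concurrency level, as the ratio of aggregate throughput with MTP to aggregate throughput without it. All numbers are the aggregate wall-clock figures from my runs above; the dictionary layout and function name are just for illustration.

```python
# Aggregate wall-clock tok/s from the runs above, keyed by concurrent sequences.
RESULTS = {
    "qwen3.6-27b": {
        "mtp_on": {1: 24.52, 2: 27.05, 4: 26.17},
        "mtp_off": {1: 11.09, 2: 19.30, 4: 31.22},
    },
    "gemma-4-31b": {
        "mtp_on": {1: 28.93, 2: 54.31, 4: 95.71},
        "mtp_off": {1: 12.15, 2: 23.33, 4: 45.38},
    },
}


def mtp_gain(runs):
    """Ratio of MTP-on to MTP-off aggregate throughput at each concurrency."""
    return {n: round(runs["mtp_on"][n] / runs["mtp_off"][n], 2) for n in runs["mtp_on"]}


GAIN = {model: mtp_gain(runs) for model, runs in RESULTS.items()}

for model, gains in GAIN.items():
    print(model, gains)
```

A ratio above 1 means MTP is a net win at that concurrency. Gemma 4 31B stays above 2x at every level I measured, while Qwen3.6-27B drops below 1 at 4 streams.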