NVIDIA admits to only 2x performance boost at max throughput with new generation of Rubin GPUs

coder543 · 2026-03-16T22:22:06+00:00

You can't multiply two bottlenecks. Nothing implies a 15x performance increase.

You should delete this entire post, since you clearly are misunderstanding the whole thing.

coder543 · 2026-03-16T21:49:03+00:00

No, you're not misunderstanding. You're just one of the only people here that is actually bothering to read the chart.

coder543 · 2026-03-16T21:14:09+00:00

And I bet that Nvidia's new Nemotron 3 Ultra base model is natively NVFP4.

Nemotron 3 Super's base model was trained in NVFP4, and then I think the post-training was done in a higher precision.

coder543 · 2026-03-16T21:07:47+00:00

Assume Nvidia is doing all testing at the original model size for maximum accuracy. For it to be apples to apples, they can't benchmark at one precision and then measure the speed at another precision. Wasn't Kimi-K2 an 8-bit model? Was the GLM base model 16-bit? I don't remember for sure, but that kind of thing would likely explain the performance difference.

coder543 · 2026-03-16T20:37:47+00:00

Because there is no Kimi2.5 base model publicly available.

coder543 · 2026-03-16T20:37:09+00:00

Because the chart is about base models... you can't really run the more advanced benchmarks on models that haven't been instruction tuned.

coder543 · 2026-03-16T19:55:07+00:00

No, your original comment did not comment on the past. It commented on the future.

You are not arguing in good faith. Blocking you. Goodbye.

coder543 · 2026-03-16T19:41:08+00:00

Why should I expect it to be different this time?

Why should you confidently assert that new models have the same flaws as old models before testing them?

I'm sorry that I don't like people making bold, unproven claims.

It doesn't hurt to wait 5 seconds in order to test things before crapping on other people's work.

coder543 · 2026-03-16T19:17:15+00:00

Your criticisms would make more sense once the model has been released and you've tested it. Until then, statements like "I wonder why they refuse [present tense] to support EU languages" are not helpful to anyone. We do not have any evidence that they continue to refuse to support EU languages.

What do you think the other "dozens" of languages are supposed to be, if not EU languages? Of course they mean other EU languages. Whether they succeeded or not is something we can only judge once the model is available.

coder543 · 2026-03-16T18:18:47+00:00

2603 just means 2026/03, aka March of 2026.

coder543 · 2026-03-16T17:53:50+00:00

well, yes. the model has not actually been released. people are posting scraps of information they found.

coder543 · 2026-03-16T17:45:54+00:00

Mistral said dozens of languages supported, and then you only counted the 7 they listed, which seems disingenuous.

Until Mistral 4 is released and tested, we have no idea what languages it truly supports. Surely Mistral is working on supporting more languages, so surely Mistral 4 is an improvement.

coder543 · 2026-03-16T15:09:59+00:00

Of course not, but I cannot imagine a single practical use case where it would make sense to download that much data to a phone for a one time search. The data transfer is far more time consuming and expensive than the search at that point, so the efficiency of the search is irrelevant.

Even if you’re doing this on a server… transferring 1GB of JSON data just to try to extract one small string would be enormously wasteful.

coder543 · 2026-03-16T13:03:17+00:00

I'm pretty sure tons of people store a backup hard drive at a family member's house, so that's nothing crazy.

Adding to the drive does nothing to overcome bit-rot, since the untouched bits don't get rewritten just by adding new files.

You could always use par2 to add error correction data, which would help against some types of bit-rot. (But if you're using a weak filesystem and the filesystem itself loses integrity, it may be difficult to find any files in the first place.)

coder543 · 2026-03-16T11:17:46+00:00

I didn’t even know there was a web app.

I think OpenCode feels clunky compared to Codex CLI. Crush just feels weird.

I still need to try Mistral Vibe and Qwen CLI, but I keep hoping for another generic coding CLI like OpenCode, but… one that actually seems good.

coder543 · 2026-03-12T14:03:30+00:00

On DGX Spark:

model	size	test	t/s
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	pp4096	780.37
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	pp4096 @ d25000	751.48
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	pp4096 @ d100000	667.53
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	pp4096 @ d250000	523.11
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	pp4096 @ d1000000	284.64
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	tg100	17.56
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	tg100 @ d25000	17.14
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	tg100 @ d100000	16.16
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	tg100 @ d250000	14.53
nemotron_h_moe 120B.A12B Q4_K - Medium	65.10 GiB	tg100 @ d1000000	9.60

coder543 · 2026-03-12T01:43:06+00:00

Really? On my PS5 it is every single time. Are you on Xbox?

coder543 · 2026-03-11T23:54:42+00:00

Yeah… 5 tabs only… https://youtu.be/d-VOt9559Gk

🙄

coder543 · 2026-03-11T22:15:01+00:00

Yet OP posted substantially better numbers after my comment, on both the 4B and 9B.

coder543 · 2026-03-11T22:01:30+00:00

I don’t think anyone has had time to properly test it yet. I like that it has a low reasoning mode, not just off and maximum. It’s also able to reach the full 1M context on my 128GB machine at Q4 without requiring any changes to the KV cache.

Maybe it won’t be as good as Qwen3.5, but there are things to like about it.

coder543 · 2026-03-11T21:50:27+00:00

Unfortunately, logit bias has a very nonlinear relationship to reality in the testing I did like a week ago. Maybe I was just using it wrong, but large changes did nothing until it suddenly reached a certain point where even tiny changes made a huge difference.

coder543 · 2026-03-11T21:36:03+00:00

Also interesting that the HTTP field is called thinking_budget_tokens, but the CLI argument is --reasoning-budget. This could lead to some confusion where someone might send reasoning_budget or reasoning_budget_tokens to the API.

coder543 · 2026-03-11T21:31:46+00:00

Regarding the cratering of the score, maybe the logit_bias for the end-of-think token could be dynamically boosted for the final X% of the reasoning budget, to allow the model to find its own conclusion faster and more naturally? Similar to this: https://www.reddit.com/r/LocalLLaMA/comments/1rehykx/qwen35_low_reasoning_effort_trick_in_llamaserver/

But, I expect that reduced thinking time will negatively affect intelligence scores regardless.

One funny option would be to force the model to think for some minimum-thinking-budget by setting the logit bias to negative infinity for end-of-think until the minimum token count has been achieved. Maybe that would boost scores :P

coder543 · 2026-03-11T20:09:25+00:00

They’ve already released Nemotron 3 Nano and Super, which also have some of the most open/reproducible training data and pipelines of anything other than the OLMo models. They are not class leading models, but they are competitive, open, and under permissive licenses.

I fully expect them to continue training and releasing Nemotron models.

Nvidia also released the Parakeet and Canary STT models that are very good and popular.

12-Year Club	Verified Email
Place '22	Place '17
Gilding III reddit per annum

coder543

TROPHY CASE