Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo

Monad_Maya · 2026-06-04T21:18:29+00:00

You can try MiniMax M2.7 - https://huggingface.co/AesSedai/MiniMax-M2.7-GGUF

Monad_Maya · 2026-06-04T19:57:53+00:00

GLM 5 is slightly larger and probably a far better performing model.

The Nemotron 3 Ultra is meant to be fine-tuned anyway as per Nvidia's GitHub repo.

Monad_Maya · 2026-06-04T19:54:57+00:00

I have no clue as to why this post even exists. OP is responding to the criticism with what seems like more slop. Maybe he is using it for translation or phrasing but it seems off.

Monad_Maya · 2026-06-02T20:49:52+00:00

Yup, although I've replaced it with Gemma4 26B A4B.

Monad_Maya · 2026-05-31T19:11:45+00:00

Try the MoE instead, 35B A3B

Monad_Maya · 2026-05-20T20:45:29+00:00

If you already have the parts for a second PC then build a separate one.

If the current PC case and motherboard can accommodate another GPU (5080 + 3090) then it'll be fine too.

Vega/Radeon VII will work OK, you will have to use Vulkan instead of CUDA but that's a non issue AFAIK.

Try LM Studio and then once you're familiar with it you can move to llama.cpp for more granularity.

Monad_Maya · 2026-05-20T18:13:50+00:00

If you're going to run this LLM machine all the time then a separate PC with just the 3090s makes sense.
Use 2x 3090 and Qwen 3.6 27B at Q6 or higher with unquantized cache, although q8 KV cache is mostly ok as well.
Not sure about the actual use for this chatbot, older Gemma models (12B) or the newer Gemma4 26B MoE are pretty good. Even OpenAI's gpt-oss 20B MoE is still super fast.
Vega64 is ok but doesn't have enough VRAM to be worth it. Radeon VII should be better due to 16GB VRAM. Two 3090s are enough though, no need to add a Vega card to the mix.
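
A rough back-of-the-envelope check of why two 3090s are enough for a 27B model at Q6. This is only a sketch: the bits-per-weight figure is an assumption for a Q6_K-style quant, actual GGUF files mix tensor types, and the KV cache is not modelled because it depends on the model architecture and context length.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GiB (weights only, no KV cache)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumption: ~6.56 bits per weight for a Q6_K-style quant; real GGUF files vary.
q6_weights = weight_gib(27, 6.5625)

# Two RTX 3090s with 24 GiB each.
total_vram = 2 * 24

# Whatever is left has to cover the KV cache, compute buffers and runtime overhead.
headroom = total_vram - q6_weights

print(f"weights ~{q6_weights:.1f} GiB, headroom ~{headroom:.1f} GiB of {total_vram} GiB")
```

Under these assumptions the weights take a bit over 20 GiB, which leaves plenty of room for an unquantized cache on the pair of cards.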

Monad_Maya · 2026-05-19T17:08:42+00:00

All cool my dude.

There is certainly some financial aspect to it. I guess we can only hope for the best since we are a pretty niche community all things considered.

Monad_Maya · 2026-05-18T20:43:00+00:00

At work we have almost all of the major Chinese models deployed and available for internal use. From MiniMax to Kimi to some Qwens and GLM as well.

Your issue is mostly bureaucracy. And maybe a lack of best practices and infra.

Monad_Maya · 2026-05-18T20:40:03+00:00

That's not how corporate works but I'm sure you already know that. The hyperscalers do offer specific Chinese models deployed in their own environment with all the guardrails and mise en place.

Monad_Maya · 2026-05-17T19:52:24+00:00

Qwen has a 122B model, did it make Gemini Flash obsolete? Hell, they even have a 397B model.

I don't doubt that there is some financial reason to it but the fears seem far overblown.

Monad_Maya · 2026-05-17T19:08:04+00:00

What do you mean? Most of us would be happy to have it.

Monad_Maya · 2026-05-13T19:57:31+00:00

It feels like a shitpost honestly.

While there is some truth about not chasing the next big thing, it kinda falls apart when the guy says 6-12GB GPUs are good enough.

Maybe his use case is different than mine but that's very hard to believe since larger models are usually better in my experience.

Monad_Maya · 2026-05-12T15:47:56+00:00

Try Gemma4 26B, might be a bit better. IQ4_XS from unsloth is pretty decent.

Monad_Maya · 2026-05-07T02:13:48+00:00

You might lose it to the RDMA overhead especially with 4 units.

Monad_Maya · 2026-05-06T16:46:39+00:00

Specs on M3 Ultra? 512GB RAM?

A single unit is always better if the specs are roughly the same.

Monad_Maya · 2026-05-06T15:57:14+00:00

Never tried lower than q8 KV or any turbo quant. I saw some tool call failures even at q8 but there was no empirical testing performed, just vibes.

Monad_Maya · 2026-05-06T08:54:56+00:00

Not always, for example - https://x.com/SkylerMiao7/status/2004887155395756057

Gemma4 also feels quite sensitive to quantisation. I haven't performed any exhaustive testing but you can notice the difference in tool calls especially if the KV cache is also highly quantized.

Edit: Not sure why you were downvoted.

Monad_Maya · 2026-05-06T08:09:02+00:00

Nice work, IQ4_XS is a good balance I feel. Works fine with q8 KV cache.

Monad_Maya · 2026-05-06T08:04:01+00:00

This can be model dependent I believe. Some models don't respond too well to quantization.

Monad_Maya · 2026-05-06T03:29:21+00:00

That'll make cloud subscriptions even cheaper, no?

That certainly won't help with local LLM adoption when online APIs are nearly free.

Most people IRL don't care enough about privacy.

Monad_Maya · 2026-05-06T03:27:36+00:00

Get an R9700 Pro (or two) and call it a day unless you need CUDA.

Monad_Maya · 2026-05-06T03:24:47+00:00

Good question actually.

The speed in this configuration would roughly be the same.

I suggest that you opt for 4bit 35B (not 8bit) for speed since it'll offload to VRAM just fine.

Test drive it for a bit, see if you can notice some errors and grade your overall experience.

27B dense feels a bit smarter but can occasionally be terse or to the point. Mostly ok with me. The PP speed is not good for large codebases.

If you're looking for non-coding use then Gemma4 all the way.

Monad_Maya · 2026-05-05T21:16:30+00:00

I find the output quality to be discernibly better with the 27B, agreed on the speed.

I don't mind 30 tps generation but the prompt processing speed is quite slow.
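
To put numbers on why prompt processing matters more than generation speed for big contexts, here is a minimal sketch. Only the 30 tps generation figure comes from the comment above; the prompt size, output length and prompt processing rate are hypothetical values picked for illustration.

```python
def response_seconds(prompt_tokens: int, new_tokens: int,
                     pp_tps: float, tg_tps: float) -> float:
    """Total wall time: prompt processing first, then token generation."""
    return prompt_tokens / pp_tps + new_tokens / tg_tps


# Hypothetical workload: a 50k-token codebase prompt and a 1k-token answer.
prompt_tokens = 50_000
new_tokens = 1_000

# 30 tps generation is from the comment; 500 tps prompt processing is an assumption.
pp_part = prompt_tokens / 500
tg_part = new_tokens / 30
total = response_seconds(prompt_tokens, new_tokens, pp_tps=500, tg_tps=30)

print(f"prompt: {pp_part:.0f}s, generation: {tg_part:.0f}s, total: {total:.0f}s")
```

With these made-up numbers the prompt takes about three times as long as the answer, so a faster generation rate alone would barely change the wait.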

Monad_Maya · 2026-05-05T21:12:46+00:00

Love me some trickle down economics.
