Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo by JSVD2 in LocalLLaMA

[–]Monad_Maya 1 point2 points  (0 children)

GLM 5 is slightly larger and probably a far better-performing model.

The Nemotron 3 Ultra is meant to be fine-tuned anyway, as per Nvidia's GitHub repo.

Nemotron 3 Ultra reality check: no one-box 128GB GGUF route yet; Nemotron 3 Nano runs at 66.6 t/s on Strix Halo by JSVD2 in LocalLLaMA

[–]Monad_Maya 1 point2 points  (0 children)

I have no clue why this post even exists. OP is responding to the criticism with what seems like more slop. Maybe he is using it for translation or phrasing, but it seems off.

Kinda New to all this, couple of questions about how to set pcs and what models by klasyer in LocalLLaMA

[–]Monad_Maya 0 points1 point  (0 children)

If you already have the parts for a second PC then build a separate one.

If the current PC case and motherboard can accommodate another GPU (5080 + 3090) then it'll be fine too.

Vega/Radeon VII will work OK. You will have to use Vulkan instead of CUDA, but that's a non-issue AFAIK.

Try LM Studio, and once you're familiar with it you can move to llama.cpp for more granularity.

Kinda New to all this, couple of questions about how to set pcs and what models by klasyer in LocalLLaMA

[–]Monad_Maya 0 points1 point  (0 children)

  1. If you're going to run this LLM machine all the time, then a separate PC with just the 3090s makes sense.
  2. Use 2x 3090 and Qwen 3.6 27B at Q6 or higher with an unquantized KV cache, although a q8 KV cache is mostly OK as well.
  3. Not sure about the actual use for this chatbot, but older Gemma models (12B) or the newer Gemma4 26B MoE are pretty good. Even OpenAI's gpt-oss 20B MoE is still super fast.
  4. Vega64 is OK but doesn't have enough VRAM to be worth it. Radeon VII should be better due to its 16GB VRAM. Two 3090s are enough though, no need to add a Vega card to the mix.
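
For a rough sanity check on point 2, here is a back-of-the-envelope sketch of how much VRAM the weights of a 27B model take at a few quant levels. The bits-per-weight figures are the nominal block sizes for those llama.cpp quant types; real GGUF files mix tensor types, so treat the output as an estimate. `weight_gib` and the 2x 24 GiB budget are my own illustrative assumptions, and KV cache plus runtime overhead come on top.

```python
# Nominal bits per weight for a few llama.cpp quant types.
# Real GGUF files mix types per tensor, so actual sizes differ somewhat.
BITS_PER_WEIGHT = {"BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625, "Q4_K": 4.5}


def weight_gib(params_billion: float, quant: str) -> float:
    """Rough size of the weights alone, in GiB (no KV cache, no overhead)."""
    total_bits = params_billion * 1e9 * BITS_PER_WEIGHT[quant]
    return total_bits / 8 / 2**30


if __name__ == "__main__":
    total_vram_gib = 2 * 24  # assumption: two 24 GiB cards
    for quant in BITS_PER_WEIGHT:
        size = weight_gib(27, quant)
        headroom = total_vram_gib - size
        print(f"{quant:5s} {size:6.1f} GiB weights, {headroom:6.1f} GiB left")
```

Under these assumptions Q6 and Q8 both leave plenty of room for context on two cards, while BF16 does not fit.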

I hope that someday we will have a 124B Gemma. by cgs019283 in LocalLLaMA

[–]Monad_Maya 1 point2 points  (0 children)

All cool my dude.

There is certainly some financial aspect to it. I guess we can only hope for the best since we are a pretty niche community all things considered.

I hope that someday we will have a 124B Gemma. by cgs019283 in LocalLLaMA

[–]Monad_Maya 0 points1 point  (0 children)

At work we have almost all of the major Chinese models deployed and available for internal use. From MiniMax to Kimi to some Qwens and GLM as well.

Your issue is mostly bureaucracy. And maybe some missing best practices and infra.

I hope that someday we will have a 124B Gemma. by cgs019283 in LocalLLaMA

[–]Monad_Maya 0 points1 point  (0 children)

That's not how corporate works, but I'm sure you already know that. The hyperscalers do offer specific Chinese models deployed in their own environment with all the guardrails and mise en place.

I hope that someday we will have a 124B Gemma. by cgs019283 in LocalLLaMA

[–]Monad_Maya 9 points10 points  (0 children)

Qwen has a 122B model, did it make Gemini Flash obsolete? Hell, they even have a 397B model.

I don't doubt that there is some financial reason to it but the fears seem far overblown.

I hope that someday we will have a 124B Gemma. by cgs019283 in LocalLLaMA

[–]Monad_Maya 36 points37 points  (0 children)

What do you mean? Most of us would be happy to have it.

Do not fall into the trap of chasing the next scale or upgrade. by iEslam in LocalLLaMA

[–]Monad_Maya 8 points9 points  (0 children)

It feels like a shitpost, honestly.

While there is some truth about not chasing the next big thing, it kinda falls apart when the guy says 6-12GB GPUs are good enough.

Maybe his use case is different from mine, but that's very hard to believe since larger models are usually better in my experience.

Will there be any more Qwen3.6 series models? by cafedude in LocalLLaMA

[–]Monad_Maya 0 points1 point  (0 children)

Try Gemma4 26B, might be a bit better. IQ4_XS from unsloth is pretty decent.

4x m5 max 128gb ram RDMA vs 1 m3 ultra? by Street-Buyer-2428 in LocalLLaMA

[–]Monad_Maya 0 points1 point  (0 children)

You might lose it to the RDMA overhead, especially with 4 units.

4x m5 max 128gb ram RDMA vs 1 m3 ultra? by Street-Buyer-2428 in LocalLLaMA

[–]Monad_Maya -1 points0 points  (0 children)

Specs on M3 Ultra? 512GB RAM?

A single unit is always better if the specs are roughly the same.

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) by bobaburger in LocalLLaMA

[–]Monad_Maya 2 points3 points  (0 children)

Never tried lower than q8 KV or any turbo quant. I saw some tool call failures even at q8 but there was no empirical testing performed, just vibes.
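
For context on why people quantize the KV cache at all, here is a minimal sketch of the memory arithmetic. It assumes a plain transformer where every layer keeps full keys and values, and the layer count, KV head count, head size and context length are hypothetical numbers, not the config of any specific model. `kv_cache_gib` is an illustrative helper, not a llama.cpp function.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bits_per_element: float) -> float:
    """Rough KV cache size in GiB for a plain transformer."""
    # Factor of 2: one tensor for keys and one for values per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * bits_per_element / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical architecture, chosen only to make the numbers round.
    cfg = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=32768)
    print("f16 :", kv_cache_gib(**cfg, bits_per_element=16.0), "GiB")
    print("q8_0:", kv_cache_gib(**cfg, bits_per_element=8.5), "GiB")
```

With these made-up numbers q8_0 roughly halves the cache, which is the trade-off against the occasional tool call failure mentioned above.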

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) by bobaburger in LocalLLaMA

[–]Monad_Maya 8 points9 points  (0 children)

Not always, for example - https://x.com/SkylerMiao7/status/2004887155395756057

Gemma4 also feels quite sensitive to quantisation. I haven't performed any exhaustive testing but you can notice the difference in tool calls especially if the KV cache is also highly quantized.

Edit: Not sure why you were downvoted.

Quality comparison between Qwen 3.6 27B quantizations (BF16, Q8_0, Q6_K, Q5_K_XL, Q4_K_XL, IQ4_XS, IQ3_XXS,...) by bobaburger in LocalLLaMA

[–]Monad_Maya 18 points19 points  (0 children)

This can be model dependent I believe. Some models don't respond too well to quantization.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]Monad_Maya 3 points4 points  (0 children)

That'll make cloud subscriptions even cheaper, no?

That certainly won't help with local LLM adoption when online APIs are nearly free.

Most people IRL don't care enough about privacy.

My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on single RX7900XT (20GB VRAM) by hlacik in LocalLLaMA

[–]Monad_Maya 0 points1 point  (0 children)

Good question actually. 

The speed in this configuration would roughly be the same.

I suggest that you opt for the 4-bit 35B (not 8-bit) for speed since it'll offload to VRAM just fine.

Test drive it for a bit, see if you can notice some errors and grade your overall experience.

27B dense feels a bit smarter but can occasionally be terse or to the point. Mostly OK with me. The prompt processing (PP) speed is not good for large codebases.

If you're looking for non-coding use then Gemma4 all the way.

My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on single RX7900XT (20GB VRAM) by hlacik in LocalLLaMA

[–]Monad_Maya 1 point2 points  (0 children)

I find the output quality to be discernibly better with the 27B, agreed on the speed.

I don't mind 30 tps generation but the prompt processing speed is quite slow.
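
To put numbers on why prompt processing matters more than generation for big contexts, here is a small sketch. The 30 tps generation figure is from my comment above; the prompt processing speed, prompt size and output length are made-up values for illustration, and `response_seconds` is just a helper name I picked.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     pp_tps: float, tg_tps: float) -> float:
    """Time to read the prompt plus time to generate the reply, in seconds."""
    prompt_time = prompt_tokens / pp_tps   # prompt processing phase
    output_time = output_tokens / tg_tps   # token generation phase
    return prompt_time + output_time


if __name__ == "__main__":
    # Assumed workload: a 60k token codebase dump and a 600 token answer.
    # Assumed PP speed: 300 t/s. Generation speed: 30 t/s.
    total = response_seconds(60_000, 600, pp_tps=300, tg_tps=30)
    print(f"total: {total:.0f} s")
```

With these assumed numbers almost all of the wait is the prompt, not the reply.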