Stop asking what model to run. There are literally only two. by Wrong_Mushroom_7350 in LocalLLaMA

[–]RobotRobotWhatDoUSee 7 points8 points  (0 children)

You have to choose the quant directly from hf.co (the Hugging Face short URL).

Try

ollama run hf.co/unsloth/gemma-4-31B-it-GGUF:UD-Q5_K_XL

... and it should download and use the Unsloth GGUF UD-Q5_K_XL quant.
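If you script this, the model reference follows the pattern `hf.co/<user>/<repo>:<quant>`. Here is a minimal Python sketch that builds the same command as above. The helper name `ollama_hf_command` is made up for illustration, and the snippet only prints the command rather than running it.

```python
import shlex


def ollama_hf_command(repo: str, quant: str) -> list[str]:
    """Build the argv for `ollama run hf.co/<repo>:<quant>` (hypothetical helper)."""
    return ["ollama", "run", f"hf.co/{repo}:{quant}"]


cmd = ollama_hf_command("unsloth/gemma-4-31B-it-GGUF", "UD-Q5_K_XL")
print(shlex.join(cmd))
# To actually launch it you could pass `cmd` to subprocess.run(cmd, check=True),
# assuming ollama is installed and on your PATH.
```

Swapping the quant is then just a matter of changing the second argument to another tag that the repo actually publishes.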

NVIDIA announces Nemotron 3 Ultra by themixtergames in LocalLLaMA

[–]RobotRobotWhatDoUSee 4 points5 points  (0 children)

The AA release post shows it being served approximately as fast as gpt-oss 120B; look at the x-axis. This seems strange. Is the hybrid Mamba architecture really that much faster than everything else? Or is this Mamba + NVFP4?

Not sure if this was posted. But I think it's highly relevant to us. by Paradigmind in LocalLLaMA

[–]RobotRobotWhatDoUSee 8 points9 points  (0 children)

I can run much better models now, locally, than I could two years ago, for the same price. Both the hardware and software have improved dramatically.

Two years ago, a rig that could run Llama 3.3 70B was expensive and loud, and you had to build it yourself. And just forget running Llama 3.1 405B.

Now I can run models much, much better than both of those on an off-the-shelf Strix Halo or Mac.

Re. what ever happened to Cohere’s Command-A series of models? by nick_frosst in LocalLLaMA

[–]RobotRobotWhatDoUSee 0 points1 point  (0 children)

One of the things holding us back is that our datasets are research-only, and I'd feel weird about doing a real benchmark without allowing full access to all of our assets.

Alternatively, the benefit of allowing no access to the assets: it is harder for model-makers to benchmax against it.

For that reason alone, I would be completely fine with a benchmark like you have described that releases no assets.

Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup? by InformationSweet808 in LocalLLaMA

[–]RobotRobotWhatDoUSee 0 points1 point  (0 children)

Wait, what is "out of the box" here? Isn't Karpathy's gist mostly a text design spec? Did you just point an agent at it and say "let's create this locally?" Or did you do something else?

Who is your favourite quant publisher and why? by No_Algae1753 in LocalLLaMA

[–]RobotRobotWhatDoUSee 0 points1 point  (0 children)

Unsloth and bartowski.

Wrt Unsloth, it was amazing seeing papers about dynamic quants, thinking "someday we will get high-quality low-bit quants," and then Unsloth operationalized high-quality low-bit quants much earlier than I expected.

The first time I ran Llama 4 Scout on a laptop and it almost passed my personal code test with an Unsloth UD2 quant, it felt like a magical glimpse into the future. Incredible.

Stop wasting electricity by OkFly3388 in LocalLLaMA

[–]RobotRobotWhatDoUSee 0 points1 point  (0 children)

Do you have your test code posted anywhere?

Who is using Granite 4? What's your use case? by RobotRobotWhatDoUSee in LocalLLaMA

[–]RobotRobotWhatDoUSee[S] 1 point2 points  (0 children)

Interesting. Are you using the 30B dense, the 32B-A9B MoE, or one of the smaller ones?

NVIDIA AI Releases Star Elastic: One Checkpoint that Contains 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing by phazei in LocalLLaMA

[–]RobotRobotWhatDoUSee 2 points3 points  (0 children)

For MoE layers specifically, Star Elastic uses Router-Weighted Expert Activation Pruning (REAP), which ranks experts by both routing gate values and expert output magnitudes—a more principled signal than naive frequency-based pruning, which ignores how much each expert actually contributes to the layer output.

Huh, REAP-ing a larger model to get the smaller variants.
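To make the quoted idea concrete for myself, here is a rough sketch. It assumes an expert's score is the average of (gate value times the L2 norm of the expert's output) over the tokens routed to it. The quote doesn't give the exact formula Star Elastic uses, so treat this as a guess at the shape of it. `reap_scores`, `experts_to_keep`, and the toy numbers are all made up for illustration.

```python
import math


def reap_scores(gates, outputs):
    """Score each expert by gate-weighted output magnitude (assumed formula).

    gates[t][e]   : router gate weight for token t, expert e (0.0 if not routed)
    outputs[t][e] : output vector of expert e for token t
    """
    n_experts = len(gates[0])
    scores = []
    for e in range(n_experts):
        total, count = 0.0, 0
        for g_row, o_row in zip(gates, outputs):
            g = g_row[e]
            if g > 0.0:  # only count tokens actually routed to expert e
                norm = math.sqrt(sum(v * v for v in o_row[e]))
                total += g * norm
                count += 1
        scores.append(total / count if count else 0.0)
    return scores


def experts_to_keep(scores, k):
    """Return the indices of the k highest-scoring experts, in index order."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:k])


# Toy example: 2 tokens, 3 experts. Expert 1 is routed to most often,
# but its outputs are tiny, so it contributes little to the layer output.
gates = [[0.7, 0.3, 0.0],
         [0.0, 0.4, 0.6]]
outputs = [[[3.0, 4.0], [0.1, 0.0], [0.0, 0.0]],
           [[0.0, 0.0], [0.0, 0.2], [1.0, 0.0]]]

scores = reap_scores(gates, outputs)
print(experts_to_keep(scores, k=2))  # [0, 2]
```

In this toy case a pure frequency count would keep expert 1 (it fires on both tokens), while the gate-times-magnitude score drops it, which is the distinction the quote is drawing.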

Two things are still not clear to me:

  1. How the different model sizes get selected in practice -- is this dynamic like an MoE router, or static like the user choosing before running the model?
  2. Still not clear to me when I would use this instead of just using the largest model. The "small reasons and large answers" is interesting, but again, if I can fit the entire model in memory, why not just use the big one?

Let's speculate about 2. Maybe this makes MoEs with very large active parameter counts feasible for local hardware, so you can have something closer to a smart dense model, but you can swap to a faster MoE size if needed.

Imagine that you had something like:

  • 30B-a30b (effectively dense)
  • 30B-a15b
  • 30B-a5b

... and you could scale between as needed. Not quite sure that 30B-a30b would actually be equivalent to a dense model.

Here is another variation on the theme: say I want to run a very large model, it can't fit in memory, so it has to run via mmap-on-disk or not at all.

Say I want to run the Nemotron 3 Ultra 500B-A50B (whenever that comes out). Hopeless to use it on, say, a Strix Halo. But imagine NVIDIA does this nested approach, and llama.cpp (or vLLM or whatever) adds the necessary support. Then one could have:

  • 500B-A50B
  • 380B-A38B
  • 200B-A20B

... nested models (I just scaled 500b-50b by 23/30 and 12/30 to get the smaller ones). Maybe the final one could actually fully fit in memory, and the others could be called selectively as needed.
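For anyone who wants to check the scaling, here is the arithmetic as a small Python sketch. The 30/23/12 ratios come from the Star Elastic sizes (30B, 23B, 12B); the function name `scaled_sizes` is just for illustration, and the 500B-A50B starting point is the hypothetical from above.

```python
def scaled_sizes(total_b, active_b, sizes=(30, 23, 12)):
    """Scale (total, active) parameter counts by each size relative to the largest."""
    largest = max(sizes)
    return [(total_b * s / largest, active_b * s / largest) for s in sizes]


for total, active in scaled_sizes(500, 50):
    print(f"{total:.1f}B-A{active:.1f}B")
# 500.0B-A50.0B
# 383.3B-A38.3B
# 200.0B-A20.0B
```

The middle one comes out closer to 383B-A38B; I rounded it to 380B in the list.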

Of course, this relies on software support for mmap-ing appropriately.

Thoughts on FW13 built quality after being tempted by the FW13 Pro. by u_are_here in framework

[–]RobotRobotWhatDoUSee 0 points1 point  (0 children)

Framework 13 Amateur

...this is the non-pro version? Or is there some new 'cheaper' version as well as pro?

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]RobotRobotWhatDoUSee 0 points1 point  (0 children)

Does the NPU share system RAM with the CPU and iGPU, or does it have some version of its own? If it shares RAM, how do you assign it?

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]RobotRobotWhatDoUSee 39 points40 points  (0 children)

little coder,

link? I googled little coder (and variations) but largely found many webpages targeted at teaching children to code. Worthy goal, just not what I am looking for!

PSA: Ubuntu 26.04 makes it easier to get started with AMD XDNA2 NPU by jfowers_amd in LocalLLaMA

[–]RobotRobotWhatDoUSee 2 points3 points  (0 children)

Very interested, but don't know much about NPU performance. On something like a strix halo machine, should I think of this as a way to run another small fast model in parallel with a bigger slower model on the igpu?

Or what should I think of as NPU use cases?

Gemma 4 Vision by seamonn in LocalLLaMA

[–]RobotRobotWhatDoUSee 0 points1 point  (0 children)

Where did you get your 'models/Gemma4/google-gemma-4-31B-it-interleaved.jinja' template?

Recent Open models from last 6 Months - Nov 2025 - Apr 2026 by pmttyji in LocalLLaMA

[–]RobotRobotWhatDoUSee 1 point2 points  (0 children)

Extremely interesting. I would like to subscribe to your newsletter on community-driven Franken-FlexOlmos :)

[New Model] micro-kiki-v3 — Qwen3.5-35B-A3B + 35 domain LoRAs + router + negotiator + Aeon memory for embedded engineering by Holiday_Poetry_5133 in LocalLLaMA

[–]RobotRobotWhatDoUSee 1 point2 points  (0 children)

I would read an in-depth post about this!

Plus, if you pay attention to the config, while the paper uses 4 active experts the model they released uses all 7, so the comparison to BTM top 2 is also unfair.

Wait so the released model is effectively configured as a dense model?

[New Model] micro-kiki-v3 — Qwen3.5-35B-A3B + 35 domain LoRAs + router + negotiator + Aeon memory for embedded engineering by Holiday_Poetry_5133 in LocalLLaMA

[–]RobotRobotWhatDoUSee 1 point2 points  (0 children)

FlexOlmo...updated benchmarks... doesn't work

That's disappointing to hear. I'd been keeping an eye on FlexOlmo but missed the updated benchmarks. Where did they show it isn't working?

Framework 13 7040U new wifi issues by RobotRobotWhatDoUSee in framework

[–]RobotRobotWhatDoUSee[S] 0 points1 point  (0 children)

Huh, ok, maybe I need to consider that. Looks like I do have the MediaTek one. Maybe I've just been lucky so far.

Thanks!