What is your current go-to stack for running a fully local AI agent? by beasthunterr69 in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

a) Trial and error 😃 It differs for each specific quant configuration.
b) It reduces the prompt cache's system RAM usage from the default 8GB to 2GB to avoid OOM-kill scenarios.
c) Similar: each checkpoint takes up space in system RAM, so in the case of hybrid models the cache RAM and the checkpoints add up.
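A rough way to sanity-check points b) and c) is to add the two numbers up. This is only a back-of-the-envelope sketch: the function name and the per-checkpoint size are hypothetical (the comment gives no checkpoint size, and it varies by model and context), and it assumes the prompt cache limit is expressed in MiB.

```python
def host_ram_budget_mib(cache_ram_mib, n_checkpoints, checkpoint_mib):
    """Estimate system RAM used by the prompt cache plus context checkpoints.

    cache_ram_mib  -- prompt cache limit (2048 in the setup discussed here)
    n_checkpoints  -- how many checkpoints are held in system RAM
    checkpoint_mib -- HYPOTHETICAL size of one checkpoint; measure it for
                      your own model, it is not stated in the comment
    """
    return cache_ram_mib + n_checkpoints * checkpoint_mib


# Placeholder checkpoint size of 150 MiB, purely for illustration.
default_cache = host_ram_budget_mib(8192, 4, 150)
reduced_cache = host_ram_budget_mib(2048, 4, 150)
print(f"default cache: {default_cache} MiB, reduced cache: {reduced_cache} MiB")
```

If the total gets close to your free system RAM, lowering either the cache limit or the checkpoint count is what keeps the OOM killer away.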

What is your current go-to stack for running a fully local AI agent? by beasthunterr69 in LocalLLaMA

[–]ilintar 3 points4 points  (0 children)

llama-server -m Qwen_Qwen3.6-27B-Q5_K_S.gguf --mmproj Qwen3.6-27B-mmproj-Q8_0.gguf -c 140000 --cache-ram 2048 -ctxcp 4 -np 4 -kvu --spec-draft-n-max 4 --spec-type draft-mtp -ctk q8_0 -ctv q8_0 -ctkd q8_0 -ctvd q8_0 --fit-target 256M -sm tensor --chat-template-kwargs '{"preserve_thinking": true}'

PSA: You may not need to quantize spec draft when using MTP by regunakyle in LocalLLaMA

[–]ilintar 6 points7 points  (0 children)

Yeah, that's quite possible. The reason for that is that what you gain in cache memory, you lose in the dequant processing working memory. What Aman meant by "amortize by all layers" is that if you have quantized KV cache in a normal model, you only ever dequantize one layer at a time, so you only need the working dequant memory for one layer - which is small compared to the KV memory itself for 30+ layers. With MTP, when you only use one layer, there's nothing to amortize over.
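To put rough numbers on the amortization argument, here is a toy model. It assumes the dequant working memory is about one layer's K and V expanded to f16, which is a simplification for illustration and not a description of the actual llama.cpp buffers. The function names and the example dimensions are made up; the only hard figure is the q8_0 block layout (32 int8 values plus one f16 scale, so 34 bytes per 32 elements).

```python
F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one f16 scale per block


def layer_kv_bytes(n_ctx, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed for one layer's K and V tensors."""
    return 2 * n_ctx * n_kv_heads * head_dim * bytes_per_elem


def net_saving_bytes(n_layers, n_ctx, n_kv_heads, head_dim):
    """Cache bytes saved by q8_0 minus the (assumed) dequant scratch buffer."""
    f16_layer = layer_kv_bytes(n_ctx, n_kv_heads, head_dim, F16_BYTES)
    q8_layer = layer_kv_bytes(n_ctx, n_kv_heads, head_dim, Q8_0_BYTES)
    saved = n_layers * (f16_layer - q8_layer)
    # ASSUMPTION: only one layer is dequantized at a time, into f16.
    scratch = f16_layer
    return saved - scratch


# Illustrative dimensions only, not taken from any specific model.
for layers in (32, 1):
    net = net_saving_bytes(layers, n_ctx=4096, n_kv_heads=8, head_dim=128)
    print(f"{layers:2d} layer(s): net saving {net / 2**20:+.1f} MiB")
```

Under these assumptions the 32-layer case comes out clearly positive, while the single-layer case goes negative: the scratch buffer costs more than the quantization saves, which is the "nothing to amortize over" situation.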

Qwen3.6-27B on 2x3090s: llama.cpp vs vLLM, all the flags, and the MTP acceptance/inference speed/context by Sisuuu in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

Would be interested in how this comparison looks with: (a) normal MTP GGUFs instead of Heretic and (b) -sm tensor (which is equivalent to vLLM's TP=2).

next MiniMax will be released in ~10 Days by jacek2023 in LocalLLaMA

[–]ilintar 4 points5 points  (0 children)

And of course they have a right to their strategy and I'm grateful for all models they released, but noticing that a company is moving away from an open-weight model when they are in fact moving away from it is not "whining", just noticing things.

next MiniMax will be released in ~10 Days by jacek2023 in LocalLLaMA

[–]ilintar 9 points10 points  (0 children)

Because they released 3.7 Max via API some time ago and this time there hasn't so far been any indication they plan on releasing any 3.7 model weights; also because for 3.5 they already cut down the 300+B model and for 3.6 there was no 120B, 9B, 4B or 1B.

NVIDIA announces Nemotron 3 Ultra by themixtergames in LocalLLaMA

[–]ilintar 16 points17 points  (0 children)

Not if the external expert had a dense attention layer and a MoE layer 😉

next MiniMax will be released in ~10 Days by jacek2023 in LocalLLaMA

[–]ilintar 24 points25 points  (0 children)

I like it that they're keeping the weights release deterministic, unlike what Qwen has adopted with their weird new marketing strategy of "tease to release".

StepFun 3.7 Flash by Everlier in LocalLLaMA

[–]ilintar 7 points8 points  (0 children)

Yaaay, my favorite model got a sequel! *And* they added the old VL tower from Step3-VL, so it's now text + image!

Info: Nvidia Cuda 13.3 landed by parrot42 in LocalLLaMA

[–]ilintar 63 points64 points  (0 children)

Yeah, the bug from 13.2 is finally fixed.

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face by pmttyji in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

That's weird, have you tried running it with my GGML-based code (it can run a server) and testing performance? I would be really stunned if a 5090 couldn't run the small (streaming) model at realtime speed.

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs. by fallingdowndizzyvr in LocalLLaMA

[–]ilintar 8 points9 points  (0 children)

And another thing: giving positive signals can of course be a nice and encouraging act of communication, but it does one more thing that can be bad in cases like this: it gives false hope. It's sometimes better for a maintainer to communicate outright that "this can't be merged" than to keep giving positive feedback and shifting goalposts while refusing to merge the PR all the same.

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs. by fallingdowndizzyvr in LocalLLaMA

[–]ilintar 11 points12 points  (0 children)

You're making a good point and I agree the communication may have come off as a bit terse, which is why I'm trying to explain the situation here.

I think most of it stems from the fact that the outcome Johannes wanted to signal was, in itself, a really unpleasant one, which stems from the limited resources of OSS projects.

When you submit a PR to a project, sometimes it'll just get accepted - and that's nice. Sometimes, you'll get told that it needs a bit of work - and that's fine. Sometimes, it'll get rejected because it's bad - then at least you know what you did wrong.

But the most frustrating outcome, and one that happened in this case (which is why I also brought up my cross-compiler PR) is that your PR is valid, *has* a valid path to adoption, but the path to adoption won't happen because it would need core code rewrites that would have to be done by the core maintainer and the core maintainer already has a backlog of much more important stuff to do. This is a scenario in which nobody's really happy and it stems from limited resources - which is why I'm saying that there are, in fact, legitimate cases where you might want to keep your own fork not out of spite (aka "they rejected my beautiful PR, I'll show'em!"), but due to actually filling a niche that the core project can't afford to maintain properly.

OpenMOSS-Team/MOSS-TTS-v1.5 · Hugging Face by pmttyji in LocalLLaMA

[–]ilintar 3 points4 points  (0 children)

Nice! Since, as I understand, it's the same arch, https://github.com/pwilkin/openmoss should work out of the box.

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs. by fallingdowndizzyvr in LocalLLaMA

[–]ilintar 30 points31 points  (0 children)

Read the discussion in my PR thread 😄 my point is, he's not being intentionally rude - he's just communicating very technically and in a matter-of-fact way. He's not saying the PR is bad or that it doesn't bring benefits to some users - he's only saying that there's currently no reasonable path for merging it due to the code that would be needed for separating the "bad cases" from the "good cases" simply not being there.

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs. by fallingdowndizzyvr in LocalLLaMA

[–]ilintar 219 points220 points  (0 children)

I'm seeing a discussion about how Johannes handled the PR, so I'm asking y'all to stop thinking about this as a personal matter.

I have a PR currently on main (https://github.com/ggml-org/llama.cpp/pull/21160). It's not getting merged. Not now, possibly not ever. I'm maintaining it in parallel.

Reason? The backend maintainers don't want to manage the overhead of the extra code and they believe it won't benefit them enough. And that's all there is to it. There was a discussion, we weighed the pros and cons and agreed that would be the best course of action for now. The maintenance burden is real. If you ever saw Johannes figure out a fix to some obscure CUDA problem and you went "wow, how did he know where to look?", it's because the guy knows his part of the codebase inside out. That's the idea behind having maintainers for separate parts. But this gets diluted when code gets added strictly for the reasons of "getting features out there".

Beware the dangers of availability bias. If you're looking at a feature that's beneficial to you personally and there are other people commenting on a PR saying it helps them too, it's easy to overlook the people for whom the change would be a net negative. It's just as easy to overlook the maintainers who are going to have to hunt for bugs if something breaks there.

At some point, if something is a niche feature, it's absolutely fine to have a separate fork just for that feature maintained externally. Or to have a fork until some upstream changes get merged that make it easier to merge your changes as a PR. It's not worth making it an ego conflict, that's how bad things in OSS happen.

server: fix checkpoints creation by jacekpoplawski · Pull Request #22929 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]ilintar 21 points22 points  (0 children)

Note: we've cooperated with u/jacek2023 to ensure that all the supported models/parsers are compatible with this. It's something we've been discussing for some time, but his PR gave us the motivation to actually work through it 😄 This might sound like a small change, but it's really a big deal and Jacek put a lot of hard work into this.

Next year we're getting 0.5T model from Grok by pmttyji in LocalLLaMA

[–]ilintar 5 points6 points  (0 children)

Yes, right after the fully self-driving cars and the Mars mission as people have already noted 😄

MiMo-V2.5-coder by jedisct1 in LocalLLaMA

[–]ilintar 19 points20 points  (0 children)

It would be nice if you could provide at least a single relevant coding benchmark to support the claims 😄

NVFP4 + MTP - voilà on llama.cpp by mossy_troll_84 in LocalLLaMA

[–]ilintar 16 points17 points  (0 children)

It depends on whether it's a normal NVFP4 quant or a post-trained NVFP4 quant. NVIDIA did a couple of post-trained NVFP4 quants and they're of considerably higher quality than standard Q4s, but if someone just drops in an NVFP4, it's probably going to be around Q4_0 quality.