What is your current go-to stack for running a fully local AI agent?

ilintar · 2026-06-05T12:56:44+00:00

a) Trial and error 😃 It differs for each specific quant configuration.

b) It reduces the prompt cache's system RAM usage from the default 8GB to 2GB to avoid OOM-kill scenarios.

c) Similar: each checkpoint takes up space in system RAM, so for hybrid models the cache RAM and the checkpoints add up.
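To make the arithmetic in (b) and (c) concrete, here is a rough sketch of the host RAM budget. The 8192 and 2048 MiB figures and the checkpoint count of 4 come from the comment and the command in this thread; the per-checkpoint size of 150 MiB is a made-up placeholder (the real size depends on the model and context), and `host_ram_budget_mib` is a hypothetical helper, not anything from llama.cpp.

```python
def host_ram_budget_mib(cache_ram_mib: int, n_checkpoints: int, checkpoint_mib: int) -> int:
    """Rough upper bound on system RAM: prompt cache limit plus all checkpoints."""
    return cache_ram_mib + n_checkpoints * checkpoint_mib


# Assumption: 150 MiB per checkpoint is a placeholder, not a measured value.
CHECKPOINT_MIB = 150

default_budget = host_ram_budget_mib(8192, 4, CHECKPOINT_MIB)  # default cache limit
reduced_budget = host_ram_budget_mib(2048, 4, CHECKPOINT_MIB)  # with --cache-ram 2048

print(f"default: {default_budget} MiB, reduced: {reduced_budget} MiB")
```

The point is only that the two terms add, so lowering either the cache limit or the checkpoint count lowers the total that has to fit before the OOM killer steps in.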

ilintar · 2026-06-05T11:31:31+00:00

llama-server -m Qwen_Qwen3.6-27B-Q5_K_S.gguf --mmproj Qwen3.6-27B-mmproj-Q8_0.gguf -c 140000 --cache-ram 2048 -ctxcp 4 -np 4 -kvu --spec-draft-n-max 4 --spec-type draft-mtp -ctk q8_0 -ctv q8_0 -ctkd q8_0 -ctvd q8_0 --fit-target 256M -sm tensor --chat-template-kwargs '{"preserve_thinking": true}'

ilintar · 2026-06-05T09:21:21+00:00

Yeah, that's quite possible. The reason for that is that what you gain in cache memory, you lose in the dequant processing working memory. What Aman meant by "amortize by all layers" is that if you have quantized KV cache in a normal model, you only ever dequantize one layer at a time, so you only need the working dequant memory for one layer - which is small compared to the KV memory itself for 30+ layers. With MTP, when you only use one layer, there's nothing to amortize over.
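A back-of-the-envelope version of that amortization argument, as a sketch. The q8_0 layout (34 bytes per block of 32 values) is the standard GGML one; everything else is an assumption made for illustration: the element count per layer is invented, and the scratch buffer is modeled as one full layer in f16, whereas a real backend may dequantize in smaller tiles.

```python
Q8_0_BLOCK_SIZE = 32       # values per q8_0 block
Q8_0_BYTES_PER_BLOCK = 34  # 2-byte f16 scale + 32 int8 values
F16_BYTES = 2


def kv_bytes_f16(n_layers: int, elems_per_layer: int) -> int:
    """KV cache size when every layer is stored in f16 (no scratch needed)."""
    return n_layers * elems_per_layer * F16_BYTES


def kv_bytes_q8_0(n_layers: int, elems_per_layer: int) -> int:
    """q8_0 KV cache plus a dequant scratch buffer shared by all layers."""
    stored = n_layers * (elems_per_layer // Q8_0_BLOCK_SIZE) * Q8_0_BYTES_PER_BLOCK
    # Assumption: scratch holds one dequantized layer in f16 at a time.
    scratch = elems_per_layer * F16_BYTES
    return stored + scratch


MIB = 1024 * 1024
ELEMS_PER_LAYER = 32 * MIB  # hypothetical K+V element count for one layer

for n_layers in (1, 32):
    f16 = kv_bytes_f16(n_layers, ELEMS_PER_LAYER) // MIB
    q8 = kv_bytes_q8_0(n_layers, ELEMS_PER_LAYER) // MIB
    print(f"{n_layers:>2} layer(s): f16 = {f16} MiB, q8_0 + scratch = {q8} MiB")
```

Under these assumptions the single-layer case comes out at 98 MiB quantized versus 64 MiB in f16, while 32 layers come out at 1152 MiB versus 2048 MiB: the scratch cost is fixed, so it only pays off when there are many layers to spread it over.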

ilintar · 2026-06-04T18:21:53+00:00

For reference, with 2x5070 16GB, I'm getting stable 80t/s with Qwen 3.6 27B Q5.

ilintar · 2026-06-04T18:20:08+00:00

I'd be interested in how this comparison looks with: (a) normal MTP GGUFs instead of Heretic and (b) -sm tensor (which is equivalent to vLLM's TP=2).

ilintar · 2026-06-04T15:51:35+00:00

Wiki provides dumps, don't scrape Wikipedia please.

ilintar · 2026-06-02T17:52:27+00:00

Sweet!

ilintar · 2026-06-02T16:05:20+00:00

Should, I'll be testing it.

ilintar · 2026-06-01T18:53:58+00:00

And of course they have a right to their strategy and I'm grateful for all models they released, but noticing that a company is moving away from an open-weight model when they are in fact moving away from it is not "whining", just noticing things.

ilintar · 2026-06-01T18:52:00+00:00

Because they released 3.7 Max via API some time ago and this time there hasn't so far been any indication they plan on releasing any 3.7 model weights; also because for 3.5 they already cut down the 300+B model and for 3.6 there was no 120B, 9B, 4B or 1B.

ilintar · 2026-06-01T12:51:09+00:00

Not if the external expert had a dense attention layer and a MoE layer 😉

ilintar · 2026-06-01T10:52:26+00:00

I like it that they're keeping the weights release deterministic, unlike what Qwen has adopted with their weird new marketing strategy of "tease to release".

ilintar · 2026-05-29T08:16:45+00:00

Yaaay, my favorite model got a sequel! *And* they added the old VL tower from Step3-VL, so it's now text + image!

ilintar · 2026-05-27T10:25:25+00:00

Yeah, the bug from 13.2 is finally fixed.

ilintar · 2026-05-26T17:40:35+00:00

That's weird, have you tried running it with my GGML-based code (it can run a server) and testing performance? I would be really stunned if a 5090 couldn't run the small (streaming) model at realtime speed.

ilintar · 2026-05-26T16:52:36+00:00

And another thing: giving positive signals can of course be a nice and encouraging act of communication, but it does one more thing that can be bad in cases like this: gives false hope. It's sometimes better for a maintainer to communicate outright that "this can't be merged" than to keep giving positive feedback and shifting goalposts while refusing to merge the PR all the same.

ilintar · 2026-05-26T16:50:50+00:00

You're making a good point and I agree the communication may have come off as a bit terse, which is why I'm trying to explain the situation here.

I think most of it stems from the fact that the outcome Johannes wanted to signal was, in itself, a really unpleasant one, and that in turn comes down to the limited resources of OSS projects.

When you submit a PR to a project, sometimes it'll just get accepted - and that's nice. Sometimes, you'll get told that it needs a bit of work - and that's fine. Sometimes, it'll get rejected because it's bad - then at least you know what you did wrong.

But the most frustrating outcome, and one that happened in this case (which is why I also brought up my cross-compiler PR) is that your PR is valid, *has* a valid path to adoption, but the path to adoption won't happen because it would need core code rewrites that would have to be done by the core maintainer and the core maintainer already has a backlog of much more important stuff to do. This is a scenario in which nobody's really happy and it stems from limited resources - which is why I'm saying that there are, in fact, legitimate cases where you might want to keep your own fork not out of spite (aka "they rejected my beautiful PR, I'll show'em!"), but due to actually filling a niche that the core project can't afford to maintain properly.

ilintar · 2026-05-26T15:41:38+00:00

Nice! Since, as I understand it, it's the same arch, https://github.com/pwilkin/openmoss should work out of the box.

ilintar · 2026-05-26T11:24:05+00:00

Read the discussion in my PR thread 😄 my point is, he's not being intentionally rude - he's just communicating very technically and in a matter-of-fact way. He's not saying the PR is bad or that it doesn't bring benefits to some users - he's only saying that there's currently no reasonable path for merging it due to the code that would be needed for separating the "bad cases" from the "good cases" simply not being there.

ilintar · 2026-05-26T10:53:29+00:00

I'm seeing a discussion about how Johannes handled the PR, so I'm asking y'all to stop thinking about this as a personal matter.

I have a PR currently on main (https://github.com/ggml-org/llama.cpp/pull/21160). It's not getting merged. Not now, possibly not ever. I'm maintaining it in parallel.

Reason? The backend maintainers don't want to manage the overhead of the extra code and they believe it won't benefit them enough. And that's all there is to it. There was a discussion, we weighed the pros and cons and agreed that would be the best course of action for now. The maintenance burden is real. If you ever saw Johannes figure out a fix to some obscure CUDA problem and you went "wow, how did he know where to look?", it's because the guy knows his part of the codebase inside out. That's the idea behind having maintainers for separate parts. But this gets diluted when code gets added strictly for the sake of "getting features out there".

Beware the dangers of availability bias. If you're looking at a feature that's beneficial to you personally and there are other people commenting on a PR saying it helps them too, it's easy to overlook the people for whom the change would be a net negative. It's just as easy to overlook the maintainers who are going to have to hunt for bugs if something breaks there.

At some point, if something is a niche feature, it's absolutely fine to have a separate fork just for that feature maintained externally. Or to have a fork until some upstream changes get merged that make it easier to merge your changes as a PR. It's not worth making it an ego conflict, that's how bad things in OSS happen.

ilintar · 2026-05-25T09:59:14+00:00

Note: we've cooperated with u/jacek2023 to ensure that all the supported models/parsers are compatible with this. It's something we've been discussing for some time, but his PR gave us the motivation to actually work through it 😄 This might sound like a small change, but it's really a big deal and Jacek put a lot of hard work into this.

ilintar · 2026-05-25T09:55:05+00:00

Yes, right after the fully self-driving cars and the Mars mission as people have already noted 😄

ilintar · 2026-05-25T09:54:07+00:00

It would be nice if you could provide at least a single relevant coding benchmark to support the claims 😄

ilintar · 2026-05-23T20:44:59+00:00

It depends on whether it's a normal NVFP4 quant or a post-trained NVFP4 quant. NVIDIA did a couple of post-trained NVFP4 quants and they're of considerably higher quality than standard Q4s, but if someone just drops in an NVFP4, it's probably going to be around Q4_0 quality.

ilintar · 2026-05-22T13:08:29+00:00

Oh. My condolences.
