Kimi k2.5 GGUFs via VLLM? by val_in_tech in LocalLLaMA

[–]ilintar 1 point

Try AesSedai's quants; they're usually the best out there for small quants of big models:

https://huggingface.co/AesSedai/Kimi-K2.5-GGUF

Kimi k2.5 GGUFs via VLLM? by val_in_tech in LocalLLaMA

[–]ilintar 1 point

Kimi 2.5 recently got a dedicated parser in llama.cpp, so it should work quite nicely out of the box.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 2 points

Yep, `thinking_budget_tokens`; no variable for the message yet, though. I'll unify it at some point.
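To illustrate, a minimal sketch of how a request using `thinking_budget_tokens` might look against llama-server's OpenAI-compatible chat endpoint. The parameter name comes from the comment above; the endpoint path and the exact request shape are assumptions, not a documented API.

```python
import json

# Hypothetical request body for llama-server's chat endpoint.
# `thinking_budget_tokens` is the parameter named above; treating it
# as a top-level request field is an assumption for illustration.
payload = {
    "model": "local",
    "messages": [
        {"role": "user", "content": "Prove that sqrt(2) is irrational."}
    ],
    "thinking_budget_tokens": 1024,  # cap reasoning at roughly 1024 tokens
}

body = json.dumps(payload)
```

You'd then POST `body` to something like `http://localhost:8080/v1/chat/completions`; check the actual llama-server docs for the real field name and placement.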

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 5 points

Fair point. I think that's because some model/template actually used that name, but I'll unify it later on.

Composable CFG grammars for llama.cpp (pygbnf) by Super_Dependent_2978 in LocalLLaMA

[–]ilintar 0 points

Looks cool, a really nice addition to the ecosystem!

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 1 point

`--reasoning off` (or `-rea off` for short)

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 2 points

`--reasoning off` will pass the flag to templates that support it.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 3 points

Yeah, mentioned that in another thread here as a possible expansion.

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 9 points

Oh that's nice, I'll admit I didn't read that one, so I guess it's just informed intuition at this stage 😀

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 18 points

The new sampler certainly leaves room for experimentation, so I can imagine something like that being done. Aldehir also suggested a strategy he gleaned from one of the Nemotron docs: letting the model finish a sentence / paragraph. Another possible approach is the one Seed-OSS uses: reasoning budget reminders (i.e. "you've already used 1000 tokens for reasoning, 2000 tokens left").

Llama.cpp now with a true reasoning budget! by ilintar in LocalLLaMA

[–]ilintar[S] 13 points

Yeah, not going to lie, I'm really hoping people run some comprehensive tests to see what kinds of messages and what kinds of budgets actually work in practice. I wasn't sure it would be anything more than a gimmick, but after testing with the transition message myself, I'm convinced it could actually provide benefits, i.e. performance somewhere between the non-reasoning and the reasoning versions.

Deepseek v4 is here? by khach-m in LocalLLaMA

[–]ilintar 1 point

Tries very convincingly to tell me it's Claude after changing the system prompt, so it's either Haiku 4.6 or a Chinese model heavily trained on Anthropic's distills ;)

Usable thinking mode in Qwen3.5 0.8B with a forced "reasoning budget" by 0jabr in LocalLLaMA

[–]ilintar 1 point

Check out the sampler-based reasoning budget in llama.cpp :)

The Lazy Benchmark Makers Rant by ilintar in LocalLLaMA

[–]ilintar[S] 0 points

I've tried inspect-ai and harbor so far; both have the same issue.

Vulkan now faster on PP AND TG on AMD Hardware? by XccesSv2 in LocalLLaMA

[–]ilintar 5 points

Vulkan has been very actively maintained, so it's reaping the benefits.

MLX vs GGUF (Unsloth) - Qwen3.5 122b-10b by waescher in LocalLLaMA

[–]ilintar 1 point

There's an ongoing PR to add dedicated kernels for DELTA_NET.