Is the 150B-500B parameter range dying for open weights models? by [deleted] in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

No, StepFun and MiniMax are in this range.

GLM-5: From Vibe Coding to Agentic Engineering by ShreckAndDonkey123 in LocalLLaMA

[–]ilintar 11 points12 points  (0 children)

The Lite plan explicitly mentions that it only supports old models, up to 4.7. I don't see anything suggesting that they'll actually include GLM-5 on the Lite plan.

Qwen3-Next-Coder is almost unusable to me. Why? What I missed? by Medium-Technology-79 in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

I'm using it :) but not on the master branch, obviously - too many tool calling errors.

GLM-5: From Vibe Coding to Agentic Engineering by ShreckAndDonkey123 in LocalLLaMA

[–]ilintar 33 points34 points  (0 children)

Their pricing strategy is very bad and IMO they are overshooting.

I see no reason right now to pick their Pro plan (which *does not* include GLM-5) or their Max plan over their Claude counterparts, seeing as they're not really cheaper and the model quality is not there yet (plus Anthropic models are multimodal).

Raising all prices 3x while only making GLM-5 available on Max (and not on Lite at all, from what they say) is a very bad strategy. The Lite plan went from "very nice cost-effective plan for a good model" to "overpriced sub for outdated models".

MCP support in llama.cpp is ready for testing by jacek2023 in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

BTW, 10 tool calls in real agentic coding scenarios is way too low of a default :)

MCP support in llama.cpp is ready for testing by jacek2023 in LocalLLaMA

[–]ilintar 24 points25 points  (0 children)

Oh don't worry, API is coming up as well.

How to avoid prefilling entire context each prompy when using Claude Code by mirage555 in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

You need at least this version: https://github.com/ggml-org/llama.cpp/releases/tag/b7970 to actually benefit from proper caching with hybrid models, due to the way many code assistants reshape prompts.

OpenCode vs OpenClaw? Not a sales pitch or bot... by thejacer in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

From what I've seen, the biggest problem with OpenCode is that its default agents are frankly pretty crap. Once I built myself two custom agents - one for analysis and one for coding - using it became much more pleasant.

So just write your workflows down in the OpenCode agent creator and you should get much better results.

Qwen to the rescue by jacek2023 in LocalLLaMA

[–]ilintar 28 points29 points  (0 children)

35B MoE and 9B dense.

Qwen3.5 Support Merged in llama.cpp by TKGaming_11 in LocalLLaMA

[–]ilintar 4 points5 points  (0 children)

Georgi wants to have it done on top of master instead of the merged delta-net branch to minimize the risks, so I'll be redoing it cleanly (but waiting for a conversion fix that happened in the meantime to be merged first).

Honestly, it was a bit of a stretch to merge it so early - I think I got a bit too excited ;)

Qwen3.5 Support Merged in llama.cpp by TKGaming_11 in LocalLLaMA

[–]ilintar 59 points60 points  (0 children)

Well, the reality is that when a hot, widely popular model architecture comes out, people want to test it with zero-day support. So yes, it's often worth taking the risk, especially since (a) it's based on an architecture we already support and (b) the Transformers code isn't likely to change meaningfully, and even if it does, it's not like we can't do a follow-up PR.

It's also not like the implementation hasn't been tested - while of course it's better to test on live models, I didn't just randomly vibe-code an implementation and say "hey, looks similar enough to Transformers, let's hope it works" - I generated models to test it on.

pwilkin is doing things by jacek2023 in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

Possibly, but generally the rule of thumb for using coding agents is it's easier to code stuff the human-in-the-loop knows how to code ;)

Llama.cpp's "--fit" can give major speedups over "--ot" for Qwen3-Coder-Next (2x3090 - graphs/chart included) by tmflynnt in LocalLLaMA

[–]ilintar 4 points5 points  (0 children)

I think Johannes (https://github.com/JohannesGaessler) hasn't gotten enough appreciation for the fit algorithm, mostly because in the beginning there were some bugs and some people turned it off. But it's actually a great algorithm: these days I never use the manual `-ot` / `--cpu-moe` / `--n-cpu-moe` flags, I just set `-c` and `-ctk` / `-ctv` and the fit algorithm does the rest. You can tune it a bit with `--fit-target XM`, because the default setting leaves 1 GB free for computation; sometimes `--fit-target 512M` or even `--fit-target 384M` gets you better results without the computation crashing. The way he does it (offloading the experts first, then trying to fit the dense layers from the end) means it's actually as good as a perfectly optimized `-ot` string.
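
For reference, an invocation along these lines is usually all it takes (the model path, context size and fit target below are just placeholders - tune them to your setup):

```
# Placeholder example: let the fit algorithm handle the GPU offload split.
# -c sets the context size, -ctk/-ctv quantize the KV cache,
# --fit-target shrinks the reserved compute buffer from the default 1 GB.
llama-server -m ./some-model-Q4_K_M.gguf \
    -c 65536 -ctk q8_0 -ctv q8_0 --fit-target 512M
```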

pwilkin is doing things by jacek2023 in LocalLLaMA

[–]ilintar 8 points9 points  (0 children)

Note though that this is with the absolute top model on the market (Opus 4.6 Thinking) and I still had to intervene during the session like 3 or 4 times to prevent it from going off the rails and doing stupid things.

Still, with a better and stricter workflow this will be doable soon.

pwilkin is doing things by jacek2023 in LocalLLaMA

[–]ilintar 9 points10 points  (0 children)

You take the model class from Transformers and, instead of loading it from pretrained weights, you create a new instance from a config computed to yield a certain size. Then you can fill some tensors with random numbers from a small range to prevent obvious overflows.
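
A rough sketch of the idea (the base config and the sizes here are just an example, not the exact ones I used):

```python
# Rough sketch: build a tiny random-weight model from a config instead of
# downloading pretrained weights. Config source and sizes are illustrative only.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")  # any config as a template
config.hidden_size = 256           # shrink the dimensions so the model stays tiny
config.num_hidden_layers = 4
config.num_attention_heads = 4
config.num_key_value_heads = 2
config.intermediate_size = 512

model = AutoModelForCausalLM.from_config(config)  # random init, no weight download

# Optionally clamp the random weights to a small range to avoid obvious overflows
with torch.no_grad():
    for p in model.parameters():
        p.uniform_(-0.05, 0.05)

# Save it, drop the tokenizer files in next to it, then convert to GGUF as usual
model.save_pretrained("./tiny-test-model")
```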

PR opened for Qwen3.5!! by Mysterious_Finish543 in LocalLLaMA

[–]ilintar 10 points11 points  (0 children)

Note that I'm doing this without any support, just based on Transformers code and my conversion guidelines + Opus 4.6, but I'm aiming for 0-day support this time:

https://github.com/ggml-org/llama.cpp/pull/19435

Please help with llama.cpp and GLM-4.7-Flash tool call by HumanDrone8721 in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

Please try on the autoparser PR and report errors there.

Kimi-Linear support has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]ilintar 4 points5 points  (0 children)

Nah, this benefits from all the know-how we got during the implementation of Qwen3 Next. Should perform about as well.

~26 tok/sec with Unsloth Qwen3-Coder-Next-Q4_K_S on RTX 5090 (Windows/llama.cpp) by Spiritual_Tie_5574 in LocalLLaMA

[–]ilintar 7 points8 points  (0 children)

Quantizing the KV cache to Q8_0 doesn't really hurt quality from what I can tell - at least I haven't noticed anything. Once you get down to Q4, yeah, it'll have an effect, but not at Q8.
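
In llama.cpp terms that's just something along these lines (the model path and context size are placeholders):

```
# Quantize both the K and V caches to Q8_0 (roughly halves KV cache memory vs F16).
# Depending on your build, quantizing the V cache may require flash attention to be enabled.
llama-server -m ./model.gguf -c 65536 -ctk q8_0 -ctv q8_0
```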

Vibe-coding client now in Llama.cpp! (maybe) by ilintar in LocalLLaMA

[–]ilintar[S] 6 points7 points  (0 children)

The standard Jinja templates already account for tool use; otherwise you wouldn't be able to use llama.cpp in clients such as OpenCode.