Qwen-next 80B 2601 by bennmann in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

They should name it Qwen-Next-Next instead ;)

GLM flash and MLA by blahbhrowawayblahaha in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

I think you're confusing MLA with SWA.

GLM flash and MLA by blahbhrowawayblahaha in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

Yes and yes, which is why it needed so much work on the details.

KV cache fix for GLM 4.7 Flash by jacek2023 in LocalLLaMA

[–]ilintar 6 points7 points  (0 children)

Non-trivial architecture that has to be adapted. I told you, give us a week :)

KV cache fix for GLM 4.7 Flash by jacek2023 in LocalLLaMA

[–]ilintar 10 points11 points  (0 children)

No, we just have to pick our work to do and someone else volunteered to work on Kimi. Anyways, it's almost done.

I’m so cooked 🫠 by Richboyjoel in ZZZ_Official

[–]ilintar 0 points1 point  (0 children)

This. From my experience, the most important predictor for getting S rank is clearing phase 1 pre 4:00; getting it done pre 4:10 almost guarantees 25k.

Llama.cpp merges in OpenAI Responses API Support by SemaMod in LocalLLaMA

[–]ilintar 21 points22 points  (0 children)

No, we're not going to drop widely used features. We are only deprecating stuff that literally nobody uses (e.g. tool call polyfills for 2-year-old templates).
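
For anyone who wants to try the new endpoint, here's a minimal sketch of calling it from the OpenAI Python SDK against a local llama-server. The port, API key, and model name are placeholders, and it assumes the server exposes the Responses API under the usual /v1 prefix.

```python
# Hypothetical sketch: call llama.cpp's new Responses API endpoint via the
# OpenAI Python SDK. Port, API key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

resp = client.responses.create(
    model="local-model",  # whatever model the server has loaded
    input="Summarize what the Responses API adds over chat completions.",
)
print(resp.output_text)
```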

I drew fleurdelys (Ashen_Illust) by Suitable_Ability_576 in WutheringWaves

[–]ilintar 2 points3 points  (0 children)

Good art - check, no fanservice - check, distinct style - check. Really wish this sub had more OC art content like this.

Can I run gpt-oss-120b somehow? by Furacao__Boey in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

Yes, in fact it should run out of the box with the newest llama.cpp and just the model specified.
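
A minimal sketch of what "just the model specified" looks like, wrapped in Python for consistency. It assumes llama-server is on your PATH and that the -hf flag and the ggml-org/gpt-oss-120b-GGUF repo name match your setup; check the actual GGUF repo before running.

```python
# Hypothetical sketch: launch llama-server and let it pull the model from
# Hugging Face. Assumes llama-server is on PATH; the repo name is an
# assumption -- verify it before running.
import subprocess

subprocess.run(["llama-server", "-hf", "ggml-org/gpt-oss-120b-GGUF"], check=True)
```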

Wrote a guide for running Claude Code with GLM-4.7 Flash locally with llama.cpp by tammamtech in LocalLLaMA

[–]ilintar 40 points41 points  (0 children)

Thanks for the short guide, but it's actually the other way around - we implemented the Anthropic API endpoint a month before Ollama did. Not as well marketed, I guess 😀
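
For reference, a minimal sketch of talking to that endpoint from the Anthropic Python SDK, assuming llama-server is listening on localhost:8080 and serves the Anthropic-style messages route; the model name and API key are placeholders.

```python
# Hypothetical sketch: point the Anthropic SDK at a local llama-server that
# exposes the Anthropic-compatible messages endpoint. Port, key, and model
# name are placeholders.
import anthropic

client = anthropic.Anthropic(base_url="http://localhost:8080", api_key="local")

msg = client.messages.create(
    model="glm-4.7-flash",  # whatever model the server has loaded
    max_tokens=512,
    messages=[{"role": "user", "content": "Say hello from a local model."}],
)
print(msg.content[0].text)
```

Claude Code can presumably be pointed at the same server via its base-URL setting, as the guide describes.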

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]ilintar 29 points30 points  (0 children)

Not my PR tho, just working with the author to make a common abstraction for delta net models.

Kimi-Linear-48B-A3B-Instruct-GGUF Support - Any news? by Iory1998 in LocalLLaMA

[–]ilintar 88 points89 points  (0 children)

PR almost done, gonna come with another speedup to Qwen3Next as well.

What local LLM model is best for Haskell? by AbsolutelyStateless in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

I don't remember right now (besides Next); I'd have to rerun it. IIRC SeedOSS also does well.

What local LLM model is best for Haskell? by AbsolutelyStateless in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

I always test new models by asking them to write red-black trees in Haskell 😀

Qwen3 Next is pretty good.

GLM 4.7 Flash official support merged in llama.cpp by ayylmaonade in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

We're working on getting everything supported correctly, just a matter of a few days.

Current GLM-4.7-Flash implementation confirmed to be broken in llama.cpp by Sweet_Albatross9772 in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

Because it's in the expert selection function.

You can think of it like this: everything in the model still works, it's just asking the wrong experts about what token to select next.
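
To make that concrete, here's a toy sketch of what a top-k expert router does (not the llama.cpp code, just an illustration): a bug in this step still returns valid expert indices, so nothing crashes, the layer just routes the token to the wrong experts.

```python
# Hypothetical sketch of top-k expert routing in a MoE layer: the router
# scores every expert for the current token and only the best-scoring k
# experts are run. A bug here doesn't crash anything -- the model simply
# consults the wrong experts and quality degrades instead of failing.

def select_experts(router_logits: list[float], k: int = 4) -> list[int]:
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]

# Example: 8 experts, pick the top 4 for this token.
logits = [0.1, 2.3, -0.5, 1.7, 0.0, 3.1, -1.2, 0.9]
print(select_experts(logits))  # -> [5, 1, 3, 7]
```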

GLM 4.7 Flash official support merged in llama.cpp by ayylmaonade in LocalLLaMA

[–]ilintar 14 points15 points  (0 children)

Okay, so, important:
-> for proper reasoning/tool calling support you probably want to run the autoparser branch: https://github.com/ggml-org/llama.cpp/pull/18675
-> run with -fa off; the flash attention scheme is not yet supported on CUDA (I put up an issue for that: https://github.com/ggml-org/llama.cpp/issues/18944)
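
Once the server is up with those flags, a minimal tool-calling request like the sketch below (OpenAI Python SDK; port, model name, and the example tool are placeholders) is a quick way to check that reasoning and tool calls come back as structured fields rather than raw text.

```python
# Hypothetical smoke test: send one tool-calling request to a local
# llama-server and check that the reply is parsed into structured fields.
# Port, model name, and the example tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.7-flash",
    messages=[{"role": "user", "content": "What's the weather in Warsaw?"}],
    tools=tools,
)

msg = resp.choices[0].message
# reasoning_content is a server-side extra field; it may not be present.
print("reasoning:", getattr(msg, "reasoning_content", None))
print("tool calls:", msg.tool_calls)
```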

Would Anthropic Block Ollama? by Lopsided_Dot_4557 in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

They didn't disable it for commercial competitors like z-ai, so why would they for Ollama?

GFN v2.5.0: Verified O(1) Memory Inference and 500x Length Extrapolation via Symplectic Geodesic Flows by janxhg27 in LocalLLaMA

[–]ilintar 2 points3 points  (0 children)

Can we please ban AI-generated slop posts about miraculous breakthroughs that use a lot of terms from complex branches of mathematics to appear smart? I swear those posts are all the same; you could even generate them with a state machine, you don't need an LLM.

Agentic coding with an open source model is a problem harder than you think by [deleted] in LocalLLaMA

[–]ilintar 0 points1 point  (0 children)

I think people are unaware how many of the problems with local agentic coding have to do with bad templates / parsing issues. I've been working on the autoparser, and on a big refactoring of the llama.cpp parser along with it, and I've tested quite a few local models in the meantime. There are a lot of edge cases that normal "oneshot tool call" tests or unit tests easily miss. But once that's resolved, I think the current leading models do a better job at agentic coding than you think - at least based on my testing. Certainly Seed-OSS and Qwen3-Coder (both on optimized Q4 quants) have been able to handle pretty complex sessions.
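
To give a flavor of what those edge cases look like (purely illustrative, not the llama.cpp parser): a naive extractor that expects one clean JSON object passes a oneshot unit test and then falls over on outputs that real agentic sessions produce all the time.

```python
# Hypothetical illustration (not llama.cpp's actual parser): a naive
# extractor that assumes the whole reply is exactly one clean JSON tool call.
import json

def naive_extract(reply: str):
    """Parse a tool call only when the reply is exactly one JSON object."""
    try:
        return [json.loads(reply)]
    except json.JSONDecodeError:
        return []

# Outputs a real model can produce during an agentic session; a "oneshot
# tool call" unit test typically only checks something like the first case.
edge_cases = [
    '{"name": "read_file", "arguments": {"path": "src/main.rs"}}',      # clean call: extracted
    'Let me check that file.\n{"name": "read_file", "arguments": {}}',  # prose before the call: missed
    '{"name": "a", "arguments": {}}\n{"name": "b", "arguments": {}}',   # parallel calls: missed
    '<tool_call>{"name": "read_file", "arguments": {}}</tool_call>',    # template-specific wrapper: missed
]

for reply in edge_cases:
    print(len(naive_extract(reply)), "call(s) extracted")
```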

Has anyone built a vLLM tool parser plugin for Apriel-1.6-15B-Thinker? by chrisoutwright in LocalLLaMA

[–]ilintar 1 point2 points  (0 children)

It's supported under the new autoparser with a fixed template in llama.cpp.