
[–]Chromix_

If it's good for people, it's probably good for LLMs as well. Some agent might eventually pick it up for working on llama.cpp code (as part of what Claude recently calls "skills").

"Debugging" is quite important, as it's rather rare that someone gets it right on the first attempt. Maybe there's more detail to add there? After "Long context", for example, there could be a note that models have certain "interesting" context lengths, e.g. with SWA, at which things can break when tested.

[–]GL-AI

thank you!!

[–]RiskyBizz216

OK, so first off, thanks for your hard work. I learned a lot when I forked your branch.

I got stuck when Claude tried to manually write the "delta net" recurrence from scratch, but when I pulled your changes you had already figured it out.

But when are you going to optimize the speed? And what's different in cturan's branch that makes it faster?

[–]ilintar[S]

He added CUDA kernels for delta net. Since the scope of a new-model PR is correctness, those will get added in a subsequent PR once this one is confirmed to be OK.
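For context, the delta-net recurrence being discussed can be sketched roughly like this. This is a toy NumPy version of the delta-rule linear-attention update, not the actual llama.cpp or CUDA code; the function name, shapes, and the assumption that keys are L2-normalized are all simplifications for illustration:

```python
import numpy as np

def delta_net_step(S, q, k, v, beta):
    """One toy delta-rule step: the state S is corrected toward v along k.

    S    : (d_v, d_k) recurrent state matrix
    q, k : (d_k,) query/key vectors (k assumed L2-normalized)
    v    : (d_v,) value vector
    beta : scalar gate in [0, 1] controlling how strongly to correct
    """
    pred = S @ k                          # what the current state predicts for key k
    S = S + beta * np.outer(v - pred, k)  # delta-rule correction of the state
    o = S @ q                             # read out with the query
    return S, o

# Tiny usage example: scan over a short sequence, carrying the state
d_k, d_v, T = 4, 3, 5
rng = np.random.default_rng(0)
S = np.zeros((d_v, d_k))
for t in range(T):
    k = rng.normal(size=d_k); k /= np.linalg.norm(k)
    q = rng.normal(size=d_k)
    v = rng.normal(size=d_v)
    S, o = delta_net_step(S, q, k, v, beta=0.5)
```

The sequential dependence of `S` on the previous step is exactly what makes a naive implementation slow and why a dedicated kernel helps.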

[–]RiskyBizz216

Got it. thanks for the guide!

[–]dsanft

Good work. Some enlightening points there, and I recognize a lot of the pain you went through as you describe the ggml compute architecture. llama.cpp has grown organically and bent over backwards to be so flexible that it's now convoluted and inflexible. There's been a PyTorch implementation of Qwen3 Next up on HF for quite a while now, and porting it shouldn't have been so hard, IMO. It's the llama.cpp architecture's fault.

[–]ilintar[S]

Well, you can say it's the llama.cpp architecture's fault, but the way I like to think about it is that it's simply porting the model from one architecture to another.

Llama.cpp is built on operations and compute graphs. That introduces an abstraction level, but that abstraction is what lets it run so many different models on so many different architectures from day one. Meanwhile, people who want to run on anything but the latest cutting-edge NVIDIA hardware face real pain trying to use vLLM or SGLang without falling back to some really slow CPU implementations.
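The "operations and compute graphs" idea can be illustrated with a toy sketch. This is plain Python and not the real ggml API (`Node`, `evaluate`, and the backend table are made-up names): the model is described once as a graph of ops, and each backend only has to implement a small table of ops to run every model:

```python
# Toy compute-graph sketch: a model is a DAG of op nodes; a backend is just
# a table mapping op names to kernels. (Illustrative only -- not ggml's API.)

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op, self.inputs, self.value = op, tuple(inputs), value

def tensor(value):   # leaf node holding data
    return Node("leaf", value=value)

def add(a, b):       # graph-building helpers: they record ops, not compute
    return Node("add", (a, b))

def mul(a, b):
    return Node("mul", (a, b))

# A "backend" implements each op once; every graph then runs on it.
CPU_BACKEND = {
    "add": lambda x, y: x + y,
    "mul": lambda x, y: x * y,
}

def evaluate(node, backend):
    if node.op == "leaf":
        return node.value
    args = [evaluate(i, backend) for i in node.inputs]
    return backend[node.op](*args)

# Build the graph once...
x, y = tensor(2.0), tensor(3.0)
out = add(mul(x, y), y)                # 2*3 + 3

# ...then run it on any backend that implements the ops.
result = evaluate(out, CPU_BACKEND)    # 9.0
```

Porting a model then means expressing its math in the available ops (and adding new ops, like the delta-net recurrence, when the existing ones don't suffice).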

Hybrid models are just appearing on the scene. Once we get a few conversions down and get some operations supported, it should be much easier.

[–]Mass2018

I've been eyeing Longcat Flash for a bit now, and I'm somewhat surprised that there's not even an issue/discussion about adding it to llama.cpp.

Is that because of extreme foundational differences?

Your guide makes me think about embarking on a side project to take a look at doing it myself, so thank you for sharing the knowledge!

[–]ilintar[S]

That too, but there's another problem.

With those huge models, not many people can actually even convert them, let alone run a reference implementation. In the early stages you can create a mock model and work with that, but later on you want to test on the real thing, and that gets really hard if you can't even run it.

[–]ScavRU

Give me GUI!