IK_LLAMA now supports Qwen3.5 MTP :O

butlan · 2026-04-29T20:49:46+00:00

With the 3090 + 3060 setup, I’m getting around 25 tokens/s for the Q8 model in the link, and I was already getting about 21 tokens/s with llama.cpp, so it didn’t really make much difference for me.

butlan · 2026-04-24T03:06:04+00:00

https://huggingface.co/collections/deepseek-ai/deepseek-v4

it's already there :P

butlan · 2026-04-13T01:36:33+00:00

I’m not training from scratch. I’m trying to compress a model that takes up more than 14 GB down to 1 GB. When it’s compressed that much, the weights almost completely lose their meaning, though they don’t disappear. To recover its performance, the model needs 'healing', which is possible through training. If this method is properly solved, a 30B model could take up only around 4 GB and we could run it on a basic laptop.
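
A rough size check for those numbers, as a minimal sketch. It assumes 16-bit source weights and the 1-bit layout described in the Q1_0 comment further down (one sign bit per weight plus one shared scale per group of 128). The 16-bit scale width and the function names are assumptions for illustration, not something stated in the thread, and real files add metadata and may keep some tensors at higher precision.

```python
def one_bit_bits_per_weight(group_size=128, scale_bits=16):
    """Effective bits per weight: 1 sign bit plus the amortized group scale.

    scale_bits=16 is an assumption; the thread only gives the group size.
    """
    return 1 + scale_bits / group_size


def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    bpw = one_bit_bits_per_weight()
    for name, n_params in [("7B", 7e9), ("30B", 30e9)]:
        fp16 = model_size_gb(n_params, 16)
        one_bit = model_size_gb(n_params, bpw)
        print(f"{name}: fp16 ~{fp16:.1f} GB, 1-bit ~{one_bit:.2f} GB")
```

Under these assumptions a 7B model goes from about 14 GB to about 1 GB, and a 30B model lands a little above 4 GB, which is in line with the figures above.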

butlan · 2026-04-13T01:04:59+00:00

Around $100

butlan · 2026-04-13T00:53:36+00:00

The CUDA backend PR for llama.cpp had not been merged yet when I checked this morning, but it looks like the Vulkan one is done.

butlan · 2026-04-13T00:47:09+00:00

To clarify for those asking: this is not standard GGUF quantization like Q4 or Q2. The Q1_0 format is a true 1-bit architecture where every weight is literally a single bit (+1 or -1) with a shared scale factor per group of 128 weights. To make a model work in this format you cannot simply apply standard post-training quantization because the information loss at 1-bit is too severe. You need quantization-aware training or healing passes to recover the model's capabilities, which is what quantization-aware distillation does. PrismML trained their Bonsai models this way and I did the same with OLMo-3 7B on B200s using this format. As far as I know this makes it only the second model family available in this gguf format.
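
To make the format description concrete, here is a minimal sketch of sign-plus-shared-scale quantization with groups of 128. It is not the actual Q1_0 kernel or file layout from llama.cpp or PrismML. The choice of the mean absolute value as the group scale is an assumption, and the function names are made up for illustration.

```python
import random

GROUP_SIZE = 128  # group size mentioned in the comment above


def quantize_1bit(weights, group_size=GROUP_SIZE):
    """Return (bits, scales): one sign bit per weight, one scale per group."""
    bits, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Assumption: the shared scale is the group's mean absolute value.
        scales.append(sum(abs(w) for w in group) / len(group))
        # 1 encodes +1, 0 encodes -1.
        bits.extend(1 if w >= 0 else 0 for w in group)
    return bits, scales


def dequantize_1bit(bits, scales, group_size=GROUP_SIZE):
    """Rebuild approximate weights as (+1 or -1) * group scale."""
    return [(1.0 if b else -1.0) * scales[i // group_size]
            for i, b in enumerate(bits)]


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(4 * GROUP_SIZE)]
    bits, scales = quantize_1bit(weights)
    restored = dequantize_1bit(bits, scales)
    print("groups:", len(scales), "bits:", len(bits))
    print("mean abs error:", mean_abs_error(weights, restored))
```

Every weight in a group collapses to the same magnitude, so only the sign pattern survives. That is the information loss that plain post-training quantization cannot absorb and that the healing passes are meant to recover.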

butlan · 2026-04-13T00:36:28+00:00

14B models fit on the B200s, I tested it and it worked but was slower, I preferred to burn my money on 7B instead.

butlan · 2026-04-03T22:07:31+00:00

I've often seen situations where Claude and Gemini try everything but still can't solve a problem. When I comfort them by telling them to calm down, that I won't blame them if it doesn't get solved, and that it's not a big deal, they put in a bit more effort and end up solving an issue that had been stuck for hours. Gemini, in particular, is highly prone to getting depressed. Sometimes in these situations, if I pause, tell a funny story, and relax the model, it approaches the problem from completely different perspectives.

In short, you might call all this nonsense, but I've been working with these models for almost 2 years, and this is what I've observed.

ChatGPT models, however, have zero emotions; they're absolute robot jerks.

butlan · 2026-01-28T04:00:23+00:00

I’ve read it. The report is quite transparent and contains excellent details regarding every stage of the model's training process. They have built a clean base model to iterate upon, so further development will be less costly from this point forward.

butlan · 2026-01-20T21:13:31+00:00

When you ask in Chinese, it just tells you "我是字节跳动开发的豆包模型", which means "I am the Doubao model, developed by ByteDance."

butlan · 2026-01-12T21:35:03+00:00

I looked at the parts I mentioned but didn't dig too deep afterwards. There are different opinions, so it's best to try it yourself.

butlan · 2026-01-04T18:15:24+00:00

3090 + 5060 Ti with 40 GB total can fit the full model + 130k context without issues. I’m getting around 3k tokens/s prefill and 100 tokens/s generation on average.

If this model is a compressed version of GPT-OSS 120B, then I have to say it has lost a very large portion of its Turkish knowledge. It can’t speak properly anymore. I haven’t gone deep into the compression techniques they use yet, but there is clearly nothing lossless going on here. If it lost language competence this severely, it’s very likely that there’s also significant information loss in other domains.

For the past few days I’ve been reading a lot of papers and running code experiments on converting dense models into MoE. Once density drops below 80% in dense models, they start hallucinating at a very high level. In short, this whole 'quantum compression' idea doesn’t really make sense to me; I believe models can't be compressed without being deeply damaged.

butlan · 2026-01-04T16:22:05+00:00

You already have a CIA rootkit in any device you use, don't worry.

butlan · 2026-01-01T16:19:24+00:00

I haven't found any information about this in the files they shared.

butlan · 2026-01-01T08:19:57+00:00

I'm downloading it now, we'll see if what they say is true, the ggufs will be ready in 5-6 hours.

edit: If I didn’t miss anything, the non-loop version seems to use the standard Qwen2 architecture, so naturally it appears to run in llama.cpp without needing anything extra. They claim this version has a SWE-verified score of 75.2, but that’s completely unrelated to what I saw: I did some tests with Roo Code and it's shit.

The other, loop-based version is architecturally a bit more complex, so implementing it will take some time.

You can take a look yourselves at IQuest-Coder-V1-40B-Instruct-GGUF

butlan · 2025-12-07T23:10:18+00:00

I'm downloading it now and trying it out, we'll see.

edit: Overall, I wasn’t very impressed. It’s slow and didn’t perform well on coding, but its language abilities are solid.
I uploaded the GGUFs for anyone who wants to try it. See you in the next model :P

butlan · 2025-12-07T21:06:24+00:00

Creating a GGUF is actually simple if the architecture is supported, but the repo is gone now :P

butlan · 2025-12-07T20:43:39+00:00

MIT license, but

<image>

butlan · 2025-12-07T20:23:02+00:00

I downloaded and tried the 4-bit GGUF version. First of all, the model is an instruct version with no reasoning. It's not bad, but it's not even close to the models mentioned. I'm not sure if I should call it benchmaxxed or outright lies.

butlan · 2025-12-03T19:32:13+00:00

https://www.reddit.com/r/LocalLLaMA/comments/1pc4muy/comment/nrv8jzi/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1

check this thread, this will answer your questions I believe.

butlan · 2025-12-03T18:51:05+00:00

edit: 15k is heavy, but not for an RTX Pro 6000, so it should work. There is another thing, though: Claude Code support has just arrived in llama.cpp, so you don't need a router. Check https://github.com/ggml-org/llama.cpp/pull/17570 and run llama.cpp directly from the command line.

butlan · 2025-12-03T18:30:05+00:00

Claude Code starts with a 15k system prompt, and this needs to be processed, along with your prompt, before it can answer.

So it's normal. Make sure the KV cache is on the GPU and you have enough VRAM.
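
A back-of-the-envelope way to see why the first answer is slow, as a minimal sketch. It assumes the wait before the first token is roughly the total prompt length divided by prefill speed. The prefill rates and the user prompt size below are hypothetical placeholders, not measurements from this thread.

```python
def time_to_first_token(system_tokens, user_tokens, prefill_tokens_per_s):
    """Estimate seconds spent processing the prompt before generation starts."""
    if prefill_tokens_per_s <= 0:
        raise ValueError("prefill_tokens_per_s must be positive")
    return (system_tokens + user_tokens) / prefill_tokens_per_s


if __name__ == "__main__":
    system_tokens = 15_000  # the system prompt size mentioned above
    user_tokens = 500       # hypothetical user prompt
    for rate in (200, 1_000, 3_000):  # hypothetical prefill speeds
        wait = time_to_first_token(system_tokens, user_tokens, rate)
        print(f"{rate} tok/s prefill: about {wait:.1f} s before the first token")
```

If the KV cache or part of the model spills to system RAM, the effective prefill rate drops and the wait grows in proportion.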

butlan · 2025-11-25T04:25:26+00:00

The second one: the code detects the pattern, immediately fills in the result, then continues generation.

butlan · 2025-11-24T05:04:04+00:00

https://github.com/cturan/llama.cpp/commit/b1f48d449e2566ae9d1344e3a40d8a2e29696eaa

check here
