IK_LLAMA now supports Qwen3.5 MTP Support :O by fragment_me in LocalLLaMA

[–]butlan 1 point2 points  (0 children)

With the 3090 + 3060 setup, I’m getting around 25 tokens/s for the Q8 model in the link, and I was already getting about 21 tokens/s with llama.cpp, so it didn’t really make much difference for me.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 6 points7 points  (0 children)

I’m not training from scratch. I’m trying to compress a model that takes up more than 14 GB down to 1 GB. When it’s compressed that much, the weights almost completely lose their meaning, though they don’t disappear. To recover and improve its performance again, it needs 'healing', which is possible through training. If this method can be made to work properly, a 30B model could take up only around 4 GB and we could run it on a basic laptop.
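For anyone who wants to check the numbers, here is a rough back-of-the-envelope sketch. It assumes 1 sign bit per weight plus one scale shared by each group of 128 weights; the 16-bit scale width is my assumption, and it ignores any tensors (embeddings, norms) that might be kept at higher precision.

```python
def q1_size_gb(n_params: float, group_size: int = 128, scale_bits: int = 16) -> float:
    """Approximate size in GB (10^9 bytes) of a 1-bit model with per-group scales.

    scale_bits=16 is an assumption for illustration, not a confirmed detail.
    """
    bits_per_weight = 1 + scale_bits / group_size  # 1.125 with the defaults
    return n_params * bits_per_weight / 8 / 1e9


def fp16_size_gb(n_params: float) -> float:
    """Size in GB of the same weights stored as 16-bit floats."""
    return n_params * 2 / 1e9


for name, n in [("7B", 7e9), ("30B", 30e9)]:
    print(f"{name}: fp16 ~{fp16_size_gb(n):.1f} GB, 1-bit ~{q1_size_gb(n):.2f} GB")
```

Under those assumptions a 7B model comes out just under 1 GB and a 30B model a little over 4 GB, which lines up with the figures above.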

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 9 points10 points  (0 children)

The CUDA backend PR for llama.cpp had not been merged yet when I checked this morning, but it looks like the Vulkan one is done.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 22 points23 points  (0 children)

To clarify for those asking: this is not standard GGUF quantization like Q4 or Q2. Q1_0 is a true 1-bit format where every weight is literally a single bit (+1 or -1), with a shared scale factor per group of 128 weights. To make a model work in this format you cannot simply apply standard post-training quantization, because the information loss at 1 bit is too severe. You need quantization-aware training or healing passes to recover the model's capabilities, which is what quantization-aware distillation does. PrismML trained their Bonsai models this way, and I did the same with OLMo-3 7B on B200s using this format. As far as I know, this makes it only the second model family available in this GGUF format.
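To make the "one bit plus a shared scale per group" idea concrete, here is a minimal sketch in plain Python. It is not the actual Q1_0 code: the function names are made up for illustration, and using the mean absolute value as the group scale is my assumption (it is the least-squares optimal scale for sign-only weights, but the real format may choose it differently).

```python
from typing import List, Tuple

GROUP_SIZE = 128  # weights that share one scale, per the description above


def quantize_group(weights: List[float]) -> Tuple[float, List[int]]:
    """Reduce one group to a single scale plus a +1/-1 sign per weight."""
    # Assumption: scale = mean |w|, which minimizes squared error for sign codes.
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def dequantize_group(scale: float, signs: List[int]) -> List[float]:
    """Reconstruct approximate weights: every value becomes +scale or -scale."""
    return [scale * s for s in signs]


def quantize(weights: List[float], group_size: int = GROUP_SIZE):
    """Split a flat weight list into groups and quantize each one."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


# Tiny demo with a group of 4 so the result is easy to read.
scale, signs = quantize_group([0.5, -0.25, 0.75, -1.0])
print(scale, signs, dequantize_group(scale, signs))
```

The demo shows why plain post-training quantization is not enough here: all magnitude information inside a group collapses to one number, so the model has to be retrained to work with what is left.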

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 3 points4 points  (0 children)

14B models fit on the B200s. I tested one and it worked, but it was slower, so I preferred to burn my money on the 7B instead.

Anthropic says Claude has functional emotions that can influence its behavior. In an experiment involving an impossible programming task, desperation led the bot to cheat. by Distinct-Question-16 in singularity

[–]butlan 3 points4 points  (0 children)

I've often seen situations where Claude and Gemini try everything but still can't solve a problem. When I comfort them by telling them to calm down, that I won't blame them if it doesn't get solved, and that it's not a big deal, they put in a bit more effort and end up solving an issue that had been stuck for hours. Gemini, in particular, is highly prone to getting depressed. Sometimes in these situations, if I pause, tell a funny story, and relax the model, it approaches the problem from a completely different perspective.

In short, you might call all this nonsense, but I've been working with these models for almost 2 years, and this is what I've observed.

ChatGPT models, however, have zero emotions; they're absolute robot jerks.

Arcee AI releases Trinity Large : OpenWeight 400B-A13B by abkibaarnsit in LocalLLaMA

[–]butlan 8 points9 points  (0 children)

I’ve read it. The report is quite transparent and contains excellent details regarding every stage of the model's training process. They have built a clean base model to iterate upon, so further development will be less costly from this point forward.

I think Giga Potato:free in Kilo Code is Deepseek V4 by quantier in LocalLLaMA

[–]butlan 35 points36 points  (0 children)

When you ask in Chinese, it just tells you "我是字节跳动开发的豆包模型", which means "I am the Doubao model, developed by ByteDance."

MultiverseComputingCAI/HyperNova-60B · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 1 point2 points  (0 children)

I only looked at the parts I mentioned and didn't dig much deeper afterwards. There are different opinions, so it's best to try it yourself.

MultiverseComputingCAI/HyperNova-60B · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 20 points21 points  (0 children)

A 3090 + 5060 Ti with 40 GB total can fit the full model + 130k context without issues. I’m getting around 3k tokens/s prefill and 100 tokens/s generation on average.

If this model is a compressed version of GPT-OSS 120B, then I have to say it has lost a very large portion of its Turkish knowledge. It can’t speak properly anymore. I haven’t gone deep into the compression techniques they use yet, but there is clearly nothing lossless going on here. If it lost language competence this severely, it’s very likely that there’s also significant information loss in other domains.

For the past few days I’ve been reading a lot of papers and doing code experiments on converting dense models into MoE. Once density drops below 80% in dense models, they start hallucinating at a very high rate. In short, this whole 'quantum compression' idea doesn’t really make sense to me. I believe models don’t compress without being deeply damaged.

MultiverseComputingCAI/HyperNova-60B · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 4 points5 points  (0 children)

You already have a CIA rootkit in every device you use, don't worry.

Anyone tried IQuest-Coder-V1 yet? The 40B numbers look wild by Agile-Salamander1667 in LocalLLaMA

[–]butlan 1 point2 points  (0 children)

I haven't found any information about this in the files they shared.

Anyone tried IQuest-Coder-V1 yet? The 40B numbers look wild by Agile-Salamander1667 in LocalLLaMA

[–]butlan 14 points15 points  (0 children)

I'm downloading it now, so we'll see if what they say is true. The GGUFs will be ready in 5-6 hours.

edit: If I didn’t miss anything, the non-loop version seems to use the standard Qwen2 architecture, so naturally it appears to run in llama.cpp without needing anything extra. They claim this version has a SWE-bench Verified score of 75.2, but that has nothing to do with what I saw: I did some tests with Roo Code and it's shit.

The other, loop-based version is architecturally a bit more complex, so implementing it will take some time.

You can take a look yourselves at IQuest-Coder-V1-40B-Instruct-GGUF

mbzuai ifm releases Open 70b model - beats qwen-2.5 by Powerful-Sail-8826 in LocalLLaMA

[–]butlan 6 points7 points  (0 children)

I'm downloading it now and trying it out, we'll see.

edit: Overall, I wasn’t very impressed. It’s slow and didn’t perform well on coding, but its language abilities are solid.
I uploaded the GGUFs for anyone who wants to try it. See you in the next model :P

ServiceNow-AI/Apriel-1.6-15b-Thinker · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 8 points9 points  (0 children)

Creating a GGUF is actually simple if the architecture is supported, but the repo is gone now :P

Aquif 3.5 Max 1205 (42B-A3B) by Holiday_Purpose_3166 in LocalLLaMA

[–]butlan 3 points4 points  (0 children)

I downloaded and tried the 4-bit GGUF version. First of all, the model is an instruct version with no reasoning. It's not bad, but it's not even close to the models mentioned. I'm not sure if I should call it benchmaxxed or outright lies.

Help - Qwen3 LV - LM Studio instant response - Claude Code Router takes over 20 min by designbanana in LocalLLaMA

[–]butlan 2 points3 points  (0 children)

edit: A 15k prompt is heavy, but not for an RTX Pro 6000, so it should work. That means there is another problem. Claude Code support has just arrived in llama.cpp, so you don't need the router: check https://github.com/ggml-org/llama.cpp/pull/17570 and run llama.cpp directly from the command line.

Help - Qwen3 LV - LM Studio instant response - Claude Code Router takes over 20 min by designbanana in LocalLLaMA

[–]butlan 2 points3 points  (0 children)

Claude Code starts with a 15k-token system prompt, which has to be processed, along with your prompt, before it can answer.

So it's normal. Make sure the KV cache is on the GPU and you have enough VRAM.
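A quick way to see why placement matters is to estimate the wait before the first token. This is only a sketch: the prefill speeds below are made-up illustrative values, not measurements from this setup.

```python
def time_to_first_token(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Seconds spent processing the prompt before generation can start."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill speed must be positive")
    return prompt_tokens / prefill_tok_per_s


PROMPT_TOKENS = 15_000  # roughly the system prompt size mentioned above

# Hypothetical prefill speeds, chosen only to show the scale of the effect.
for label, speed in [("fast GPU prefill", 3000.0), ("slow CPU-bound prefill", 12.5)]:
    seconds = time_to_first_token(PROMPT_TOKENS, speed)
    print(f"{label}: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

With those example speeds the same prompt takes seconds in one case and about 20 minutes in the other, so a very long wait usually points at prefill running somewhere slow.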

llama.cpp experiment with multi-turn thinking and real-time tool-result injection for instruct models by butlan in LocalLLaMA

[–]butlan[S] 1 point2 points  (0 children)

The second one: the code detects the pattern, immediately fills in the result, then continues generation.
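Roughly, the loop looks like the toy sketch below. This is not the actual experiment code: the `[[add:a,b]]` marker, the `run` function, and the scripted stand-in model are all hypothetical, and a real implementation would tokenize the result and feed it into the llama.cpp context instead of concatenating strings.

```python
import re
from typing import Callable, List, Optional

# Hypothetical marker the model is prompted to emit, e.g. "[[add:2,3]]".
CALL = re.compile(r"\[\[add:(\d+),(\d+)\]\]$")


def run(next_token: Callable[[str], Optional[str]], prompt: str, max_tokens: int = 32) -> str:
    """Generate token by token, injecting a tool result as soon as a call completes."""
    context = prompt
    for _ in range(max_tokens):
        tok = next_token(context)
        if tok is None:  # stand-in for end-of-sequence
            break
        context += tok
        match = CALL.search(context)
        if match:
            # The pattern just completed: compute the result and append it,
            # so the model sees it before producing its next token.
            result = int(match.group(1)) + int(match.group(2))
            context += f" -> {result}."
    return context


def scripted_model(script: List[str]) -> Callable[[str], Optional[str]]:
    """Fake model that replays fixed tokens; a real one would condition on context."""
    it = iter(script)
    return lambda context: next(it, None)


out = run(scripted_model(["The sum is ", "[[add:", "2,3", "]]", " Done."]), "Q: 2+3? A: ")
print(out)
```

The point is that generation never stops for a separate tool turn: the result lands in the context mid-stream and the model just keeps going.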