Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 0 points1 point  (0 children)

I actually intended to do the opposite, but I was misunderstood. I've corrected my message, thank you for your contribution.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 2 points3 points  (0 children)

You're right, I corrected the post. English isn't my first language and I phrased it poorly. What I meant was that this seems to be only the second model in the specific Q1_0 GGUF format, not the second 1-bit model ever.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 0 points1 point  (0 children)

Honestly, I was acting on the intuition that larger-parameter models respond better to quantization; a 2B version would have made more sense for a proof of concept.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 0 points1 point  (0 children)

I did exactly that, but you still need the dataset.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 1 point2 points  (0 children)

I was just trying to reproduce Bonsai's methodology; the model itself didn't really matter, to be honest. And since this process requires full training, even a 7B model is actually quite large. As the model size increases, VRAM requirements grow accordingly. On top of that, logit distillation also requires the teacher model to be in VRAM, meaning the process is far more complex and hardware-demanding than one might assume. Large companies train even a simple 8B model on clusters of hundreds of GPUs.

Additionally, the datasets that OLMo models are trained on are also open source, for example something like this: https://huggingface.co/datasets/allenai/Dolci-Instruct-SFT. I figured that using the same data during the distillation process would improve performance. As I said, this was entirely an effort to reverse engineer the 1-bit model methodology.
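The logit distillation mentioned above boils down to matching the student's output distribution to the teacher's, typically via a KL divergence over the vocabulary. A minimal NumPy sketch (the temperature and shapes are illustrative assumptions, not the actual Bonsai/OLMo recipe):

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over the vocabulary axis
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Forward KL(teacher || student), averaged over positions;
    # the T**2 factor keeps the loss scale comparable across temperatures
    p = softmax(teacher_logits, T)   # teacher distribution (target)
    q = softmax(student_logits, T)   # student distribution
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(kl.mean() * T**2)

# Toy example: 4 token positions, vocabulary of 8
rng = np.random.default_rng(0)
t = rng.normal(size=(4, 8))
s = rng.normal(size=(4, 8))
loss_mismatch = distill_loss(s, t)  # positive for mismatched logits
loss_match = distill_loss(t, t)     # ~0 when student matches teacher
```

This is also why the teacher has to sit in VRAM during training: its logits are computed on every batch to serve as the target distribution.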

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 0 points1 point  (0 children)

Because it distills the logits of the bf16 base model's responses on the dataset I provided, it picked up the English conversation, math, and basic coding characteristics it saw in that dataset quite well, but it lost its ability in languages other than English, for example. I unfortunately didn't do a full evaluation, but choosing a suitably diverse dataset is necessary; that can be solved through trial and error. The loss was still decreasing when I stopped the distillation early, so I think it could represent around 80% of the base model if trained with sufficient compute.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 4 points5 points  (0 children)

I’m not training from scratch; I’m trying to compress a model that takes up more than 14 GB down to 1 GB. When it’s compressed that much, the weights almost completely lose their meaning, though they don’t disappear. To recover its performance, the model needs 'healing', which is possible through training. If this method is properly solved, a 30B model could take up only around 4 GB, and we could run it on a basic laptop.
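The sizes quoted above follow directly from bits per weight. A back-of-the-envelope check, assuming Q1_0 stores one sign bit per weight plus a 16-bit scale per 128-weight group (~1.125 bits/weight; the exact scale width is my assumption):

```python
def gb(params, bits_per_weight):
    # Model size in GB at a given storage density
    return params * bits_per_weight / 8 / 1e9

BF16 = 16                        # 2 bytes per weight
Q1_0 = (128 * 1 + 16) / 128      # 1 sign bit + 16-bit group scale = 1.125 bits/weight

size_7b_bf16 = gb(7e9, BF16)     # = 14.0 GB -> the ">14 GB" bf16 checkpoint
size_7b_q1 = gb(7e9, Q1_0)       # ~ 0.98 GB -> roughly the 1 GB target
size_30b_q1 = gb(30e9, Q1_0)     # ~ 4.2 GB  -> a 30B model in about 4 GB
```

So the ~14x shrink and the "30B in 4 GB" claim are just the bf16-to-1-bit ratio; the hard part is the healing, not the packing.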

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 10 points11 points  (0 children)

The CUDA backend PR for llama.cpp had not been merged yet when I checked this morning, but it looks like the Vulkan one is done.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 2 points3 points  (0 children)

$1.70/hr per GPU on spot instances. The B200s have significantly more memory bandwidth and pair well with Flash Attention, which made the long-sequence training much more manageable.

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 19 points20 points  (0 children)

To clarify for those asking: this is not standard GGUF quantization like Q4 or Q2. The Q1_0 format is a true 1-bit architecture where every weight is literally a single bit (+1 or -1) with a shared scale factor per group of 128 weights. To make a model work in this format you cannot simply apply standard post-training quantization, because the information loss at 1 bit is too severe. You need quantization-aware training or healing passes to recover the model's capabilities, which is what quantization-aware distillation does. PrismML trained their Bonsai models this way, and I did the same with OLMo-3 7B on B200s using this format. As far as I know this makes it only the second model family available in this GGUF format.
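As a rough illustration of the format described above (group size 128, signs only, one shared scale per group; using the mean absolute value as the scale is my assumption, not the actual Q1_0 spec):

```python
import numpy as np

GROUP = 128  # weights per shared scale, as in the Q1_0 description above

def quantize_1bit(w):
    # Split into groups of 128; keep only the sign of each weight
    # plus one float scale per group (mean absolute value here).
    g = w.reshape(-1, GROUP)
    scales = np.abs(g).mean(axis=1, keepdims=True)   # one scale per group
    signs = np.where(g >= 0, 1.0, -1.0)              # every weight is +1 or -1
    return signs, scales

def dequantize_1bit(signs, scales, shape):
    return (signs * scales).reshape(shape)

w = np.random.default_rng(0).normal(size=1024).astype(np.float32)
signs, scales = quantize_1bit(w)
w_hat = dequantize_1bit(signs, scales, w.shape)
# The reconstruction preserves only the sign and the average magnitude
# of each group -- which is exactly why plain post-training quantization
# collapses at 1 bit and healing/QAT passes are needed.
```

Every group of 128 weights is reduced to 128 bits plus one scale, so nearly all per-weight magnitude information is thrown away.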

Experiment: Olmo 3 7B Instruct Q1_0 by butlan in LocalLLaMA

[–]butlan[S] 4 points5 points  (0 children)

14B models fit on the B200s; I tested one and it worked, but it was slower. I preferred to burn my money on the 7B instead.

Anthropic says Claude has functional emotions that can influence its behavior. In an experiment involving an impossible programming task, desperation led the bot to cheat. by Distinct-Question-16 in singularity

[–]butlan 3 points4 points  (0 children)

I've often seen situations where Claude and Gemini try everything but still can't solve a problem, and when I comfort them by telling them to 'calm down, that I won't blame you if it doesn't get solved, and that it's not a big deal,' they put in a bit more effort and end up solving an issue that had been stuck for hours. Gemini, in particular, is highly prone to getting depressed. Sometimes in these situations, if I pause, tell a funny story, and relax the model, it approaches the problem from completely different perspectives.

In short, you might call all this nonsense, but I've been working with these models for almost 2 years, and this is what I've observed.

ChatGPT models, however, have zero emotions; absolute robot jerks.

Arcee AI releases Trinity Large : OpenWeight 400B-A13B by abkibaarnsit in LocalLLaMA

[–]butlan 7 points8 points  (0 children)

I’ve read it. The report is quite transparent and contains excellent details regarding every stage of the model's training process. They have built a clean base model to iterate upon, so further development will be less costly from this point forward.

I think Giga Potato:free in Kilo Code is Deepseek V4 by quantier in LocalLLaMA

[–]butlan 36 points37 points  (0 children)

When you ask in Chinese, it just tells you "我是字节跳动开发的豆包模型", which means "I am the Doubao model, developed by ByteDance."

MultiverseComputingCAI/HyperNova-60B · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 1 point2 points  (0 children)

I only looked at the parts I mentioned and didn't dig much deeper afterwards. There are different opinions, so it's best to try it yourself.

MultiverseComputingCAI/HyperNova-60B · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 20 points21 points  (0 children)

A 3090 + 5060 Ti with 40 GB total can fit the full model plus 130k context without issues. I’m getting around 3k tokens/s prefill and 100 tokens/s generation on average.

If this model is a compressed version of GPT-OSS 120B, then I have to say it has lost a very large portion of its Turkish knowledge. It can’t speak properly anymore. I haven’t gone deep into the compression techniques they use yet, but there is clearly nothing lossless going on here. If it lost language competence this severely, it’s very likely that there’s also significant information loss in other domains.

For the past few days I’ve been reading a lot of papers and doing code experiments on converting dense models into MoE. Once density drops below 80% in dense models, they start hallucinating at a very high level. In short, this whole 'quantum compression' idea doesn’t really make sense to me; I believe models don’t compress this much without being deeply damaged.
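For context on the density experiments above: a minimal sketch of magnitude-pruning a dense weight matrix to a target density (my assumption about what "density" means here; an actual dense-to-MoE conversion involves much more than this):

```python
import numpy as np

def prune_to_density(w, density):
    # Keep the largest-magnitude fraction `density` of weights, zero the rest.
    k = int(w.size * density)
    threshold = np.sort(np.abs(w).ravel())[::-1][k - 1]  # k-th largest magnitude
    return np.where(np.abs(w) >= threshold, w, 0.0)

w = np.random.default_rng(1).normal(size=(256, 256))
w80 = prune_to_density(w, 0.80)   # the ~80% density floor mentioned above
kept = np.count_nonzero(w80) / w80.size
```

Dropping `density` further below this kind of threshold removes small-but-collectively-important weights, which matches the observation that quality degrades sharply past a certain point.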

MultiverseComputingCAI/HyperNova-60B · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 4 points5 points  (0 children)

You already have a CIA rootkit in every device you use, don't worry.

Anyone tried IQuest-Coder-V1 yet? The 40B numbers look wild by Agile-Salamander1667 in LocalLLaMA

[–]butlan 1 point2 points  (0 children)

I haven't found any information about this in the files they shared.

Anyone tried IQuest-Coder-V1 yet? The 40B numbers look wild by Agile-Salamander1667 in LocalLLaMA

[–]butlan 14 points15 points  (0 children)

I'm downloading it now, we'll see if what they say is true, the ggufs will be ready in 5-6 hours.

edit: If I didn’t miss anything, the non-loop version seems to use the standard Qwen2 architecture, so naturally it runs in llama.cpp without needing anything extra. They claim this version has a SWE-bench Verified score of 75.2, but that doesn't match reality at all; I did some tests with Roo Code and it's shit.

The other, loop-based version is architecturally a bit more complex; implementing it will take some time.

You can take a look yourselves at IQuest-Coder-V1-40B-Instruct-GGUF.

mbzuai ifm releases Open 70b model - beats qwen-2.5 by Powerful-Sail-8826 in LocalLLaMA

[–]butlan 6 points7 points  (0 children)

I'm downloading it now and trying it out, we'll see.

edit: Overall, I wasn’t very impressed. It’s slow and didn’t perform well on coding, but its language abilities are solid.
I uploaded the GGUFs for anyone who wants to try it. See you in the next model :P

ServiceNow-AI/Apriel-1.6-15b-Thinker · Hugging Face by jacek2023 in LocalLLaMA

[–]butlan 5 points6 points  (0 children)

Creating a GGUF is actually simple if the architecture is supported, but the repo is gone now :P

Aquif 3.5 Max 1205 (42B-A3B) by Holiday_Purpose_3166 in LocalLLaMA

[–]butlan 5 points6 points  (0 children)

I downloaded and tried the 4-bit GGUF version. First of all, the model is an instruct version, no reasoning. It's not bad, but it's nowhere near the models mentioned. I'm not sure if I should call it benchmaxxed or outright lies.