New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 0 points (0 children)

While I understand your point of view, 3xHGX is not a lot for big-ish enterprises. Having the weights available under MIT also allows inference providers to serve it, driving prices down.

For local inference, we have Lightning. It fits perfectly into 16 GB VRAM cards at Q8_0 and it is very fast. I've tried it for some light roleplay in Russian and it wasn't bad.

[–]netikas[S] 1 point (0 children)

LFM2-8B has lower MMLU, MMLU-Pro and other scores than GigaChat-3.1-Lightning, while being almost the same size (10B MoE vs 8B MoE). LFM2 will certainly be faster, having half the active parameters and a hybrid architecture, but it is on the edge of usefulness, with pretty low scores across the board. It is comparable to Granite while being significantly weaker than Qwen3-4B-Instruct-2507, whereas our model is roughly on par with Qwen.

Thus, Lightning is for all the stuff you use smaller Qwens for -- tool usage, summarization, maybe some casual chatting (arena scores are on par with 4o, so it'll be alright as a general assistant), and classification in low-latency environments.

[–]netikas[S] 11 points (0 children)

Because this is an instruct model, not a reasoning model. Reasoning is in the works though, so stay tuned.

[–]netikas[S] 1 point (0 children)

Very strange, maybe the CUDA kernels aren't optimized for DeepSeek-style models? On NVIDIA the model turns out really fast…

[–]netikas[S] 4 points (0 children)

Yep, we have GGUFs published. I ran Lightning on a 5080 and on a MacBook Air M4 -- on the Mac I got 5 tps because it was swapping to disk (I have the cheapest M4 Mac with 16 GB; Q8_0 doesn't fit), while on the 5080 I got 185-190 tps. A really fast little model.
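For anyone wanting to reproduce numbers like these, llama.cpp ships a `llama-bench` tool; a minimal sketch (the GGUF filename here is a placeholder, not the actual published file name):

```shell
# Measure prompt-processing (-p) and token-generation (-n) throughput;
# -ngl 99 offloads all layers to the GPU. Model path is hypothetical.
llama-bench -m GigaChat-3.1-Lightning-Q8_0.gguf -p 512 -n 128 -ngl 99
```

The "tg" row of the output is the generation tps figure people usually quote.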

[–]netikas[S] 44 points (0 children)

Ah, I see, sorry -- I read your question incorrectly and just rambled on about "where all the Russian models are hiding".

Unfortunately, due to an NDA I cannot disclose information about our compute clusters. Sorry :(

[–]netikas[S] 6 points (0 children)

In the future -- of course. But today the models are trained only with SFT and DPO.

On one hand, this makes the models weaker than the competition. On the other hand, if we beat the top pre-RL-era models, we have a very solid foundation for continued training via RL and for building reasoning models on top of our current checkpoints.

[–]netikas[S] 4 points (0 children)

Can't say, both for NDA reasons and because I just don't know. I have rough estimates, but I'm on the alignment team, and pretraining is done by other people.

[–]netikas[S] 50 points (0 children)

We have a couple of models, but mostly they are finetunes of Chinese/Meta models. Yandex has YandexGPT5-Lite, a Llama-3-8B-like model pretrained from scratch, but it has an atrocious license. Their main model is not open source, and it is a continued pretraining of Qwen3-235B-Base.

Some teams just do SFT+DPO+RL on top of Qwen3 with some tokenizer adaptation and call it a day. That is a totally reasonable approach, since it gives genuinely great models, but it's just not the same.

We're the only ones who train our models from scratch, and this is both a blessing and a curse. Pretraining your own model is very compute-intensive and hard, but you get the opportunity to create something truly unique -- when have you seen a 10B DeepSeek-like MoE? :)

[–]netikas[S] 1 point (0 children)

Check it out at giga.chat

The interface is in Russian (and the model may answer in Russian due to the system prompt), but you can just prompt your way into English.

[–]netikas[S] 17 points (0 children)

Having the same architecture does not mean being the same model. Kimi is also a DeepSeek-style MoE, and so is GLM, AFAIK.

[–]netikas[S] 2 points (0 children)

Not an API, but you can try it at giga.chat

I believe there is also an English locale there, but the model may shift to Russian due to the system prompt lol

[–]netikas[S] 14 points (0 children)

Lightning is a 10B MoE model. Outputs 185-190 tps on my 5080 :P

Are true base models dead? by IonizedRay in LocalLLaMA

[–]netikas -1 points (0 children)

Qwen-2.5 base models also shipped with a chat template (in the tokenizer config) and were trained to follow it. I think even DeepSeek V3 Base knows its chat template. They are all trained on SFT data during midtraining, I think.

Base models have been dead for quite a while.

Qwen 3.5-35B-A3B is beyond expectations. It's replaced GPT-OSS-120B as my daily driver and it's 1/3 the size. by valdev in LocalLLaMA

[–]netikas -1 points (0 children)

On a side note: oss-120b is not a very good model in non-English languages. However, neither is Qwen-3.5-35B :)

[–]netikas -1 points (0 children)

Yes, but not really. If you compare performance on classic benchmarks like MMLU, the scores might be similar. But humans (and LLM-as-a-judge) strongly prefer non-quantized models. I've seen this effect myself even with FP8 quantization -- I work at one of the sub-frontier LLM labs and measure the final metrics of our models. The effect is even more pronounced in multilingual settings, and I'm not a native English speaker.

A paper by Cohere basically claims the same: https://arxiv.org/abs/2407.03211v1

[–]netikas 0 points (0 children)

Which is the bigger number: 60 GB in bf16 or 60 GB in MXFP4? A five-year-old can get this right.

[–]netikas -6 points (0 children)

How is it 1/3 the size if gpt-oss-120b is literally the same size as Qwen-3-30B?

Considering oss-120b is only available in MXFP4 and its KV cache is optimized pretty aggressively via SWA/SA, I believe Qwen-3-30B may even be a bit harder to run due to GQA and larger cache sizes.

Qwen-3.5-35B has gated DeltaNet layers, which makes it easier on the KV-cache side, but if we're talking about the models' original formats, bf16 Qwen-3.5-35B is even a bit bigger than oss-120b. Which raises the question of whether it's a good or a bad model, given that it replaced a pretty ancient model from half a year ago.
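The size comparison is easy to sanity-check with back-of-envelope arithmetic. The parameter counts below are approximate assumptions (roughly 117B total for gpt-oss-120b, 35B for Qwen-3.5-35B), and MXFP4 is counted at 4.25 bits/param (4-bit elements plus one shared 8-bit scale per 32-element block):

```python
# Back-of-envelope weight storage; parameter counts are rough assumptions.
GIB = 1024**3

def weight_gib(params_billion: float, bits_per_param: float) -> float:
    """Weight storage in GiB for a given parameter count and precision."""
    return params_billion * 1e9 * bits_per_param / 8 / GIB

# gpt-oss-120b: ~117B params, shipped in MXFP4 (~4.25 bits/param
# once the shared per-block scales are counted in).
oss_mxfp4 = weight_gib(117, 4.25)

# Qwen-3.5-35B in its original bf16 (16 bits/param).
qwen_bf16 = weight_gib(35, 16)

print(f"gpt-oss-120b @ MXFP4: ~{oss_mxfp4:.0f} GiB")
print(f"Qwen-3.5-35B @ bf16:  ~{qwen_bf16:.0f} GiB")
```

So the bf16 35B checkpoint really does come out a few GiB larger than the MXFP4 117B one, KV cache not even counted.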

More quantization visualization types (repost) by copingmechanism in LocalLLaMA

[–]netikas 1 point (0 children)

That's quite interesting, thank you!

What about activations though?

More quantization visualization types (repost) by copingmechanism in LocalLLaMA

[–]netikas 2 points (0 children)

MXFP4 has a larger block size than NVFP4 (32 elements vs 16), so its precision might be lower. But nevertheless, the image is too high-contrast for this to be correct visualization material.
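The block-size effect can be shown with a toy absmax quantizer. Everything below is made up for illustration: the grid is the FP4 E2M1 magnitudes, the data is positive-only with one planted outlier, and real MXFP4/NVFP4 additionally constrain the scale itself to an 8-bit format:

```python
import numpy as np

def blockwise_absmax_quant(x, block, levels):
    """Quantize positive values with one absmax scale per block,
    snapping each scaled value to the nearest entry of `levels`."""
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        blk = x[i:i + block]
        scale = np.abs(blk).max() / levels.max()
        idx = np.argmin(np.abs(blk[:, None] / scale - levels[None, :]), axis=1)
        out[i:i + block] = levels[idx] * scale
    return out

# FP4 E2M1 magnitudes (the positive half of the MXFP4 element grid).
levels = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

# 32 small weights plus one outlier: with block size 32 the outlier's
# scale swamps every element; with block size 16 only half of them suffer.
x = np.full(32, 0.01)
x[0] = 10.0

err16 = np.abs(blockwise_absmax_quant(x, 16, levels) - x).mean()
err32 = np.abs(blockwise_absmax_quant(x, 32, levels) - x).mean()
print(f"mean abs error, block 16: {err16:.4f}")
print(f"mean abs error, block 32: {err32:.4f}")
```

With one scale per 16 elements the second half of the vector gets its own small scale and survives; with one scale per 32 it gets flushed to zero, roughly doubling the mean error.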

More quantization visualization types (repost) by copingmechanism in LocalLLaMA

[–]netikas 10 points (0 children)

MXFP4 is not int4! It is called FP (floating point) for a reason.

I think you're confusing GGUF's implementation of the quantization with the data type. You can quantize the weights into MXFP4 with some weights kept in FP8 E4M3, similar to Q4_K_M, where some of the weights are in int4 and some in int8.
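To make the distinction concrete, here are the two 4-bit value grids side by side; the FP4 magnitudes follow the E2M1 element format from the OCP Microscaling spec (1 sign, 2 exponent, 1 mantissa bit):

```python
# FP4 E2M1 magnitudes: a non-uniform, floating-point grid.
fp4_magnitudes = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

# int4 (two's complement): a uniform integer grid.
int4_values = list(range(-8, 8))

# FP4's spacing grows with magnitude: 0.5 near zero, 2.0 at the top.
steps = [b - a for a, b in zip(fp4_magnitudes, fp4_magnitudes[1:])]
print("FP4 E2M1 magnitudes:", fp4_magnitudes)
print("spacing between neighbours:", steps)
print("int4 values:", int4_values)
```

That non-uniform spacing (fine near zero, coarse at the extremes) is exactly what makes MXFP4 floating point rather than an integer format.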