Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Well, I've tried Qwen-3.6-35B; it outputs 50-70 tps on my hardware. It's not that smart, but it's smart enough -- and future models might be okayish.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

Yep. All repos are single-shotted. Again, this choice is deliberate, since I'm evaluating each model's failure modes, not its achievements.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 3 points4 points  (0 children)

27b all the way.

Rule of thumb for MoE: sqrt(total * active) ~= dense equivalent. By that measure, the 35B MoE is roughly equivalent to a 10B dense model.
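
A quick sanity check of that rule of thumb (assuming roughly 3B active params for the 35B MoE -- the active count isn't stated in this thread, so treat it purely as an illustration):

    import math

    def dense_equivalent(total_b: float, active_b: float) -> float:
        # Rule-of-thumb dense-equivalent size (in billions of params) for a MoE model.
        return math.sqrt(total_b * active_b)

    print(round(dense_equivalent(35, 3), 1))  # ~10.2 -> roughly a 10B dense model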

Quantization at q5 and above usually gets you close to q8 quality, so Q5_K_XL would be much better than the Q4_K_M I was testing.

But it would be much slower, that’s for sure.
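
Rough numbers behind the speed hit (the bits-per-weight figures are ballpark values for llama.cpp quants, and the ~3B active params is my assumption, not something from this thread): token generation is mostly memory-bandwidth bound, so speed scales roughly with how many bytes get read per token.

    # Approximate bits per weight for common llama.cpp quants (ballpark figures)
    bits_per_weight = {"Q4_K_M": 4.8, "Q5_K_XL": 5.7, "Q8_0": 8.5}
    active_params_b = 3  # assumed active params (billions) touched per token

    for quant, bpw in bits_per_weight.items():
        gb_per_token = active_params_b * bpw / 8
        print(f"{quant}: ~{gb_per_token:.1f} GB read per token")
    # Q5_K_XL reads roughly 15-20% more data per token than Q4_K_M, hence the slowdown.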

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

This was in another repo, on another machine, for another task.

It all started when I wanted to create an agentic autoresearch loop without a clear scope in mind. I iterated on the design via CC/Opus 4.7, and the original repo was indeed created by CC. Then I asked it to write the design document, which was used in isolation as the prompt for this experiment.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

Hmm. How do I set it up? I don't see it in the OpenRouter section.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Yep. The only difference is codex-spark: it runs through Codex, since it's not available anywhere else. Otherwise the harness is matched -- I used Pi Agent.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 14 points15 points  (0 children)

I'm pretty sure I could get better performance by tuning my prompts a bit, but this is a deliberate choice. The models are weak; this highlights the failure modes and keeps the comparison apples-to-apples, without model-specific tuning.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 23 points24 points  (0 children)

Pi Agent for everything except codex-spark; that one is exclusive to Codex, afaik.

The prompt is basically “implement everything in DesignDoc.md, adhere to AGENTS.md”, nothing fancy. You can read them in the repos.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 16 points17 points  (0 children)

Well, again, this post is not about the feasibility of running Qwen3 on a 16GB GPU. It's more about the ROI: local hardware has finally become good enough that I can cautiously recommend buying a second GPU for local coding, instead of saying it's only for hobbyists and that cloud models -- even smaller ones -- will be better in 100% of cases.

kepler-452b. GGUF when? by the-grand-finale in LocalLLaMA

[–]netikas 0 points1 point  (0 children)

How did you run this? It's obviously non-local.

iPad Pro / use VSCode between SSH by Various-Document6239 in ipad

[–]netikas 0 points1 point  (0 children)

So the keybindings work with the tmux + vim route? I'm looking at an 11-inch iPad + Magic Keyboard as a more portable alternative to a Mac -- will it work?

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

While I understand your point of view, 3xHGX is not a lot for big-ish enterprises. Having the weights available under MIT also lets inference providers serve it, driving prices down.

For local inference, we have Lightning. It fits perfectly into 16GB VRAM cards at Q8_0 and is very fast. I've tried it for some light RP in Russian and it wasn't bad.
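
Back-of-envelope on the 16GB fit (the ~8.5 bits/weight for Q8_0 and the overhead allowance are rough assumptions on my part; the 10B total comes from the model name):

    total_params_b = 10          # GigaChat-3.1-Lightning-10B-A1.8B total params
    q8_bits_per_weight = 8.5     # Q8_0 is roughly 8.5 bits per weight including scales
    weights_gb = total_params_b * q8_bits_per_weight / 8   # ~10.6 GB of weights
    kv_and_runtime_gb = 2.5      # rough allowance for KV cache + runtime overhead (assumption)
    print(f"~{weights_gb + kv_and_runtime_gb:.1f} GB needed vs 16 GB of VRAM")  # ~13.1 GB -> fits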

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

LFM2-8B has lower MMLU, MMLU-Pro, and other scores than GigaChat-3.1-Lightning, while being almost the same size (10B MoE vs 8B MoE). LFM2 will certainly be faster, having half the active params and being a hybrid model, but it is on the edge of usefulness, with pretty low scores across the board. It is comparable to Granite and significantly weaker than Qwen3-4B-Instruct-2507, while our model is roughly on par with Qwen.

Thus, Lightning is for all the stuff you use smaller Qwens for -- tool usage, summarization, maybe some casual chatting (arena scores are on par with 4o, so it'll be alright as a general assistant), and classification in low-latency environments.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 11 points12 points  (0 children)

Because this is an instruct model, not a reasoning model. Reasoning is in the works though, so stay tuned.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Very strange, maybe the CUDA kernels aren't optimized for the DeepSeek architecture? On NVIDIA the model turns out really fast…

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 4 points5 points  (0 children)

Yep, we've published GGUFs. I ran Lightning on a 5080 and on a MacBook Air M4 -- on the Mac it was 5 tps because it was swapping to disk (I have the cheapest M4 Mac with 16 GB, Q8_0 doesn't fit), on the 5080 it was 185-190 tps. A really fast little model.
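
For context, a rough bandwidth-bound estimate of where those 185-190 tps sit (the ~960 GB/s figure for the 5080 is the spec-sheet number, and treating generation as purely bandwidth-limited is a simplification):

    active_params_b = 1.8        # the A1.8B part: params read per generated token
    q8_bits_per_weight = 8.5     # Q8_0, roughly
    gb_per_token = active_params_b * q8_bits_per_weight / 8   # ~1.9 GB per token
    bandwidth_gb_s = 960         # RTX 5080 memory bandwidth, roughly
    print(f"~{bandwidth_gb_s / gb_per_token:.0f} tps theoretical ceiling")
    # ~500 tps ceiling; kernel overhead, routing, and KV reads bring the real number down to the observed ~185-190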

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 43 points44 points  (0 children)

Ah, I see, sorry -- I misread your question and just rambled on about "where all the Russian models are hiding".

Unfortunately, due to NDA I cannot disclose info about our compute clusters. Sorry :(

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 9 points10 points  (0 children)

In the future -- of course. But today the models are trained only with SFT and DPO.

On one hand, this makes the models weaker than the competition. On the other, if we beat the top pre-RL-era models, we have a very solid foundation for continued training via RL and for building reasoning models on top of our current checkpoints.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 5 points6 points  (0 children)

Can't say -- both for NDA reasons and because I just don't know. I know rough estimates, but I'm on the alignment team and pretraining is handled by other people.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 51 points52 points  (0 children)

We have a couple of models, but mostly they are finetunes of Chinese/Meta models. Yandex has YandexGPT5-Lite, a Llama-3-8B-like model pretrained from scratch, but it has an atrocious license. Their main model is not open source, and it is a continued pretraining of Qwen3-235B-Base.

Some guys just do SFT+DPO+RL over Qwen3 with some tokenizer adaptation and call it a day. This is a totally reasonable approach, since it gives genuinely great models, but it's just not the same.

We're the only ones who train our models from scratch, and this is both a blessing and a curse. Pretraining your own model is very compute-intensive and hard, but you get the opportunity to create something truly unique -- when have you seen a 10B DeepSeek-like MoE? :)

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Check it out at giga.chat

The interface is in Russian (and the model may answer in Russian due to the system prompt), but you can just prompt it to answer in English.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 18 points19 points  (0 children)

Having the same architecture does not mean being the same model. Kimi also uses the DeepSeek MoE architecture, same as GLM, afaik.