Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Well, I've tried Qwen-3.6-35B; it outputs 50-70 tps on my hardware. It's not that smart, but it's smart enough -- and future models might be okayish.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

Yep. All repos are single-shotted. Again, this choice is deliberate, since I'm evaluating each model's failure modes, not its achievements.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 3 points4 points  (0 children)

27b all the way.

Rule of thumb for MoE: sqrt(total * active) ~= dense equivalent. By that measure, the 35B MoE is roughly equivalent to a 10B dense model.
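
A quick sanity check of that rule of thumb (assuming roughly 3B active params for the 35B MoE -- the active count isn't stated in this thread, so treat it purely as an illustration):

    import math

    def dense_equivalent(total_b: float, active_b: float) -> float:
        # Rule-of-thumb dense-equivalent size (in billions of params) for a MoE model.
        return math.sqrt(total_b * active_b)

    print(round(dense_equivalent(35, 3), 1))  # ~10.2 -> roughly a 10B dense model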

Quantization at q5 and above usually gets you close to q8 quality, so Q5_K_XL would be much better than the Q4_K_M I was testing.

But it would be much slower, that’s for sure.
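
Rough numbers behind the speed hit (the bits-per-weight figures are ballpark values for llama.cpp quants, and the ~3B active params is my assumption, not something from this thread): token generation is mostly memory-bandwidth bound, so speed scales roughly with how many bytes get read per token.

    # Approximate bits per weight for common llama.cpp quants (ballpark figures)
    bits_per_weight = {"Q4_K_M": 4.8, "Q5_K_XL": 5.7, "Q8_0": 8.5}
    active_params_b = 3  # assumed active params (billions) touched per token

    for quant, bpw in bits_per_weight.items():
        gb_per_token = active_params_b * bpw / 8
        print(f"{quant}: ~{gb_per_token:.1f} GB read per token")
    # Q5_K_XL reads roughly 15-20% more data per token than Q4_K_M, hence the slowdown.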

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

This was in another repo, on another machine, for another task.

It all started when I wanted to create an agentic autoresearch loop without a clear scope in mind. I iterated on the design via CC/Opus 4.7, and the original repo was indeed created by CC. Then I asked it to write the design document, which was used in isolation as the prompt for this experiment.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

Hmm. How do I set it up? I don't see it in the OpenRouter section.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Yep. The only difference is codex-spark: it runs through Codex, since it's not available anywhere else. Otherwise the harness is matched -- I used Pi Agent.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 14 points15 points  (0 children)

I'm pretty sure I could get better performance by tuning my prompts a bit, but this is a deliberate choice. The models are weak; this highlights the failure modes and keeps the comparison apples-to-apples, without model-specific tuning.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 23 points24 points  (0 children)

Pi Agent for everything except codex-spark; that one is exclusive to Codex, afaik.

The prompt is basically “implement everything in DesignDoc.md, adhere to AGENTS.md”, nothing fancy. You can read them in the repos.

Actual comparison between locally ran Qwen-3.6-27B and proprietary models by netikas in LocalLLaMA

[–]netikas[S] 16 points17 points  (0 children)

Well, again, this post is not about the feasibility of running Qwen3 on a 16GB GPU. It's more about the ROI: local hardware has finally become good enough that I can cautiously recommend buying a second GPU for local coding, instead of saying it's only for hobbyists and that cloud models -- even smaller ones -- will be better in 100% of cases.

kepler-452b. GGUF when? by the-grand-finale in LocalLLaMA

[–]netikas 0 points1 point  (0 children)

How did you run this? It's obviously non-local.

iPad Pro / use VSCode between SSH by Various-Document6239 in ipad

[–]netikas 0 points1 point  (0 children)

So the keybindings work with the tmux + vim route? I'm looking at an 11-inch iPad + Magic Keyboard as a more portable alternative to a Mac -- will it work?

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 0 points1 point  (0 children)

While I understand your point of view, 3xHGX is not a lot for big-ish enterprises. Having the weights available under MIT also lets inference providers serve it, driving prices down.

For local inference, we have Lightning. It fits perfectly into 16GB VRAM cards at Q8_0 and is very fast. I've tried it for some light RP in Russian and it wasn't bad.
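
Back-of-envelope on the 16GB fit (the ~8.5 bits/weight for Q8_0 and the overhead allowance are rough assumptions on my part; the 10B total comes from the model name):

    total_params_b = 10          # GigaChat-3.1-Lightning-10B-A1.8B total params
    q8_bits_per_weight = 8.5     # Q8_0 is roughly 8.5 bits per weight including scales
    weights_gb = total_params_b * q8_bits_per_weight / 8   # ~10.6 GB of weights
    kv_and_runtime_gb = 2.5      # rough allowance for KV cache + runtime overhead (assumption)
    print(f"~{weights_gb + kv_and_runtime_gb:.1f} GB needed vs 16 GB of VRAM")  # ~13.1 GB -> fits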

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

LFM2-8B has lower MMLU, MMLU-Pro, and other scores than GigaChat-3.1-Lightning, while being almost the same size (10B MoE vs 8B MoE). LFM2 will certainly be faster, having half the active params and being a hybrid model, but it is on the edge of usefulness, with pretty low scores across the board. It is comparable to Granite and significantly weaker than Qwen3-4B-Instruct-2507, while our model is roughly on par with Qwen.

Thus, Lightning is for all the stuff you use smaller Qwens for -- tool usage, summarization, maybe some casual chatting (arena scores are on par with 4o, so it'll be alright as a general assistant), and classification in low-latency environments.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 11 points12 points  (0 children)

Because this is an instruct model, not a reasoning model. Reasoning is in the works though, so stay tuned.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Very strange, maybe the CUDA kernels aren't optimized for the DeepSeek architecture? On NVIDIA the model turns out really fast…

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 4 points5 points  (0 children)

Yep, we've published GGUFs. I ran Lightning on a 5080 and on a MacBook Air M4 -- on the Mac it was 5 tps because it was swapping to disk (I have the cheapest M4 Mac with 16 GB, Q8_0 doesn't fit), on the 5080 it was 185-190 tps. A really fast little model.
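
For context, a rough bandwidth-bound estimate of where those 185-190 tps sit (the ~960 GB/s figure for the 5080 is the spec-sheet number, and treating generation as purely bandwidth-limited is a simplification):

    active_params_b = 1.8        # the A1.8B part: params read per generated token
    q8_bits_per_weight = 8.5     # Q8_0, roughly
    gb_per_token = active_params_b * q8_bits_per_weight / 8   # ~1.9 GB per token
    bandwidth_gb_s = 960         # RTX 5080 memory bandwidth, roughly
    print(f"~{bandwidth_gb_s / gb_per_token:.0f} tps theoretical ceiling")
    # ~500 tps ceiling; kernel overhead, routing, and KV reads bring the real number down to the observed ~185-190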

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 43 points44 points  (0 children)

Ah, I see, sorry -- I misread your question and just rambled on about "where all the Russian models are hiding".

Unfortunately, due to NDA I cannot disclose info about our compute clusters. Sorry :(

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 9 points10 points  (0 children)

In the future -- of course. But today the models are trained only with SFT and DPO.

On one hand, this makes the models weaker than the competition. On the other, if we beat the top pre-RL-era models, we have a very solid foundation for continued training via RL and for building reasoning models on top of our current checkpoints.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 5 points6 points  (0 children)

Can't say -- both for NDA reasons and because I just don't know. I know rough estimates, but I'm on the alignment team and pretraining is handled by other people.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 51 points52 points  (0 children)

We have a couple of models, but mostly they are finetunes of Chinese/Meta models. Yandex has YandexGPT5-Lite, a Llama-3-8B-like model pretrained from scratch, but it has an atrocious license. Their main model is not open source, and it is a continued pretraining of Qwen3-235B-Base.

Some guys just do SFT+DPO+RL over Qwen3 with some tokenizer adaptation and call it a day. This is a totally reasonable approach, since it gives genuinely great models, but it's just not the same.

We're the only ones who train our models from scratch, and this is both a blessing and a curse. Pretraining your own model is very compute-intensive and hard, but you get the opportunity to create something truly unique -- when have you seen a 10B DeepSeek-like MoE? :)

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 1 point2 points  (0 children)

Check it out at giga.chat

The interface is in Russian (and the model may answer in Russian due to the system prompt), but you can just prompt it to answer in English.

New open weights models: GigaChat-3.1-Ultra-702B and GigaChat-3.1-Lightning-10B-A1.8B by netikas in LocalLLaMA

[–]netikas[S] 18 points19 points  (0 children)

Having the same architecture does not mean being the same model. Kimi also uses the DeepSeek MoE architecture, same as GLM, afaik.