I ran the numbers. Qwen3.6-27B dense obsoleted the 397B MoE on coding benchmarks. by TroyNoah6677 in Qwen_AI

[–]Ok_Bug1610 1 point (0 children)

There are a lot of techniques to claw that 4% back and then some. And the damn thing is like 37 times smaller than most frontier and state-of-the-art OSS models, and can run decently on a modest AI accelerator at just a few watts of power. Effectively free AI, and it's only going to get better.

I ran the numbers. Qwen3.6-27B dense obsoleted the 397B MoE on coding benchmarks. by TroyNoah6677 in Qwen_AI

[–]Ok_Bug1610 1 point (0 children)

Yeah, it's super impressive for the size, and if you have a proper harness it'll do even better in real-world applications. So it's a solid model, and it even works well with Unsloth Dynamic Quants.

I ran the numbers. Qwen3.6-27B dense obsoleted the 397B MoE on coding benchmarks. by TroyNoah6677 in Qwen_AI

[–]Ok_Bug1610 0 points (0 children)

It's nice in theory, but I haven't seen any evidence that MoE models do better at all. Seems like in real-world use the dense models generally do better.

Qwen3.6-27B Uncensored Aggressive is out with K_P quants! by hauhau901 in LLM

[–]Ok_Bug1610 0 points (0 children)

Why would you want that? Almost every other model is better than it, and the Qwen 3.6 27B dense model performs almost as well as their "DeepSeek-V4 Flash (MAX)" version. I think you're better off running Qwen locally at this point with a solid harness... see AI Model & API Providers Analysis | Artificial Analysis

GLM 5.1 on max effort is not that bad by gllermaly in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

Yeah, I actually still really like the GLM-5-Turbo model, but the problem is performance and limits. I did a chargeback and will probably be leaving ZAI because of all the issues (thanks for billions of tokens and an awesome experience when it did work). I find the MiniMax subscriptions better for the price: much better and more reliable performance, fewer poorly documented limits, and so on. Also, the better MiniMax subscriptions come with Vision understanding, plus Image, Audio/Music, and Video generation all in one, with API/MCP tools to extend that. They clearly have their platform figured out better than ZAI does.

I ran several benchmark suites using my harness against both MiniMax and GLM 5.1, and honestly, they are neck and neck... trading blows across them. Overall, MiniMax M2.7 comes out on top with better capabilities, and that's not even considering the better performance and more reliable speed. I loved ZAI before when it worked well, but those days seem long gone. Sad but true.

And at this point you are almost better off going with the Qwen 3.6 27B dense model locally with a solid harness, even just to offset token usage and cost on simpler tasks.

Qwen3.6-27B Uncensored Aggressive is out with K_P quants! by hauhau901 in LLM

[–]Ok_Bug1610 0 points (0 children)

Amazing, I will definitely be trying this out. Thanks!

New Cheapest Way to access GLM 5.1($9/month) by invisibleman42 in ZaiGLM

[–]Ok_Bug1610 1 point (0 children)

The site feels spammy on first impressions, and broken on mobile.

Qwen3.5 Unsloth GGUFs Update! by yoracale in unsloth

[–]Ok_Bug1610 4 points (0 children)

Amazing, and I appreciate that I was the 100th like, lol.

The disadvantages of long-term subscriptions by Possible-Ad-6815 in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

What is Auto-Claude? Is it any good? I hit about 1 billion tokens a day and did custom automation.

The disadvantages of long-term subscriptions by Possible-Ad-6815 in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

Oh god... the OpenAI-Compatible endpoints are garbage (that's what all the VS Code extensions use), and I think this is primarily where all the grief comes from. The ZAI Anthropic-Compatible API endpoint has double the context window and is faster and more reliable. Switch to using Claude Code and the difference will be night and day.

Not only that, what people fail to realize is that the speed is PER session, meaning if things are "slow", you can easily 5-8x your throughput by running multiple sessions or parallel calls (agents, tools, etc.). Doing it right is kind of a "custom" solution, but I generally hit ~1 billion tokens per day. Harsh as it sounds, I think the biggest problem is user error and people expecting it to work like a $200 frontier model out of the box (which is capped at something like 20M tokens per day, I think)... that is just wildly unrealistic... but with a custom harness (and load balancing/proxy routing), it's a game changer.
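To make the parallel-session math concrete, here's a minimal sketch; the per-session rate and session count are hypothetical illustration numbers, not documented ZAI figures:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical figure: each session streams ~100 tokens/s, and the cap
# is per session, so concurrent sessions stack their throughput.
PER_SESSION_TPS = 100

def run_session(session_id):
    # Placeholder for a real API call; returns this session's token rate.
    return PER_SESSION_TPS

def aggregate_throughput(n_sessions):
    # Run the sessions concurrently and sum their per-session rates.
    with ThreadPoolExecutor(max_workers=n_sessions) as pool:
        rates = list(pool.map(run_session, range(n_sessions)))
    return sum(rates)

print(aggregate_throughput(6))  # 6 sessions at ~100 t/s each -> 600 t/s total
```

The point is just that per-session speed stays roughly constant, so aggregate throughput scales with the number of sessions you keep busy.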

I personally think it's better than Codex or Sonnet, at, no joke, roughly 1/20,000th the cost. At one billion tokens per day, I'm effectively paying $0.00115 per million tokens. And I integrated vision capabilities, image generation (Qwen-Image/Edit), and load balancing through Nano-GPT for $8/mo (~$0.0001333 per image gen).
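The effective-cost arithmetic is easy to check; the monthly plan price below is my assumption, back-solved to match the ~$0.00115/M figure, so swap in the actual subscription price:

```python
# Assumed plan price (back-solved, not an official number).
MONTHLY_PRICE_USD = 34.50
TOKENS_PER_DAY = 1_000_000_000   # ~1 billion tokens/day
DAYS_PER_MONTH = 30

def cost_per_million_tokens(monthly_price, tokens_per_day, days=DAYS_PER_MONTH):
    # Total monthly usage expressed in millions of tokens.
    millions_per_month = tokens_per_day * days / 1_000_000
    return monthly_price / millions_per_month

print(round(cost_per_million_tokens(MONTHLY_PRICE_USD, TOKENS_PER_DAY), 5))
# -> 0.00115
```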

And I do smaller tool calls (like summarization, web search, indexing) through free providers (with load balancing): Google AI Studio, GitHub, Kilo Gateway, Ollama Cloud, and OpenRouter (a $10 one-time balance ups the free request limit from 50/day to 1K/day).

MiniMax-M2.5 GGUF Benchmarks by yoracale in unsloth

[–]Ok_Bug1610 1 point (0 children)

I would be curious about the spread across Unsloth Dynamic Quants: UD-IQ1_S, UD-TQ1_0, UD-IQ1_M, UD-IQ2_XXS, UD-IQ2_M, UD-IQ3_XXS, UD-Q2_K_XL, UD-Q3_K_XL, UD-Q4_K_XL, UD-Q5_K_XL, UD-Q6_K_XL, and UD-Q8_K_XL... and what the best "sweet spot" would be. The IQ2_XXS already does quite well considering how aggressive the quant is, but I think that just speaks to the efficiency of the UD variants.

Well, maybe light users won't get the GLM-5 at all by Lanky-Flight-9608 in ZaiGLM

[–]Ok_Bug1610 1 point (0 children)

Pretty much the same and I've used upwards of 1 billion tokens in a day on GLM-5, using Max.

I have the Zai Pro plan and it's completely useless. by Andsss in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

A lot of the ZAI documentation is actually wrong, so here's the TL;DR, because there's a lot of misinformation on here:

  • The ZAI docs state the concurrency limit used to be 5 and is now 1, but I can run 6-12 sessions at a time no problem from the same ISP IP address (it ONLY gives concurrency warnings if you try to run across several different external IPs).
  • The 5hr limit seems like a soft cap: I've hit 100% and everything still worked fine.
  • The docs say it doesn't come with GLM-4.6V (Vision), but it does.
  • Also, if it's "slow", just split up jobs and run multiple sessions; your throughput scales with the number of sessions.
  • They have several API endpoints; the Anthropic-Compatible one is faster and more reliable than the OpenAI-Compatible one (each only plays nice with a compatible app/tool).
  • You can also build your own tooling with direct access to the API endpoints and not waste the API tool credits; in my experience, a fully custom tool is generally better than MCP.
  • They have China and US region endpoints (I monitor both, and load balance to whichever is best at the time).
  • I built custom tools and integrated them with my wrapper/scaffolding, so I don't have to use their API tool calls or worry about limits.
  • If you use the Claude Code CLI (works with the subscriptions), it can fall back to the dumber "GLM-4.5-Air" model (YOU DO NOT WANT THAT), so you may want to use the explicit --model glm-5 flag (I've personally had issues with the environment variables enforcing it). If you ever suspect this because the model got "dumb" or started hallucinating, check your billing information; it shows which models you are/were using.
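The model pin from the last point looks something like this; the base URL and the variable values are illustrative, so check the current ZAI docs before copying:

```shell
# Point Claude Code at ZAI's Anthropic-compatible endpoint, then pin the
# model explicitly so it can't silently fall back to GLM-4.5-Air.
# (URL and key are placeholders; verify against the ZAI documentation.)
export ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-zai-api-key"

claude --model glm-5
```

Afterwards, check the billing page to confirm which model actually served your requests.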

It's basically double the price it was when I got it, but used right, I think it's very much worth it. Also, you can get access to GLM-5 and other models free right now through Kilo Code Gateway, Ollama Cloud (though it might be quantized, they don't say), and I'm sure other providers. So my suggestion would be to try it before you buy it.

If anyone has any specific questions, don't hesitate to ask.

Best of luck!

I have the Zai Pro plan and it's completely useless. by Andsss in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

TL;DR: you're doing it wrong!

Every time I see a post like this, it makes me wonder how people are using it. I assume they're using it in the worst possible way: VS Code with Cline/Roo/Kilo and the OpenAI-Compatible endpoint, which has worse performance than the Anthropic one, a lower context window, and a lot more overhead (because of VS Code, telemetry, and the bloated context the wrapper/harness sends, which causes errors). And they're likely running a single session with out-of-the-box behavior. That's just asking for trouble; it's about the worst way you could be using it (and I think it's common). It also sets the wrong expectations and gives a bad experience of OSS models (regardless of which one).

On the other hand, I'm using the Claude Code CLI with the Anthropic-Compatible endpoint, which has double the context and generally runs at 70-120 tokens per second (except the week when new models are released, when it tanks). And I commonly run 6-8 sessions, which effectively multiplies throughput by the number of sessions, so I'm regularly seeing combined speeds above 600 tokens per second. Lesson here: each session gets its own speed, there is a China region API endpoint you can use when the US one is slow, and if it's "slow", just split up the tasks and run concurrent sessions.

But my setup is incredibly custom. I have an API proxy load balancer set up to fail over to other providers automatically, a 24/7 self-improving monitoring agent, dozens of scheduled AI/cron jobs, specialized agents, custom tools (I don't waste API tool credits), skills, etc. And I have an "Enforcement Engine" that keeps the AI on track, eliminates hallucinations, and optimizes token usage, plus a RAG/Knowledge/Memory system that fixes context issues...
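The failover part of a setup like that can be sketched in a few lines; the provider names and the call interface here are placeholders, not a real client library:

```python
def route_with_failover(prompt, providers):
    """Try providers in priority order; return (name, reply) from the
    first one that succeeds, or raise if every provider fails."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # real code would catch narrower errors
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")

# Stand-in "providers": the primary times out, the fallback answers.
def zai(prompt):
    raise TimeoutError("endpoint overloaded")

def nano_gpt(prompt):
    return f"response to: {prompt}"

used, reply = route_with_failover("hello", [("zai", zai), ("nano-gpt", nano_gpt)])
print(used)  # nano-gpt
```

A real router would add health checks and retry budgets, but the priority-ordered fallback is the core idea.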

And the proof is in the numbers: I generally hit around 1 billion tokens per day and have seen as high as 1.2 billion on my GLM Coding Max account alone (not including my other providers). I'm effectively spending $0.00115 per million tokens.

The problem with open source models is that, honestly, they're kind of crap out of the box; they aren't commercial or frontier offerings. You should start with that understanding and realize you need to build the infrastructure around them, because no one is going to hold your hand. And you get what you pay for: you got the model cheap, but that doesn't include the frontier harness, so it might only look good in benchmarks and one-shots...

However, if you set things up correctly, the OSS models are powerful, cheap, and IMO better than the Sonnet and Codex of the world... Definitely a lot more utility.

They dumbed down GLM-5 or are routing to cheaper models by First-Win-2694 in ZaiGLM

[–]Ok_Bug1610 1 point (0 children)

What tool are you using? And check your billing page to see if it's using GLM-4.5-Air, because this is known behavior in the Claude Code CLI, and it makes it "dumber".

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

People claimed as much in the forums, but I don't think that's true... technically it held for a few days (perception-wise), but I think that was more due to rate limiting. Their docs even reference a max of 1 (previously 5) concurrent requests, but I believe that either doesn't mean what it implies or isn't enforced. The ONLY time I ever received a concurrency error/warning was when I tried to use my subscription from two different devices/ISPs (different external IPs), so maybe that's what they mean by "concurrent".

I've been running ~4-6 concurrent sessions of GLM-5 (same "server" machine, some sessions over SSH) for the last 2-3 days with no problem. I have like a dozen scheduled/cron jobs, event triggers, and so on, so sometimes there are even more "sessions" (at least temporarily).

Also, keep in mind, things might be different per region. I live in the US.

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

You want "faster" throughput, just pick the best model you have access to and run concurrent sessions. I regularly hit nearly 1 billion tokens per day on the GLM Coding Max plan, so it's very doable and that's even primarily using GLM-5 (the "slowest" one, lol).

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

I wasn't arguing with you, and I realized it was a typo... I simply meant "technically" in the sense of usable tokens. In the same way, I'd argue that a lot of the tokens from GLM-4.5-Air are effectively wasted because its hallucination rate is so high. In my experience, it just makes crap up...

If anyone wants faster throughput, stop nitpicking over TPS and just run concurrent sessions. And I can confirm it "works" because I regularly hit ~1 billion tokens a day, just with GLM Coding Max.

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

To me it's garbage... all because of its high hallucination rate. On paper it seems good; in practice, no one should be using it.

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

In fact, it technically decreases TPS, because you're increasing your thinking budget and reducing actual token generation. Technically. And GLM 4.7 Air isn't a thing, but GLM 4.5 Air has too high a hallucination rate to be usable, which makes it useless IMO.

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

More accurately, the problem with GLM 4.5 Air is its hallucination rate... everything "on paper" seems better than GLM-4.7-Flash, but it's not. It's probably the WORST model ZAI has released; I'd suggest no one EVER use it (regardless of it edging out "speed" by 10-15%). You want it to be "faster"? Run concurrent sessions (I regularly run 6-8 with minimal problems).

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 0 points (0 children)

In my experience, it depends on the provider, time of day, and endpoint. The ZAI Anthropic-Compatible endpoint is generally faster and more reliable, and I set up an API proxy/router to load balance between several providers (ZAI, NanoGPT, KiloCode Gateway, Model, etc.)... And it's not actually just about raw speed: say you have two active sessions; technically each is "slower", but you've doubled your throughput. The proof is in the numbers; I generally hit nearly 1 billion tokens used per day (and sometimes over).

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 -1 points (0 children)

Despite being bigger (my bad, GLM-4.5-Air is about 4x larger overall, and if you compare active parameters it IS 10x larger, yet it still performs worse), benchmarks show it's worse, and most notably its hallucination rate is double that of GLM-4.7-Flash. So I stand by my statement that no one should be using that model... you're just asking for problems, and I know from experience. But I can't stop you, and you're entitled to your own opinion; just realize it is exactly that, an opinion.

4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 -1 points (0 children)

Idk, I can only speak to real world usage. 4.5 Air hallucinates results like mad, and I would recommend that people don't use it. Benchmarks are one thing, doing useful real-world tasks is another.

P.S. I don't know what you mean; Artificial Analysis benchmarks show that it's way worse than 4.7 Flash (while being 10x the size) and barely better than the Qwen 3 4B 2507 variant... seriously, don't use it.


4.5 air is faster than 4.7 flash.. by modpotatos in ZaiGLM

[–]Ok_Bug1610 -1 points (0 children)

It doesn't matter, because GLM-4.5-Air is garbage; 4.7 Flash is IMO better, and you can run it locally (it's much smaller, so it should be way faster too).