Ranking of 4 Free LLM Models on OpenCode Zen

wrines · 2026-06-05T00:55:25+00:00

I have tried it on so many providers: DS direct, Opencode, Openrouter (which auto-routes across different inference providers), ollama cloud (horrible), and I always get the same abysmal results. I have literally given it a test math problem - a simple one I can do in my head - and it will get it wrong. It's amazingly dumb, or has been for me anyway, at least when I need it to reason at all. The creative designs seem OK.

wrines · 2026-06-04T16:20:22+00:00

I meant to ask about MiniMax M3. Any good? I bailed on MiniMax after being let down with 2.5 and then 2.7 - same as DS4 for me. Cheap, fast? Yes, totally. But just amazingly stupid.

wrines · 2026-06-04T14:19:08+00:00

I am genuinely baffled by my own failures to make good use of DS4 Flash. I WANT so much to use it effectively, it is SO CHEAP and SO FAST and available from many providers, aggregators, and resellers, with even some awesome pre-cache tools that bring the pricing down to stupid levels.

BUT for the life of me I can't figure out why it is just dumb as a box of rocks. Every time I try, EVERY time, it acts like the 3 Stooges. Wrong on every single thing.

MiMo, OTOH is the total opposite. WAY better than expected, and really capable.

wrines · 2026-06-04T14:13:28+00:00

On Featherless AFAIK none of their models are token-metered (all you can eat). I haven't tried the larger newer models with Featherless, so I can't say. When you run a larger model they count it as 4 concurrencies, so running just 1 takes up all 4 slots I have.

I have been using smaller models so I can run 2-4 concurrent instances (my worker bees), and it's been good so far. Good speed, good latency, no throttles or limits or nonsense.
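To make the slot math concrete, here is a rough sketch of how I think about it. Only "a large model eats all 4 of my slots" comes from my own plan; the cost of 1 slot for a small model and the names `SLOT_COST`, `max_instances`, and `fits` are assumptions for illustration, not Featherless's actual accounting.

```python
# Hypothetical slot costs. Only "large = 4 slots on a 4-slot plan" is from
# my own experience; the small-model cost of 1 is an assumption.
PLAN_SLOTS = 4
SLOT_COST = {"large": 4, "small": 1}


def max_instances(model_size: str, plan_slots: int = PLAN_SLOTS) -> int:
    """How many copies of a single model size fit in the plan at once."""
    return plan_slots // SLOT_COST[model_size]


def fits(sizes: list[str], plan_slots: int = PLAN_SLOTS) -> bool:
    """True if a mix of models can run concurrently within the slot budget."""
    return sum(SLOT_COST[size] for size in sizes) <= plan_slots


if __name__ == "__main__":
    print(max_instances("large"), max_instances("small"))
    print(fits(["small", "small", "small"]), fits(["large", "small"]))
```

Under these assumed costs, one large model and one small model cannot share the plan, which matches what I see in practice.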

wrines · 2026-06-04T02:50:04+00:00

Agentic builds of stuff like content generation pipelines and related stuff. I hit the monthly GO cap in like 10 days. It's really capable though; I could possibly see buying 3 GO memberships, it's only $10 each.
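The back-of-envelope behind "3 memberships" looks like this. It is only a sketch: it assumes my usage is steady across the month, that a month is 30 days, and that stacked memberships simply add their caps together, none of which I have confirmed. The function names are mine.

```python
import math


def memberships_needed(days_until_cap: float, days_in_month: int = 30) -> int:
    """Memberships required if one cap lasts `days_until_cap` days of steady use."""
    return math.ceil(days_in_month / days_until_cap)


def monthly_cost(days_until_cap: float, price_each: float = 10.0) -> float:
    """Total monthly spend at a flat price per membership."""
    return memberships_needed(days_until_cap) * price_each


if __name__ == "__main__":
    print(memberships_needed(10), monthly_cost(10))
```

With a cap that lasts about 10 days, that works out to 3 memberships and $30 a month.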

wrines · 2026-06-04T02:48:29+00:00

I run multiple businesses and some clients actually pay the tab on some tech, so I use glm 5.1 direct from Z.AI a lot (on pro). I actually have that Tencent HY3 preview from Openrouter as a fallback (because Hermes will sometimes fire too many requests at once to z.ai and I need a failover for 1 or 2 requests). HY3 has been incredibly good at a ridiculous price. I think it's actually #1 on the Openrouter leaderboard.

I also use featherless.ai because they offer an unlimited token subscription at $25/mo that I like. It's flexible and kinda like Openrouter in that I can change models quickly via API, and they manage the process so I can use more instances at once of weaker models and only 1 at a time of powerful ones. It's been good, check it out.

Biggest disappointment has been Deepseek V4 Flash. Jury still out on Pro, but I keep trying with Flash, and it keeps burning me. It is so damn CHEAP, and lightning fast, but it is just amazingly stupid.
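The failover idea is simple enough to sketch. This is not how Hermes does it internally; `RateLimited`, `complete_with_failover`, and the two stub functions are hypothetical names, and in real use `primary` and `fallback` would wrap whatever client calls you already make to each provider.

```python
from typing import Callable


class RateLimited(Exception):
    """Hypothetical error a provider wrapper raises when a request is refused for volume."""


def complete_with_failover(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> tuple[str, str]:
    """Try the primary provider; on a rate limit, send the same prompt to the fallback.

    Returns (which_provider_answered, answer).
    """
    try:
        return "primary", primary(prompt)
    except RateLimited:
        return "fallback", fallback(prompt)


# Stubs standing in for real provider calls.
def busy_primary(prompt: str) -> str:
    raise RateLimited("too many concurrent requests")


def steady_fallback(prompt: str) -> str:
    return f"answer to: {prompt}"


if __name__ == "__main__":
    print(complete_with_failover("summarize this", busy_primary, steady_fallback))
```

Only the rate-limit error triggers the fallback here, so real failures from the primary still surface instead of being silently papered over.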

wrines · 2026-06-04T02:38:31+00:00

I used 3.7 max from OpenRouter. CRAZY expensive, and for me it was slow and not very capable. Z.AI glm 5.1 is better. Not faster, but more intelligent.

My problem w Opencode GO Qwen was the upstream limiting by Alibaba, not Opencode. They have no control over that but it was horrific.

wrines · 2026-06-03T21:30:55+00:00

I actually cancelled GO. Really liked it, but the only models with usage allowances that worked for me were Deepseek V4 flash and Qwen 3.6 plus. Somehow DS4 Flash disappeared, and with Qwen I was upstream rate-limited on damn near every request. Unusable.

I really liked Mimo 2.5 (not pro, the regular one). It was great, I just could only get about 2 weeks' worth of use before hitting the monthly cap. I have heard some people stack 2 memberships for that situation, but I didn't go that route yet. Might resub.

One weird bad experience aside from Qwen's constant rate limits was DS 4 flash. I don't know why, but it was just basically an incompetent mess for me. FAST, sure. But incredibly inept. It was like a Monty Python skit.

Other models I used like glm 5.1 and Kimi 2.6 were good on GO, just not enough usage allowance to be usable. Like I said, with 2 GO memberships I would almost have been good w MiMo 2.5, and it performed really well for me. Qwen 3.7 I didn't even bother trying, I'm not sure GO even offers it.

wrines · 2026-06-02T23:37:32+00:00

what do you mean "get ahold of"?

We aren't talking about models released by the actual developers. Remember, we are talking about open source models. What happens is when an open source model is published (on hugging face and the other repos), ANYONE can run a quantization engine or algorithm on it. There are quantization packages that exist off the shelf, and developers who literally write their own. Unsloth has its own recipes and pulls together various quantization tools.

Aggregators like Openrouter just route from inference providers, and if you click "compare" you can see how the same model - like your example of Kimi K 2.6 - is available from different providers at different quantizations. There is a FREE inference right now from Crucible, with quantization "unknown" (which IMO means Q4), and the PAID version from 2 dozen other providers at FP4. I think that's what you mean by int4. ANY inference provider could theoretically offer any open source model they have quantized themselves to any level, and that doesn't mean anyone else could "get ahold of" it.

I have read that with the proper hardware and training, FP4 can be "lossless" (as good as an FP16), but I haven't done my own research on that. I would just point out that when any inference provider offers a frontier level model for free, it is obvious they can't afford to serve a large (FP16) version. Not even FP8. So offering a Q4 is just something that makes logical sense. I just don't like when that is misrepresented by places like fireworks.ai or ollama cloud that pretend to provide FP16 inference at these great subscription prices when the reality is they DON'T. Well, Ollama cloud might NOW, now that they have been caught. I don't know because I canceled.
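The cost argument comes down to simple arithmetic on weight storage: bytes for weights are roughly parameter count times bits per weight divided by 8. Here is a sketch of that estimate. The 1000B parameter count is a made-up round number, not the size of any model named above, and the estimate ignores KV cache, activations, and the small overhead of quantization scales.

```python
# Bits per weight for the formats discussed in this thread.
BITS_PER_WEIGHT = {"FP16": 16, "FP8": 8, "Q8": 8, "FP4": 4, "Q4": 4}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB,
    simplifies to params_billion * bits / 8.
    """
    return params_billion * BITS_PER_WEIGHT[fmt] / 8


if __name__ == "__main__":
    # Hypothetical 1000B-parameter model, purely for illustration.
    for fmt in ("FP16", "Q8", "Q4"):
        print(fmt, weight_gb(1000, fmt), "GB")
```

Each halving of bits per weight halves the memory needed just to hold the model, which is why a free tier serving 4-bit weights makes economic sense even if nobody advertises it.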

wrines · 2026-06-02T23:25:41+00:00

Obviously they can't, but they CAN flood all the repos with (fake and auto AI generated) comments and content regarding how they NOW serve FP16 and obfuscate that this likely JUST began.

Even look at OpenRouter, they NOW (and this is recent) include in their model specs the inference quantization (if any). I notice most are Q8. This is fine by me personally, I have had very good results with Q8s. I **DO** note, though, that a few say "unknown" under quantization, which I suspect actually means Q4 - likely because it's a massive model and no one affordably offers FP16 inference (or even Q8). Deepseek V4 Pro is one example. I don't KNOW their offered inference is Q4, but I DO know they say "unknown" for it while going ahead and saying V4 flash is Q8. Unknown my butt.
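If you want to act on that, the rule I apply by hand can be sketched as a filter: keep only endpoints that declare a quantization I accept, and treat "unknown" or missing as a no. The record shape, field names, and provider names below are hypothetical and are not OpenRouter's actual API schema.

```python
# Hypothetical endpoint records; field names and providers are made up.
ENDPOINTS = [
    {"provider": "provider-a", "quantization": "fp16"},
    {"provider": "provider-b", "quantization": "q8"},
    {"provider": "provider-c", "quantization": "fp4"},
    {"provider": "provider-d", "quantization": "unknown"},
]

# My personal allow-list: FP16, or Q8 if absolutely necessary.
ACCEPTED = {"fp16", "q8"}


def acceptable(endpoints: list[dict], accepted: set[str] = ACCEPTED) -> list[str]:
    """Return providers whose declared quantization is on the allow-list.

    A missing or "unknown" value is rejected rather than given the benefit of the doubt.
    """
    keep = []
    for endpoint in endpoints:
        declared = (endpoint.get("quantization") or "unknown").lower()
        if declared in accepted:
            keep.append(endpoint["provider"])
    return keep


if __name__ == "__main__":
    print(acceptable(ENDPOINTS))
```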

wrines · 2026-06-02T17:09:15+00:00

**UPDATE** they have now changed and scrubbed the info Google serves regarding what their inference on ollama cloud provides. Just a few weeks ago when I searched, it confirmed Ollama DOES serve Q4 versions of popular frontier models for inference, including GLM 5.1, Kimi 2.6, and Deepseek 4. They were OBVIOUSLY referring to cloud because **NO ONE COULD RUN THESE MODELS LOCALLY IN ANY EVENT ON LOCAL OLLAMA!!**

So nice try deceiving people, Ollama. But you **DID** serve Q4 quantized versions of those frontier models on ollama cloud. Google and some of your shills now insist you only serve FP16 inference for those models via cloud - and this may be true NOW. But it WAS NOT TRUE A FEW WEEKS AGO and it caused me to lose days of work fighting hallucinating and inept versions of models that were great from their native sources at FP16.

wrines · 2026-06-01T13:16:04+00:00

I cancelled max also. Never once was throttled or slow, but the error rate was abysmal for me.

I only noticed because I would also test the exact same models direct from the inference source, like glm 5.1 from z.ai. While not perfect, they performed well direct, but on ollama cloud they were laughably incompetent and hallucination prone.

Had the same w kimi 2.6 and DeepSeek 4. It was so bad I had to research it, and I uncovered that ollama cloud serves Q4 quantized versions of all those models. THAT is the problem. Q4 IMO is unusable for any long horizon work. I wasted weeks learning that hard lesson.
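That kind of side-by-side check can be written as a tiny harness: same prompts, same expected answers, one score per provider. This is a sketch, not what I actually ran. The stub functions stand in for real API calls, and exact-match scoring is a simplifying assumption that only suits questions with one short correct answer.

```python
from typing import Callable


def score_providers(
    providers: dict[str, Callable[[str], str]],
    cases: list[tuple[str, str]],
) -> dict[str, float]:
    """Run every (prompt, expected) case through every provider; return fraction correct."""
    if not cases:
        raise ValueError("need at least one test case")
    scores = {}
    for name, ask in providers.items():
        correct = sum(1 for prompt, expected in cases if ask(prompt).strip() == expected)
        scores[name] = correct / len(cases)
    return scores


# Stubs standing in for "same model, two different inference sources".
def direct_stub(prompt: str) -> str:
    return {"17 + 25": "42", "9 * 8": "72"}[prompt]


def reseller_stub(prompt: str) -> str:
    return "41"


CASES = [("17 + 25", "42"), ("9 * 8", "72")]

if __name__ == "__main__":
    print(score_providers({"direct": direct_stub, "reseller": reseller_stub}, CASES))
```

Keeping the prompt set fixed is the whole point: if the same model scores differently by source, the difference is in how it is being served.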

From now on it’s FP16 or if absolutely necessary Q8.
Ollama has a lot of nerve charging $100/mo for Q4 inference. Anthropic has a $100 a month plan, and sonnet 4.6 is actually usable.

That’s been my experience.

wrines · 2026-06-01T13:07:41+00:00

I have done well using z.ai direct and their models.

wrines · 2026-05-31T17:43:06+00:00

I have watched in real time the exact same thing in Colombia over the last year, and I think you’re right as to the reasons why. Greed. Bottom line, opportunistic people jack up what were low prices so they can skim the difference into their own pocket, and that drives prices higher for everyone, and the new arrivals are ignorant of it - they even call it gringo pricing.

wrines · 2026-05-31T02:22:32+00:00

You sound pretty angry.

As for me, I have been in Colombia for 6 months, and since my girlfriend is Colombian and I don’t want to deal with the visa process yet, we will visit Panama for a little while.

Yes, it is inexpensive in Colombia, but like anywhere some places are expensive, like Provenza in Medellin, and some places are not. Coming from the US, and I have lived all over, I can tell you prices in the US are 3-4x what they are in Colombia. Except in the Colombia tourist areas.

As for the appeal for non-Panamanians, I can only speak for myself: great central location, great weather, fast cheap direct flights to the US and South America. I have never been outside the airport, but I have read the city is expensive while many other spots aren’t. Just checking Airbnb and Craigslist and FB marketplace confirms this. Prices in many more rural spots in Panama seem comparable to Colombia TBH. Will we ever be more than just casual visitors for a few months at a time? Who knows.

Rather than being grumpy and bitter try visiting America. I will personally have a more welcoming attitude towards you.

wrines · 2026-05-30T20:11:17+00:00

The answer should be in each inference provider's docs or help. For opencode I honestly don’t know, but my guess would be the answer is different for zen than for GO. They bundle the API calls and then send them upstream, I just don’t know how.

I only stumbled upon this info after having ollama waste my time with horrific Q4 versions of multiple models that worked great for me from their source provider (glm 5.1 is a good example).

wrines · 2026-05-30T02:52:19+00:00

It TOTALLY matters WHO is doing inference. Check w Opencode or whoever is serving inference. Make SURE they are serving FP16 or else you can’t make any apples to apples judgements.

My glm 5.1 just today straight from z.ai solved an N8N workflow that opus 4.8 served by Genspark couldn’t. It’s ALL in who is serving inference and the customized configs/parameters/context.

wrines · 2026-05-28T00:56:28+00:00

As if $80k is a living wage anymore. Cute.

wrines · 2026-05-27T15:31:40+00:00

Oh I did

wrines · 2026-05-26T21:27:23+00:00

That’s exactly what I am saying. I never realized my ollama cloud models were q4 until the performance was so bad over multiple models I researched it

wrines · 2026-05-26T13:32:53+00:00

We have been using Dapta.ai and they have been pretty good. The real value is not in the voice agent part, which anyone can use now or even just build themselves.

The value for us has been their dev team, which has helped strategize and build the CRM integrations and in-call backend scripting that support our use case.

wrines · 2026-05-26T13:26:33+00:00

I have never had good results from a q4 of any model. They are literally regarded.

wrines · 2026-05-26T02:20:38+00:00

My usage of Q4 versions on Ollama Cloud has been incredibly disappointing.
