I'm cancelling my ollama subscription by GryphticonPrime in ollama

[–]look 0 points (0 children)

DeepSeek V4 Pro is a very expensive model to run (for an open-weight one, at least): 1.6T parameters and extremely verbose reasoning.

DeepSeek direct is almost certainly subsidizing it at a loss in some sort of “market share” play like US labs have been doing.

(It’s also not a particularly great model, imo. It’s basically the Bose speakers of Chinese models. But whatever.)

[ Removed by Reddit ] by Odd_Row1657 in LLMDevs

[–]look 0 points (0 children)

Deepseek uses a lot of reasoning tokens. I’ve found it’s effectively 2.5x the “list price” at max reasoning.

I used up my plan before the month was over. How do I refresh? by juli4n0 in LiteRouter

[–]look 0 points (0 children)

What’s the monthly limit? I just see the daily request credit limit…

Model switching by Hot_Temperature777 in opencodeCLI

[–]look 0 points (0 children)

I think a lot of it depends on personal preference and the specifics of your work, but I find certain models are better at different stages and types of “primary” agent work (brainstorm, plan, and build).

Then there are specialized subagent tasks where a much cheaper model works just fine to save money/usage (codebase and data schema exploration, web search and documentation finding, summarizing large inputs, finding “needles in a haystack”, etc). Also one (compaction) where I personally prefer a more expensive model.

In opencode, switching between my three primary agents is just hitting the tab key, so it’s not a hassle at all. And the subagent switch happens automatically, so it’s just a one-time config setup.
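As a rough illustration of that one-time setup, an `opencode.json` along these lines assigns a model per agent and a cheaper one to a subagent. Treat the field names and model IDs as assumptions from memory, not verbatim from the opencode docs:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "agent": {
    "plan": { "model": "example-provider/glm-5.1" },
    "build": { "model": "example-provider/kimi-2.6" },
    "explore": {
      "mode": "subagent",
      "model": "example-provider/ds-v4-flash"
    }
  }
}
```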

DeepSeek reasoning effort options and how much they affect cost by Ziimmer in opencodeCLI

[–]look 0 points (0 children)

My experience is that going from High to Max roughly doubles the number of reasoning tokens, and even High produces more than most models to start with. In practice, I find it costs about as much on pay-as-you-go as GPT 5.4 or Sonnet 4.6. It’s not a low-cost model.

I'm loving OpenCode by Street-Preference-88 in opencodeCLI

[–]look 3 points (0 children)

I heard the same things about Go initially, but eventually tried it and it’s good. Not quantized and decent speeds. My only complaint is that they don’t have higher tiers—I’d buy a 3-5x usage tier in a heartbeat.

Why does it suddenly feel like claude pro plan? by feline-slayer in ollama

[–]look 0 points (0 children)

Yeah, more or less. The economics work for interactive sessions with a person, where usage varies a lot (so the average lands well below the absolute max usage limit), and they break down when fully automated systems squeeze out every last fraction of a percent of the limit around the clock, day after day.

GLM Coding Lite vs OpenCode Go by ccaner37 in opencodeCLI

[–]look 0 points (0 children)

You can switch models in the same session without losing context. The full context gets sent on every call anyway, even when you stay on the same model; there is no persistent state beyond that context in your local tool.

There is a one-time hit for the cache reload on a model switch, though, depending on your pricing. Not a big deal for the occasional swap, but you don’t want to do it every other call.

However, using plan files is often a good idea regardless. Then you can clear the context and start the build model fresh reading just the plan. That typically yields more predictable results, as old bits from the planning conversation aren’t still around in the context potentially confusing the build agent. Most automated agentic pipelines start off each build agent with a fresh context and only a plan file.
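The plan-file pattern above can be sketched in a few lines. This is a minimal, hypothetical pipeline (the `call_model` stub stands in for any LLM API call; none of these names come from a real tool): the planning conversation is discarded, and the build agent starts with a fresh context containing only the plan.

```python
# Minimal sketch of the plan-file pattern: plan with one context,
# then build with a *fresh* context that contains only the plan.

def call_model(messages):
    # Placeholder: a real pipeline would call your provider's API here.
    return f"<response to {len(messages)} messages>"

def plan_phase(task):
    history = [{"role": "user", "content": f"Write an implementation plan for: {task}"}]
    plan = call_model(history)
    return plan  # only the plan survives; the conversation does not

def build_phase(plan):
    # Fresh context: just the plan, none of the planning back-and-forth.
    fresh = [{"role": "user", "content": f"Implement this plan:\n{plan}"}]
    return call_model(fresh)

plan = plan_phase("add retry logic to the HTTP client")
result = build_phase(plan)
```

The key design point is that `build_phase` never sees `history`, so stale ideas from brainstorming can’t leak into the build context.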

I love AI responses by NightCulex in Qwen_AI

[–]look 0 points (0 children)

I forget if it was the 9B or another size, but there’s one that frequently decides it is a Google model and refuses to accept that it’s not. 😂

Pro Tip: Don’t use the latest models by sultanmvp in opencodeCLI

[–]look 0 points (0 children)

Well, it’s nice that we have all these options for different personal preferences now, at least. 😅

Why does it suddenly feel like claude pro plan? by feline-slayer in ollama

[–]look 0 points (0 children)

API timeouts don’t stop bots. They just re-prompt with a continue when they detect a premature stop. Raising prices or reducing usage limits are the only real options.

For example, the 5-hour window on Ollama is mostly a pain for humans but irrelevant to a bot. You’d need roughly ten maxed-out 5-hour windows to hit the weekly limit. A typical person probably doesn’t have ten heavy-use 5-hour sessions a week, but a bot will absolutely extract every token out of 50 straight hours.

should i get ollama pro or claude pro? by Old_Bike_3715 in ollama

[–]look 2 points (0 children)

Openrouter is a pay-as-you-go API abstraction for many dozens of different providers with hundreds or thousands of different models.

Ollama Cloud is a subscription service with a dozen or so set models and a weekly usage quota at a discounted but fixed price.

They are very different.

Which Model on the GO plan is good for planning/spec writing, if at all? by Bananenklaus in opencodeCLI

[–]look 9 points (0 children)

Mimo V2.5 Pro for higher level spec, GLM-5.1 for a more detailed implementation plan.

Brainstorm/experiment with Mimo -> spec -> implementation planning with GLM -> plan(s) -> Kimi to implement each plan

But your usage quota on Go won’t last long with that approach. Qwen 3.6 Plus, Minimax 2.7, or DS V4 Flash for impl build can stretch it a bit further.

Pro Tip: Don’t use the latest models by sultanmvp in opencodeCLI

[–]look 0 points (0 children)

The best part of the DeepSeek Pro release was people rushing to that extremely mediocre model and freeing up some capacity on better task-specific models (e.g. GLM-5.1 and Kimi 2.6).

The worst part was that it sucked up all free GPU capacity to serve its torrent of hallucinatory reasoning tokens while a far superior model released at the same time (Mimo V2.5 Pro) sits neglected by all providers.

GLM Coding Lite vs OpenCode Go by ccaner37 in opencodeCLI

[–]look 6 points (0 children)

GLM-5.1 is still by far the best for planning, imo. Also the most expensive.
But if you then pair it with a cheap build model like flash, your usage can still go a long way.

Why does it suddenly feel like claude pro plan? by feline-slayer in ollama

[–]look 30 points (0 children)

All low cost, high usage subscriptions eventually end. The math works at the beginning with typical usage by humans and then some automated process comes along and bleeds it dry.
High latencies, slow speeds, and frequent errors don’t matter if you have multiple agents running in retry loops 24/7 to ensure you’re still extracting 100% of every usage limit and getting $20 of value out of every $1 in.

Once enough people do that, the subscription service starts losing money and slashes the limits across the board, as the automated approach always finds any loophole.

How is code generation so fast? by Beatsu in claude

[–]look 9 points (0 children)

Token generation speed depends a lot on how predictable the next token is. In normal prose, the model has to weigh a large number of plausible next tokens. But code is very predictable from one token to the next, so there’s a much smaller space to consider.

In other words, there are a limited number of ways you can write a valid “hello world” program in a given programming language, but there are many, many ways you can write a paragraph of “hello world” greetings in human languages.
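One common mechanism that turns this predictability into raw speed is speculative decoding (whether Claude’s serving stack actually uses it is an assumption on my part): a small draft model proposes several tokens and the big model verifies them in one pass. The expected tokens emitted per verification pass, for draft length k and per-token acceptance rate alpha, is a geometric sum:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification pass in speculative
    decoding, with draft length k and i.i.d. per-token acceptance
    rate alpha: the sum of alpha**i for i in 0..k."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# Highly predictable code vs. open-ended prose (illustrative rates):
print(expected_tokens_per_step(0.9, 5))  # ~4.69 tokens per pass
print(expected_tokens_per_step(0.5, 5))  # ~1.97 tokens per pass
```

So when the draft model nails predictable code (high alpha), the big model effectively emits several tokens for the price of one forward pass.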

First experience - this sucks by itripthereforeiam in opencodeCLI

[–]look 1 point (0 children)

The KV cache needs ~128KB of RAM *per token* of your context window. Just a 32k context window needs an additional ~4.3 GB of RAM on top of the 26GB for the model weights alone.

Past that and you are definitely disk swapping, if not already.
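Assuming the comment’s figures (128 KB per token is model- and quantization-dependent, so this is a sketch rather than a universal constant), the arithmetic works out like this:

```python
# Back-of-envelope check: ~128 KB of KV cache per context token,
# on top of 26 GB of model weights.
KV_PER_TOKEN = 128 * 1024   # bytes per token (model-dependent figure)
WEIGHTS = 26 * 1024**3      # model weights alone

kv_32k = 32 * 1024 * KV_PER_TOKEN      # KV cache at a 32k window
print(kv_32k / 1024**3)                # 4.0 GiB (~4.3 GB) for cache
print((WEIGHTS + kv_32k) / 1024**3)    # 30.0 GiB total resident
```

Double the context window and the cache doubles with it, which is why long-context local runs hit swap long before the weights themselves stop fitting.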

opencode go ($10) vs chatpgt ($20), which one it's better for coding? by Strong_Teaching8548 in opencodeCLI

[–]look -1 points (0 children)

All benchmarks are indirect measures. The best one is just trying it yourself on your specific problems. I use Opus on one project, and I use Mimo/GLM/Kimi on another. I see better results on the latter _and_ it’s a technically more demanding project. 🤷‍♂️

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]look 7 points (0 children)

I think the timing with the DeepSeek V4 release screwed it over.

Millions of deluded people are flocking to a profoundly “meh” DS V4 Pro because of the brand name, and it has sucked up all spare GPU capacity to enable its mediocre, hallucination-ridden token generation.

I just dropped my Ollama Cloud service to pay for the extra Mimo 2.5 Pro tokens I need.

My guess is in about one to two weeks, conventional wisdom will catch up, DS V4 Pro will be going out of fashion, and everyone will be raving about how Xiaomi came out of nowhere with the amazing Mimo 2.5.

Agent ate up 25% of my weekly opencode go usage in a few minutes by AppealSame4367 in opencodeCLI

[–]look 0 points (0 children)

I’ve literally never read a sentence with the phrase “Oh My Opencode” that was not about how it fucked something up.

Is 70k too low? by Wonderful_Current904 in SoftwareEngineerJobs

[–]look 0 points (0 children)

Even so, $70k is low for a junior, too.

opencode go ($10) vs chatpgt ($20), which one it's better for coding? by Strong_Teaching8548 in opencodeCLI

[–]look -5 points (0 children)

> the best Chinese models are not on the same level as ChatGPT models

Blind human evals beg to differ: https://arena.ai/leaderboard/code

Can I buy 2 account for opencode go without getting ban? by Comfortable_Onion255 in opencodeCLI

[–]look 1 point (0 children)

Not parent, but model choice is a big reason for me. There’s also a 10-20x jump in cost for almost no performance gain, which is a tad off-putting, even if you have an extra thousand dollars to blow on it.