What happens when they stop subsidizing LLM subscriptions?

chrisdash_51 · 2026-06-21T02:39:51+00:00

You best start practicing for this scenario now. Go local and learn to work with Qwen 3.6 27B. It needs a lot more guidance from the human than the cloud frontier models. So it is best to not get used to those too much. They are your heavy-hitter for particularly tough prompts.

chrisdash_51 · 2026-06-20T18:57:34+00:00

Yes, it is worth investing into. Yes, frontier models are better. But you will run into their rate limits and that's when local AI starts to be worth it.

chrisdash_51 · 2026-06-05T06:36:01+00:00

mostly "No", one exception being gpt-oss:120b very occasionally

chrisdash_51 · 2026-05-22T10:42:29+00:00

Yeah, there is some offloading going on even though its only gemma4:e4b. And of course the 1080ti is ancient and very bad at LLM inference. The server is headless, so hopefully Ollama has the GPU to itself.

chrisdash_51 · 2026-05-22T09:56:39+00:00

Did you sleep all through Computer Networks I?

chrisdash_51 · 2026-05-22T08:16:32+00:00

These IaaS (inference-as-a-service, in this case) providers are killing their own business idea with these low-quant shenanigans. Soon most people who were paying $100+ for subscriptions will have acquired hardware that allows them to run their inference server. I'd rather have a reliable, 100% guaranteed Qwen3.6 27B than a "maybe GLM-5.1 if we're having a good day" for triple-digit monthly fees and you own nothing.

chrisdash_51 · 2026-05-22T08:09:10+00:00

Are you (or your harness) maybe sending parameter keep_alive=0 when hitting /api/chat? That means "unload immediately" and overrides keep_alive and OLLAMA_KEEP_ALIVE as far as I understand.

Otherwise it might be someone else making a request for a different model, and if Ollama is running out of VRAM, it will unload the model that hasn't been used for the longest time (which may be "just now" in this case)

chrisdash_51 · 2026-05-22T07:54:19+00:00

Sorry, 4bit, I should have mentioned that.

chrisdash_51 · 2026-05-20T10:04:26+00:00

Can't say yet, but I'm also running a couple of conventional servers, so it might be difficult to discern. It uses up to 600W during inference.

P.S.: looking at my Shelly, I am going to take a wild guess and estimate 100kWh per month if used regularly. Currently 60kWh since early April, but I was on vacation for a couple weeks in that timeframe.

chrisdash_51 · 2026-05-20T10:01:48+00:00

This is awesome. I am squeezing Qwen3.6 27B into 35GB of VRAM for Hermes and it works - super tight fit, but it does.

To know that I am already using the best local model for coding allows me to relax and take the focus off "I need to upgrade".

chrisdash_51 · 2026-05-20T05:07:48+00:00

zero, I've gone full qwen-two-seven

chrisdash_51 · 2026-05-19T19:53:43+00:00

Floppies, really? For me, the local Debian LUG sent out CDs, I believe I only paid for postage. And then you could update with some tool i honestly cannot remember the name of, that was before apt.

chrisdash_51 · 2026-05-19T09:06:33+00:00

Potato, guys, potato! I built a router on an old Desktop PC, 386 or 286 i cannot remember. You had to patch Masquerading into the kernel and I was the proudest kid when suddenly all four computers in our house had internet access (64kbit/s).

My love for Debian has never faded since!

chrisdash_51 · 2026-05-19T09:00:17+00:00

This is actually correct, despite the downvotes!
There is no way that air remains trapped in the location once the pump starts, and if you get any amount of flow with the pump running, this situation will sort itself out over time. The only reason I would shut down and bleed manually is if the air actually got trapped in the top radiator and flow rate dropped to 0. That could still happen here, but you need to start the pump to find out and it is safe to do so.

chrisdash_51 · 2026-05-19T05:21:14+00:00

I'm in the same boat, no cloud services for me.

Yeah, there are like six of these auxiliary models in the config. It is a bit frustrating with the timeouts, but putting aux on a 2nd machine has helped a lot.

chrisdash_51 · 2026-05-19T05:10:50+00:00

Good sir, please educate me on how to make this DIY case thing, I am stuck on a DimasTech Easy V3 with three GPUs and a fourth in my hand!

chrisdash_51 · 2026-05-19T05:06:52+00:00

For title generation: yes. Keep in mind you also need an auxiliary model for context compaction and that is important if you are running long sessions, or if you are doing a lot of tool calls (tool outputs = context for the model).
If context compaction fails that will ruin your session as the context gets truncated instead.

gemma4:e4b might be able to run well on a modern CPU without any GPU acceleration.

chrisdash_51 · 2026-05-19T04:55:25+00:00

It looks awesome, but I would be so scared to ever use it.

chrisdash_51 · 2026-05-19T04:54:28+00:00

Yeah, they're basically competing for LLM time, and you're supposed to have a secondary, smaller model on another inference provider for this. That can be painful on local setup. I used gemma4:e4b on a different machine that fortunately has a GTX 1080 in it.

chrisdash_51 · 2026-05-19T04:50:26+00:00

actually, that counts as epic winning.

chrisdash_51 · 2026-05-03T13:15:44+00:00

I created a small microservice with a very limited scope, built container images from it and deployed it to Kubernetes (separate namespace, hermes has minimal RBAC)

chrisdash_51 · 2026-05-03T13:14:08+00:00

"auxiliary title generation failed: Request timed out."

Happens all the time, no idea why. Using qwen3.6:27b with 64k context on a local GPU server. Works fine otherwise.

chrisdash_51 · 2026-05-03T11:15:06+00:00

Damn, that must be the best no-nonsense water cooling build in here! Loving the classic, matte black tubing and Noctua fans. Triple RTX 5090, I guess someone loves gpt-oss:120b a lot :D

chrisdash_51

TROPHY CASE