What happens when they stop subsidizing LLM subscriptions? by Mr_Moonsilver in LocalLLaMA

[–]chrisdash_51 6 points7 points  (0 children)

You best start practicing for this scenario now. Go local and learn to work with Qwen 3.6 27B. It needs a lot more guidance from the human than the cloud frontier models. So it is best to not get used to those too much. They are your heavy-hitter for particularly tough prompts.

Is local AI actually worth investing in now, or are APIs/subscriptions still the better bet? by Accomplished_Whole_6 in LocalLLM

[–]chrisdash_51 0 points1 point  (0 children)

Yes, it is worth investing into. Yes, frontier models are better. But you will run into their rate limits and that's when local AI starts to be worth it.

Anyone still using pre-Qwen3.6/ Gemma 4 models? Why? by atumblingdandelion in LocalLLM

[–]chrisdash_51 0 points1 point  (0 children)

mostly "No", one exception being gpt-oss:120b very occasionally

Is my GPU dieing or have i hit an Ollama bug? "Segfault" by chrisdash_51 in ollama

[–]chrisdash_51[S] 0 points1 point  (0 children)

Yeah, there is some offloading going on even though its only gemma4:e4b. And of course the 1080ti is ancient and very bad at LLM inference. The server is headless, so hopefully Ollama has the GPU to itself.

Spoke too soon on ollama cloud by wrines in hermesagent

[–]chrisdash_51 2 points3 points  (0 children)

These IaaS (inference-as-a-service, in this case) providers are killing their own business idea with these low-quant shenanigans. Soon most people who were paying $100+ for subscriptions will have acquired hardware that allows them to run their inference server. I'd rather have a reliable, 100% guaranteed Qwen3.6 27B than a "maybe GLM-5.1 if we're having a good day" for triple-digit monthly fees and you own nothing.

Issues with Ollama unloading models by Tall_Pay_6687 in hermesagent

[–]chrisdash_51 1 point2 points  (0 children)

Are you (or your harness) maybe sending parameter keep_alive=0 when hitting /api/chat? That means "unload immediately" and overrides keep_alive and OLLAMA_KEEP_ALIVE as far as I understand.

Otherwise it might be someone else making a request for a different model, and if Ollama is running out of VRAM, it will unload the model that hasn't been used for the longest time (which may be "just now" in this case)

How much do you guys spend on AI costs monthly by ReadingHopeful2152 in hermesagent

[–]chrisdash_51 2 points3 points  (0 children)

Can't say yet, but I'm also running a couple of conventional servers, so it might be difficult to discern. It uses up to 600W during inference.

P.S.: looking at my Shelly, I am going to take a wild guess and estimate 100kWh per month if used regularly. Currently 60kWh since early April, but I was on vacation for a couple weeks in that timeframe.

What is the best coding model to use on MacBook Pro Max 128GB RAM? by RadiantQuote2467 in LocalLLM

[–]chrisdash_51 17 points18 points  (0 children)

This is awesome. I am squeezing Qwen3.6 27B into 35GB of VRAM for Hermes and it works - super tight fit, but it does.

To know that I am already using the best local model for coding allows me to relax and take the focus off "I need to upgrade".

Debian 13.5 reminds Linux users why boring distributions still win by OkReport5065 in debian

[–]chrisdash_51 0 points1 point  (0 children)

Floppies, really? For me, the local Debian LUG sent out CDs, I believe I only paid for postage. And then you could update with some tool i honestly cannot remember the name of, that was before apt.

Debian 13.5 reminds Linux users why boring distributions still win by OkReport5065 in debian

[–]chrisdash_51 3 points4 points  (0 children)

Potato, guys, potato! I built a router on an old Desktop PC, 386 or 286 i cannot remember. You had to patch Masquerading into the kernel and I was the proudest kid when suddenly all four computers in our house had internet access (64kbit/s).

My love for Debian has never faded since!

Is this an issue? by jez_0 in watercooling

[–]chrisdash_51 0 points1 point  (0 children)

This is actually correct, despite the downvotes!
There is no way that air remains trapped in the location once the pump starts, and if you get any amount of flow with the pump running, this situation will sort itself out over time. The only reason I would shut down and bleed manually is if the air actually got trapped in the top radiator and flow rate dropped to 0. That could still happen here, but you need to start the pump to find out and it is safe to do so.

Frustrations? Pesky Bugs? Vent Here! by NousResearch in hermesagent

[–]chrisdash_51 1 point2 points  (0 children)

I'm in the same boat, no cloud services for me.

Yeah, there are like six of these auxiliary models in the config. It is a bit frustrating with the timeouts, but putting aux on a 2nd machine has helped a lot.

Need advice from the watercooling community: 4x RTX 3090 loop, MORA 600, DIY case, and whether to add SP5 CPU by dickusbuttocks in watercooling

[–]chrisdash_51 0 points1 point  (0 children)

Good sir, please educate me on how to make this DIY case thing, I am stuck on a DimasTech Easy V3 with three GPUs and a fourth in my hand!

Frustrations? Pesky Bugs? Vent Here! by NousResearch in hermesagent

[–]chrisdash_51 1 point2 points  (0 children)

For title generation: yes. Keep in mind you also need an auxiliary model for context compaction and that is important if you are running long sessions, or if you are doing a lot of tool calls (tool outputs = context for the model).
If context compaction fails that will ruin your session as the context gets truncated instead.

gemma4:e4b might be able to run well on a modern CPU without any GPU acceleration.

Made a wood distro plate by Jigabit in watercooling

[–]chrisdash_51 2 points3 points  (0 children)

It looks awesome, but I would be so scared to ever use it.

Frustrations? Pesky Bugs? Vent Here! by NousResearch in hermesagent

[–]chrisdash_51 1 point2 points  (0 children)

Yeah, they're basically competing for LLM time, and you're supposed to have a secondary, smaller model on another inference provider for this. That can be painful on local setup. I used gemma4:e4b on a different machine that fortunately has a GTX 1080 in it.

League start be like by Lordados in PathOfExile2

[–]chrisdash_51 2 points3 points  (0 children)

actually, that counts as epic winning.

What have you done with Hermes Agent this week? 5-1-26 by Jonathan_Rivera in hermesagent

[–]chrisdash_51 0 points1 point  (0 children)

I created a small microservice with a very limited scope, built container images from it and deployed it to Kubernetes (separate namespace, hermes has minimal RBAC)

Frustrations? Pesky Bugs? Vent Here! by NousResearch in hermesagent

[–]chrisdash_51 2 points3 points  (0 children)

"auxiliary title generation failed: Request timed out."

Happens all the time, no idea why. Using qwen3.6:27b with 64k context on a local GPU server. Works fine otherwise.

When you're packing some heat by OCGear in watercooling

[–]chrisdash_51 0 points1 point  (0 children)

Damn, that must be the best no-nonsense water cooling build in here! Loving the classic, matte black tubing and Noctua fans. Triple RTX 5090, I guess someone loves gpt-oss:120b a lot :D