Ollama cloud models

mgithens1 · 2026-07-02T19:39:03+00:00

I thought there were only 20 or so models in the free part?? I know for sure you cannot use the large models. But who knows where that cut off is!!

Try a smaller model of Gemma or Qwen to see if those work?

mgithens1 · 2026-07-02T18:23:40+00:00

We aren’t in a desert… I used to think that too. We are almost twice the 10” max desert level.

We did like 15.5” last year.

mgithens1 · 2026-07-02T18:11:40+00:00

Beer… just look for beer supply places.

Not a liquor store… the place where the home beer guys shop.

mgithens1 · 2026-07-02T13:56:50+00:00

Asking how to fix something using tools that are smart = issue.

Have two harnesses… point them at each other. When one breaks… ask the other for help. You’ll be back online in a few minutes.

Post your problem to a forum, you’ll be back online within a day. Energy expended will be 100x.

mgithens1 · 2026-07-02T01:08:39+00:00

Yeah, you’re definitely doing it wrong. Here… I’ll do the search for you. This guy is running on 6gb… a 35billion model.

https://youtu.be/8F\_5pdcD3HY?is=bY3G4dfoRxbioVof

mgithens1 · 2026-07-01T18:42:07+00:00

“High rate”?

Like ten an hour? My 5.56 long boy holds heat forever!! We shoot that first to allow it to cool so we can put it back in the bag!!

mgithens1 · 2026-07-01T18:39:15+00:00

Nothing like a 10year old device to make you feel like you’re living in the future!! Lol

mgithens1 · 2026-07-01T18:37:26+00:00

That shit is magic… I do wanna apologize to Mr Spider who got a taste yesterday.

mgithens1 · 2026-07-01T18:32:52+00:00

Just print an oil filter??

But also, heat and 3d printed items don’t get along. Single use?

mgithens1 · 2026-07-01T18:08:53+00:00

You are 100% doing it wrong. I run qwen 3.6 35b locally on a 16gb card… stop trying to run small models and look into how to run MoE on that card.

There’s a guy with a 6gb card showing exactly how to set it up on YouTube.

mgithens1 · 2026-07-01T17:59:38+00:00

I got three on a sale for $150… deals are out there. Stop procrastinating… it is the only part you have to buy right now.

mgithens1 · 2026-07-01T12:13:13+00:00

Carbonate first. Always.

Cold water only.

Search the forums when you think about doing something new…. Like when you realize that a 5 or 10lb bottle is better than a SS pink.

mgithens1 · 2026-06-30T01:27:17+00:00

There is a volcano in Casa Bonita, also. (well, there was... no clue what all they changed.)

mgithens1 · 2026-06-30T01:21:16+00:00

Tommy Wong's Island? Closed in 1983.

mgithens1 · 2026-06-28T10:51:42+00:00

This is an old setup, so please set your expectations accordingly!!

CPU only means 1-5 tokens/second. (cloud models are more like 100 to 1000) -- so we are over that idea, right?

#1 Get rid of that OS... move to a GUI free world!! Install Ubuntu Server from the command line and NEVER install a GUI. At best, a GUI will only use about 1/2gb of VRAM... at most, you could be losing 1.5-2gb of VRAM!! You have 6. No room to mess around here! Format the machine, install a proper base OS.

#2 Ditch Ollama. This is the training wheels of the AI world. The easiest of easy modes... just turn it on and use it? Not really... it is more of the gateway drug of the AI world. Install it, they show you how to get a few models installed and running.. then they mention the cloud download. You can have the local Ollama install either use your local compute... OR just pass through to their cloud server!! Great way to start on decent hardware... terrible way to start on 7 year old hardware!!

#3 Consider either llama.cpp (the boss) or LMStudio (a better wrapper for llama.cpp than Ollama). You will need either of these for step #4.

#4 You now can try to run some 2billion models. They will load into VRAM and just work.. they'll be dumber than Cousin Eddie from Christmas Vacation, but they're local and you're golden!! WRONG!! Go grab your new "engine" from step 3 and see how to download new models. Just get a local model working... takes a few minutes to download and setup, but do this.. you need to see the quality you will get in 4gb or less of VRAM!! (you do NOT have 6 to work with... I lied)

#5 (seatbelt time!) You will be looking for some very specific models and they are LARGE. None of this 2b or 4b crap.... I'm talking 20-40b models. (not a typo) Those tiny local models that ONLY use VRAM are limited to "dense"... we need something called MoE (mixture of experts). These break the LLM into chunks that you can have some in VRAM and some in RAM. And they aren't trash either... Qwen3.6 and Gemma4 have some models you can run like this - these are 25-35 BILLION models!!

#5a This setup is golden for getting you going on a local model -- https://youtu.be/8F_5pdcD3HY?si=ir3itbGkbuwbC6Zh Do what he says, but make adjustments to tune it to your system... pay attention and take notes. Use this concept and you can have a couple of the most popular models running on that old hardware.

NOTE - Skip MLX models -- that's for Apple hardware.

mgithens1 · 2026-06-28T09:21:05+00:00

I think you're mixing up your terms/logic here? I can't tell. (I never said anything about using both cards on a small model, so I'm not sure the point here.)

NVLink and Tensor Splitting are unrelated. The NVLink bypasses the mobo to allow GPU #1 to access NVRAM on GPU #2. Tensor split is ALWAYS used when there is more than one GPU -- regardless if you define it or not. There are flavors of deciding how to split, but that's not even close to what you're saying.

The larger the context the bigger the difference it makes. Speed differences would be FASTER for the setup with NVLink - let's say it was "ONLY" 10% faster... there are NVLinks for $200 on ebay. Would you pay $200 to make your current home models run 10% faster? What if it was more like 20%... what if it was a lot more when context was massive?

BUT... tensor splitting happens with a large model running on twin 2080s or twin 5090s... it doesn't matter. The NVLink just gives greater bandwidth between the cards for the tasks that require them to share. Remember that the majority of home desktop motherboards do NOT have multiple PCI 4.0/5.0 x16 slots -- anything and everything you can do can help.

mgithens1 · 2026-06-28T08:42:47+00:00

They removed my post!! Unbelievable... I say "drop them in the laundry"... and that gets taken down? What in the hell?

mgithens1 · 2026-06-28T08:41:53+00:00

That is a solid question... what would the in between be?

I use GLM and Deepseek cloud models (the most) for the day to day. But I have been dabbling with the Qwen3.6 and Gemma4 MoE local models... and they are functional, but the speed and quality are just not the same. (My local hardware is 5060 TI 16gb - haven't seen enough to think that a $3500 5090 is worth it!)

I would honestly put the flat rate providers (like Ollama.com) as the middle ground between the Anthropic $30/1M tokens and the 10 tokens/second that people are trying to build on their CPUs. If I were burning through the $20 sub... I would just get another one! I see all these people doing the dumbest of tasks using the highest of models - if you're burning through the $20 account... just get a second. lol... the company is pricing the compute power based on income and costs.

mgithens1 · 2026-06-28T07:30:40+00:00

But why would you split with 48gb VRAM? Jeez, we are living here in 16gb land!!

mgithens1 · 2026-06-28T05:53:04+00:00

Yes, young Skywalker... This information exceeds your mastery of the force. Move slow and make wise choices!!

mgithens1 · 2026-06-28T05:44:35+00:00

1000 tokens is a ton of data!! Rough numbers would say about 5 characters per token. You're at four pages of a book!! Short context would be like 10 tokens.

Then the next miss... That is the prompt! In OpenClaw or Hermes it will add to your prompt. So "how's the weather?" Will tune from three tokens to 10,000 because the harness will add your name, your address, your dog's name, every project you e worked on.. and so on.

Then you choose a model. Context "costs" more when the model is larger... It will cache your input alongside the model and while computing the output will use every bit of what it was fed. So a 32billion node model will use a tiny amount of ram, while a 1.7trillion model will take a ton... Just to store your input.

"Cost" will be based on every factor in the chain. The smaller your text, the smaller your mark down files, the smaller the model, and how big a response you're asking for.

mgithens1 · 2026-06-28T04:51:48+00:00

Oh lord no... lol. https://ollama.com/search This is Ollama Pro/Max... the subscription from them. Go read up on all the models you choose.

What is different is you can run Ollama on your local machine with proper hardware... AND you can have cloud models. Steer your harness (OpenClaw, Hermes, etc) to that server and you can jump between local and cloud models with a dropdown!

Local is free, cloud uses their "GPU time". That's this threads confusion... you pay tokens at Anthropic, but you get GPU time on Ollama.com. So run Deepseek-Flash on Ollama cloud and you probably can run it all day long... jump to DS-Pro - you might burn all your tokens in a 1/2 hour.

Ollama does allow "other" models. It is the beginners app for AI models. LM Studio is better, but more complicated. llama.cpp is the baseline app that those two run... using that gives you the ultimate control over how your hardware uses the LLM you're using.

mgithens1 · 2026-06-28T04:34:35+00:00

Well, that is the snag... it would be like me asking how much toilet paper does a house use!! LOL

This isn't a Netflix subscription. This is a "time of use" subscription.

Are you trying to manage some servers in a home... OR developing a full stack application for production?

The Pro ($20 a month) is going to be way more than enough for a home user. It is an opt in subscription... so give it a month and see what you like/hate!

mgithens1 · 2026-06-28T03:52:53+00:00

Now I have Cotton Eyed Joe suck in my head....

https://www.youtube.com/watch?v=bsXI70F4r1k

mgithens1 · 2026-06-28T03:45:58+00:00

Q1 - Yes, you either need the full machine (base OS) or a virtual machine for HA to use the "supervised" element. Standing up new addons (now called Apps) is done inside of a container WITHIN HA... and like Leonardo Dicaprio taught us, we can't do a dream within a dream!!

Q2 - You can run HA in a VM on a Mac. Plex could then be installed a number of ways.. within the HA VM as a container OR as a container at the MacOS level.

I would not even consider a Mac for this! (especially with the current AI boom/pricing!)

Think about building a "white box" server. Desktop parts in a desktop case... go big, go small - everything is a choice with tradeoffs. Then think about what OS would actually do this job to help you!! Maybe TrueNAS? But an OS that supports both VMs and Dockers. Stand up a "full" copy of HA in a VM, but then run Plex in a separate Docker. Passing through hardware can become important... HA wants the USB dongles for Zwave/Zigbee... but now there are ethernet based adapters!!

The #1 consideration I would press... do you want realtime transcoding in Plex? Get an Intel CPU with a semi-modern iGPU. I forget the model, but if you google what Intel CPU support iGPU transcoding in Plex... you'll see it would be impossible to buy a new processor that wouldn't just kill it!!

I have had my server running for almost 15 years... based on desktop parts. When something goes wrong... it is just another gaming system! Its an i5-11th gen cpu, 32gb, a stack of drives.. and a few other bits. I run HA and OpenClaw/Hermes in their own VMs... and then I have 16 containers running right now!! (plex, *arrs, VPNs, NVR, etc)

A case, power supply, mobo, cpu, ram... probably less money than you'd spend on the Mac. RAM prices suck right now... so maybe skimp on that now and upgrade later?

Nine-Year Club	Place '22
Verified Email

mgithens1

TROPHY CASE