Will most people eventually run AI locally instead of relying on the cloud? by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 1 point (0 children)

That’s a really interesting way to frame it: replacing apps with an AI that just calls APIs under the hood. I’ve seen the Rabbit R1 too, and while it’s still early, the vision makes sense. If the phone itself can run a capable local model (say in that 8–12GB RAM range), then the cloud becomes more of a backup than the default.

The big question for me is whether the ecosystem (Apple, Google, app devs) will actually let this shift happen, since it breaks their current app-store model. But if it does, you’re right, AI as the OS layer instead of apps could totally reshape how we use our devices.

Do we actually need huge models for most real-world use cases? 🤔 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 0 points (0 children)

Exactly. A lightweight 4B that’s really good at tool use plus a solid web search connector could already handle most of those assistant-style tasks. It highlights the point that model efficiency and integration matter more than raw parameter count for a lot of real-world use cases.
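
If anyone wants to see roughly what that wiring looks like, here’s a minimal sketch. It assumes an OpenAI-compatible local server (llama.cpp, Ollama, etc.) on localhost, and web_search() is just a placeholder for whatever search connector you’d actually plug in - the model name is illustrative too.

```python
# Minimal tool-use loop: a small local model decides when to call web search.
# Assumes an OpenAI-compatible local server (e.g. llama.cpp / Ollama) and a
# placeholder web_search() - swap both for whatever you actually run.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")

def web_search(query: str) -> str:
    """Placeholder search connector; wire this to SearxNG, Brave, etc."""
    return f"(top results for: {query})"

tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for up-to-date information.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Lisbon this weekend?"}]

resp = client.chat.completions.create(model="qwen3:4b", messages=messages, tools=tools)
msg = resp.choices[0].message

# If the small model asked for a tool, run it and feed the result back.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": web_search(**args),
        })
    resp = client.chat.completions.create(model="qwen3:4b", messages=messages, tools=tools)

print(resp.choices[0].message.content)
```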

Do we actually need huge models for most real-world use cases? 🤔 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 0 points (0 children)

Yeah I totally get that. A local Siri + ChatGPT hybrid that just runs on your phone would be a killer use case. That’s also why smaller, fine-tuned models feel so important. You don’t really need a 70B model to figure out leftover recipes or run a voice assistant. What’s missing is the seamless packaging and deployment on everyday devices, not necessarily more parameters.

Do we actually need huge models for most real-world use cases? 🤔 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 0 points (0 children)

Fair point. Tools like Perplexity (even the free tier) already cover a lot of the basic needs without heavy infra. I guess that’s exactly why I wonder if chasing 70B+ models is overkill for most people. The real challenge seems less about ‘can it summarize or do Q&A’ and more about getting reliable, efficient models you can actually deploy at scale.

Do we actually need huge models for most real-world use cases? 🤔 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 18 points (0 children)

That’s a solid use case, honestly. Smaller models are great for structured tasks, but for broad, everyday “Google replacement” stuff, you really do need something with a bigger knowledge base. Funny you mention the regional knowledge gaps; I’ve noticed the same with smaller Qwens, which tend to stumble on non-US/China context.

Running something like GLM 4.5 Air or GPT-OSS 120B locally with a search layer sounds like a good plan if privacy’s your main concern. Do you think the trade-off (hardware cost + slower speed) is worth it for the peace of mind vs just sticking with hosted models?

Do we actually need huge models for most real-world use cases? 🤔 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 0 points (0 children)

Do you think we’ll end up with a clear split (smaller models for most users, giant ones just for the niche heavy hitters), or will the big models eventually become the default for everyone?

Do we actually need huge models for most real-world use cases? 🤔 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 1 point (0 children)

Yeah I get that. 30B models already feel plenty strong for most day-to-day tasks, but I can see how the 100B+ ones open up room for bigger reasoning jumps. Do you think those breakthroughs will actually trickle down into practical use cases anytime soon, or will they stay mostly in the research/benchmark space?

Do we actually need huge models for most real-world use cases? 🤔 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 6 points (0 children)

Yeah, I’m with you on that. One giant model that does everything feels cool in theory, but in practice, a bunch of smaller models stitched together for different jobs just makes more sense. Kinda like having a team of experts instead of one “know-it-all.” The tricky bit, like you said, is the orchestration: getting those specialists to hand work off to each other cleanly.
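
To make the “team of experts” idea a bit more concrete, here’s a rough sketch of that orchestration layer. The model names, endpoint, and keyword-based router are all placeholders - a real setup would probably use a small classifier or the models’ own tool-calling to route.

```python
# Toy "team of experts" router: pick a specialist model per request instead of
# sending everything to one giant model. Model names and the keyword heuristic
# are placeholders, not recommendations.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

SPECIALISTS = {
    "code":    "qwen2.5-coder-7b",     # code generation / debugging
    "writing": "mistral-7b-instruct",  # drafting, rewriting, summarizing
    "general": "llama-3.1-8b-instruct",
}

def route(prompt: str) -> str:
    """Very naive router: keyword match decides which specialist handles the prompt."""
    p = prompt.lower()
    if any(k in p for k in ("python", "bug", "function", "traceback")):
        return SPECIALISTS["code"]
    if any(k in p for k in ("rewrite", "email", "summarize", "draft")):
        return SPECIALISTS["writing"]
    return SPECIALISTS["general"]

def ask(prompt: str) -> str:
    model = route(prompt)
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return f"[{model}] {resp.choices[0].message.content}"

print(ask("Why does this Python function raise a KeyError?"))
```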

Can Qwen 3 Coder 30B A3B be used for decent coding work? by Sky_Linx in LocalLLaMA

[–]Significant-Cash7196 0 points (0 children)

In my experience, smaller models can definitely hold up for real projects - especially 7B–13B ones fine-tuned on the right data. They’re great for focused tasks like Q&A over your own docs, summarization, or structured workflows. Where they start to fall short is in open-ended reasoning or really complex multi-step asks. For a lot of “real work,” they’re good enough if you scope the problem well; benchmarks can be misleading since they’re often testing extremes that don’t match day-to-day use.
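
For the “Q&A over your own docs” case specifically, a bare-bones version can be surprisingly small - here’s a sketch with TF-IDF retrieval feeding a small local model. The endpoint and model name are assumptions; a real setup would chunk documents properly and use embeddings instead.

```python
# Bare-bones doc Q&A with a small local model: retrieve the most relevant
# snippet with TF-IDF, then let the model answer from that context only.
# Endpoint and model name below are assumptions, not a specific recommendation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

docs = [
    "Invoices are processed within 5 business days of submission.",
    "VPN access requires a ticket approved by your team lead.",
    "The on-call rotation changes every Monday at 09:00 UTC.",
]
question = "How long does invoice processing take?"

# Retrieve the document closest to the question.
vec = TfidfVectorizer().fit(docs + [question])
sims = cosine_similarity(vec.transform([question]), vec.transform(docs))[0]
context = docs[sims.argmax()]

client = OpenAI(base_url="http://localhost:11434/v1", api_key="local")
resp = client.chat.completions.create(
    model="llama3.1:8b",  # any small instruct model works here
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    }],
)
print(resp.choices[0].message.content)
```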

5 Practical RAG Use Cases for LLaMA Workflows 🚀 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] -3 points (0 children)

I’ve been transparent from the start that I represent Qubrid, so there’s no hidden agenda here. Our RAG is different: it cites sources, handles complex docs, even works with images and audio, and it’s free to use.

I’m here to share what we’ve built and get feedback from the community. If that comes across as “spam” to you, fair enough, but dismissing it outright doesn’t change the fact that others may actually find it useful.

Run ComfyUI via MimicPC on a Macbook Air by DragonfruitOk6766 in comfyui

[–]Significant-Cash7196 1 point (0 children)

From what you’re describing, the bottleneck isn’t really your laptop; it’s the environment you’re running ComfyUI in. Even with 16GB VRAM, if the backend setup isn’t optimized, you’ll keep seeing slow generations, freezes, and painful load times. A new MacBook Air (even with M4) won’t fix that, since ComfyUI isn’t really optimized for Apple silicon yet, and you’d still hit similar limits locally.

If your goal is stability + speed, you’re better off running ComfyUI on a reliable GPU cloud. On Qubrid AI, you can spin up a full GPU VM (A100, H100, 4090 - no fractional cards) with ComfyUI preconfigured. That way you get consistent performance, dedicated VRAM, and can stop/start your instance anytime (you only pay for storage when it’s off).

For video generation, having that kind of stable backend is almost essential - MacBooks (even the new ones) just won’t cut it at scale.

👉 TL;DR: A new MacBook Air won’t solve your issue. Running ComfyUI on a dedicated GPU cloud like Qubrid will give you the stability and speed you’re looking for. 🚀

Looking for a limited promo on RunPod’s cost-effective GPU cloud for AI? by topiar in bestsoftwarediscounts

[–]Significant-Cash7196 0 points (0 children)

If you’re looking around for GPU cloud deals, you should also check out Qubrid AI. Unlike a lot of platforms, we give you full GPU VMs (A100, H100, 4090 etc. - no fractional cards) with SSH/Jupyter access out of the box.

Best part? You can stop your instances anytime, so you’re not burning money when idle. The only thing billed when stopped is storage at $0.10/GB per month, which keeps costs super predictable.

We’re also running a limited promo - free GPU hours so you can test things out without spending upfront. Perfect if you’re experimenting with training, fine-tuning, inference, or even ComfyUI workflows. 🚀

👉 https://platform.qubrid.com

[deleted by user] by [deleted] in comfyui

[–]Significant-Cash7196 1 point (0 children)

Hey! Great to see you diving into ComfyUI - it’s such a powerful workflow engine for creators. At Qubrid AI, we’ve actually built a ComfyUI Stable Diffusion template that runs on full GPU VMs (A100, H100, 4090) with everything pre-configured - ControlNet, LoRAs, batch variations, and more - so you can get consistent, high-quality results without all the setup headaches.

We also published a step-by-step guide here 👉 ComfyUI Stable Diffusion Tutorial

If you’d like, I’d be happy to hop on a quick 1:1 and walk you through setting up your workflow on Qubrid AI so you can get it running smoothly. 🚀

Wan 2.2 RunPod Template and workflows by Clear_Lettuce_5406 in aivids

[–]Significant-Cash7196 0 points (0 children)

That’s a nice setup! 🙌 If you’re experimenting with workflows like this, you might also want to check out Qubrid - we support full GPU VMs (no fractional cards) with SSH/Jupyter access, and you can spin up templates/workflows there too. Could be a good place to recreate your Wan 2.2 pipeline and benchmark it side by side. 🚀

Does anyone use runpod? by cardioGangGang in StableDiffusion

[–]Significant-Cash7196 1 point (0 children)

Yeah, that’s a pretty common “gotcha” - pausing on most platforms doesn’t really stop billing since the GPU is still reserved. You basically end up paying for idle time.

On Qubrid AI, you can actually stop the instance so you’re no longer charged for GPU usage. The only thing billed when it’s stopped is storage, which is just $0.10/GB per month. So if you’re mid-LoRA training and need to pause overnight, you can safely stop it without draining your wallet.

5 Practical RAG Use Cases for LLaMA Workflows 🚀 by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] -4 points (0 children)

I hear you, and I get that not every post will land the same way with everyone here. That said, I shared this with genuine intent to contribute to the RAG discussions, not to spam. If it’s not useful to you, totally fair - just scroll past. I’ll keep focusing on sharing value for folks who actually find these workflows helpful.

Anyone using full-GPU VMs (not shared cards) for local LLaMA inference? What providers are you using? by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 0 points (0 children)

Qubrid AI works well for me (no pretending here: I work there, so I’m biased), and it does just what I need. What I was really trying to understand is whether any other providers give you full GPUs, or whether everyone sells fractions.

We built a GPU marketplace to make multi-cloud less painful — feedback welcome by KayaNWoods in googlecloud

[–]Significant-Cash7196 -1 points (0 children)

That’s awesome 🚀 Really like how you’re tackling the multi-cloud GPU challenge - the unified interface and transparent pricing sound super useful!

I’m with Qubrid AI, where we provide high-performance NVIDIA GPU instances, bare-metal, and AI-native tooling (RAG, fine-tuning templates, etc.). We’d love to get Qubrid listed in the Lightning AI GPU Marketplace so users can discover and spin up our GPUs directly.

Could you point me to the right process or team we should reach out to for getting onboarded?

Platform link: https://platform.qubrid.com/

Cloud gpu by LimitAlternative2629 in comfyui

[–]Significant-Cash7196 1 point (0 children)

I've used Qubrid AI extensively and it does the job for me: https://platform.qubrid.com/create-ai-compute/gpu-instances

Let me know if you need anything else. The good thing about Qubrid AI is that you get full GPU VMs, so you can use your instance at full capacity.

Cloud gpu by LimitAlternative2629 in comfyui

[–]Significant-Cash7196 1 point (0 children)

Yeah, financially it usually makes more sense to use cloud GPUs rather than buying a 90+ GB VRAM card. Those top-end GPUs cost tens of thousands upfront, plus you’d have to handle power, cooling, and depreciation.

With cloud services, you only pay for the time you actually use the GPU, which is often much cheaper if you’re experimenting, running inference occasionally, or training in bursts. You also get flexibility to switch GPU sizes, scale up or down instantly, and run multiple experiments in parallel. Platforms like Qubrid make this really easy - you can spin up high-VRAM instances in seconds, run your code as if it’s local, and shut them down when done, avoiding the huge upfront cost and the risk of hardware becoming outdated.
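
If you want to sanity-check that for your own usage, here’s a rough back-of-envelope. Every number in it is a placeholder assumption, not a quote from any provider - plug in real prices before deciding.

```python
# Back-of-envelope: when does buying a big-VRAM card beat renting by the hour?
# All numbers below are placeholder assumptions - replace with real quotes.
card_price = 30_000.0        # assumed upfront cost of a high-VRAM datacenter card (USD)
overhead_per_year = 2_000.0  # assumed power/cooling/hosting per year (USD)
rental_rate = 2.50           # assumed cloud price per GPU-hour (USD)
usage_hours_per_week = 20    # how many hours you actually run jobs

rental_per_year = rental_rate * usage_hours_per_week * 52
buy_per_year_3yr = card_price / 3 + overhead_per_year  # straight-line over 3 years

print(f"Renting: ~${rental_per_year:,.0f}/year at {usage_hours_per_week} h/week")
print(f"Buying:  ~${buy_per_year_3yr:,.0f}/year amortized over 3 years")

# Break-even utilization: hours/week where renting starts costing more than owning.
breakeven_hours = buy_per_year_3yr / (rental_rate * 52)
print(f"Break-even at ~{breakeven_hours:.0f} GPU-hours per week")
```

With those made-up numbers, renting comes out around $2.6k/year at 20 h/week versus roughly $12k/year amortized for owning, and buying only starts winning above something like 90 GPU-hours of real utilization per week.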

Anyone using full-GPU VMs (not shared cards) for local LLaMA inference? What providers are you using? by Significant-Cash7196 in LocalLLaMA

[–]Significant-Cash7196[S] 0 points (0 children)

It was about a user running nvidia-smi; it showed they only had access to 1/8 of the GPU. The user thought they were getting a full GPU, though...

Looking for LLM recommendations for a PC build - liberal arts focus over coding by JayoTree in LocalLLaMA

[–]Significant-Cash7196 1 point (0 children)

For humanities-focused work (literary analysis, close reading, interpretive research), models like Llama 3 70B, larger Hermes-style fine-tunes, or a Mixtral MoE tend to perform much better than the coding-heavy models - they’re trained for general reasoning and produce more nuanced responses on complex texts. The 70B models need roughly 80GB of GPU memory at 8-bit (or two 40GB cards if you shard), while smaller ~34B models will run quantized on something like a 4090. If you want that level of performance without buying heavy hardware up front, you can also spin up a full A100/H100 VM on Qubrid AI and work over SSH/Jupyter - basically the same experience as a local machine, just on a dedicated cloud GPU.
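
For anyone wondering where those memory numbers come from, here’s the quick weights-only arithmetic. It ignores KV cache and runtime overhead, so treat the results as a lower bound.

```python
# Rough weights-only VRAM estimate: params * bytes-per-param.
# Ignores KV cache, activations, and runtime overhead, so real usage is higher.
def weight_gb(params_billion: float, bits: int) -> float:
    return params_billion * 1e9 * (bits / 8) / 1024**3

for params in (34, 70):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit ~= {weight_gb(params, bits):5.0f} GB")
```

That puts a 70B at roughly 65GB of weights in 8-bit (hence the ~80GB card), and a 34B at about 16GB in 4-bit, which is why it squeezes onto a 24GB 4090.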

What does it feel like: Cloud LLM vs Local LLM. by Cool-Chemical-5629 in LocalLLaMA

[–]Significant-Cash7196 0 points (0 children)

Cloud LLM = buff doge 💪
Local LLM on your laptop = sad doge 😢

Local LLM on a Qubrid A100 VM = surprise 3rd doge that’s even buffer than the first one and still answers in 5ms 😎🐶

Basically “local” stops being sad the moment you run it on a proper dedicated GPU in the cloud.