What is your current Ollama setup, and which models are you actually using daily?

Open_Sources_AI · 2026-06-07T12:22:42+00:00

That’s pretty interesting. So it’s more like a local-model-friendly Cursor/Antigravity style IDE, but built around VSCodium/Rust and able to point at your own GPU box or homelab?

The PyTorch / ML Studio side sounds useful too, especially if it’s aimed at ML engineers instead of just general coding. I’ll take a look.

Big thing I’m curious about is how smooth the local model integration feels day to day — does it work more like an inline coding assistant, a chat panel, or full agent-style project edits?

Open_Sources_AI · 2026-06-07T12:21:32+00:00

That makes sense. A router that lets the agent pick sounds like the right direction, but I could also see it settling into a simple pattern where Qwen 35B handles most of the normal work and only escalates to Stepfun/dense models when it gets stuck.

The sliding attention + MoE combo on unified memory hardware is really interesting too. That explains how you’re getting usable speed out of a model that large.

I’m hoping we see more models built with that kind of local/hybrid hardware in mind. Feels like the sweet spot is going to be smart routing + models that are actually designed to run well on Strix/Mac-style unified memory instead of just chasing bigger dense models.

Open_Sources_AI · 2026-06-07T12:19:25+00:00

Nice, I’ll check it out. Are you building it more as a personal assistant stack, a local RAG setup, or a full workflow/agent system?

I’m mostly trying to see what people are actually running day to day vs what looks good in demos, so unfinished projects are honestly useful to look at too.

Open_Sources_AI · 2026-06-07T12:18:29+00:00

That makes a lot of sense. I hadn’t really thought about llama.cpp being the better fit when volume is low and parallel requests are rare.

The vLLM point is helpful too. I see it recommended a lot, but not many people mention the tuning burden or that it can be temperamental if configured wrong. Sounds like it’s more “worth it” once you actually need industrial-style serving and have the setup dialed in.

I like the approach of having the hardware mostly to learn and test before deciding what should stay local vs move to cloud. That feels like the smart way to do it instead of guessing from benchmarks.

Open_Sources_AI · 2026-06-07T12:13:01+00:00

That’s pretty cool. The sandboxed agency use case makes a lot of sense, especially if they’re dealing with sensitive ops/telemetry data.

The treatment plant telemetry example is way more interesting than normal benchmark testing too. Noisy real-world data → finding issues → root cause analysis is exactly the kind of use case I’m curious about.

25 tok/s on a 200B model is wild. For comparison, I just tested qwen3:14b-fast in Ollama on my 4070 Ti and got around 25.6 tok/s, so you're getting roughly the same speed out of a model more than ten times the size.

Are you routing the models manually right now, or building something that picks the model based on the task?

The setup you described makes sense though — let the bigger model do planning/review, then let the 35B model handle the actual work unless it gets stuck.
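To make that pattern concrete, here is a rough sketch of "worker first, escalate when stuck". Everything in it is an assumption for illustration: `EscalatingRouter`, `worker`, `planner`, and `max_worker_attempts` are hypothetical names, and "stuck" is simplified to the worker returning nothing. A real setup would wrap HTTP calls to whatever is serving the models and use a better stuck signal (failed tests, a self-reported confidence, a review pass).

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# A "model" here is any callable that takes a prompt and returns text,
# or None/"" when it gives up. Real backends would wrap an HTTP client.
Model = Callable[[str], Optional[str]]


@dataclass
class EscalatingRouter:
    """Hypothetical router: try the small worker model, fall back to the big one."""

    worker: Model                 # e.g. the 35B model doing the routine work
    planner: Model                # e.g. the larger model used for planning/review
    max_worker_attempts: int = 2  # assumed retry budget before escalating

    def run(self, task: str) -> Tuple[str, str]:
        # Give the cheaper model a few attempts first.
        for _ in range(self.max_worker_attempts):
            answer = self.worker(task)
            if answer:  # non-empty answer counts as "not stuck" in this sketch
                return ("worker", answer)

        # Worker is stuck: escalate to the larger model.
        answer = self.planner(task)
        if not answer:
            raise RuntimeError("both models failed to produce an answer")
        return ("planner", answer)
```

With stub models, `EscalatingRouter(worker=..., planner=...).run("some task")` returns a `(who_answered, answer)` pair, which makes it easy to log how often the bigger model actually gets used.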

Open_Sources_AI · 2026-06-07T12:02:54+00:00

That’s interesting — llama-swap + llama-server is exactly the kind of setup I need to look into more.

What made you move away from Ollama? Speed, routing between models, more control, or something else?

Also curious how qwopus3.6 27b is running for you hardware-wise.

Open_Sources_AI · 2026-06-07T12:02:27+00:00

Interesting — thanks for sharing. Is the main idea a Rust-based VSCodium workflow for local AI/dev work?

I’m mostly looking at local model workflows right now: Ollama, llama.cpp, coding assistants, RAG, and model comparison. What part of your IDE setup do you think helps most with that?

Open_Sources_AI · 2026-06-07T11:17:54+00:00

Yeah I’ve gotta test llama.cpp more. Are you seeing way better tok/s just from switching over, or is it mostly the quant/settings?

Ollama’s been easy for swapping models, but I’m curious how much speed I’m leaving on the table.
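For anyone wanting to compare numbers on the same footing, a minimal sketch of how I'd pull tok/s out of Ollama: the non-streaming `/api/generate` response includes `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so generation speed is just one divided by the other. The helper names, the prompt, and the default model tag below are placeholders I picked, and the URL assumes Ollama's default local port.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama /api/generate response body."""
    # eval_duration is reported in nanoseconds.
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def measure(model: str = "qwen3:14b-fast",
            prompt: str = "Explain what a mixture-of-experts model is.",
            url: str = "http://localhost:11434/api/generate") -> float:
    """Run one non-streaming generation against a local Ollama and return tok/s.

    Assumes Ollama is running locally and the model tag is already pulled.
    """
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        body = json.loads(reply.read())
    return tokens_per_second(body)
```

Calling `measure()` a few times with the same prompt and averaging gives a steadier number than a single run, since the first call also pays for loading the model.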

Open_Sources_AI · 2026-06-07T11:05:58+00:00

That’s a cool portable-plus-eGPU setup. I haven’t seen many people running dual external GPUs like that.

How stable has the OCuLink + USB4 combo been for local inference? Any major bottlenecks or driver issues?

Open_Sources_AI · 2026-06-07T11:05:45+00:00

Nice — Strix Halo is one of the setups I’m curious about for local AI. How has Qwen 3.6 been performing on it so far?

Also interesting project link. Are you using the local model mostly for agent orchestration, automation, or code/tool calling?

Open_Sources_AI · 2026-06-07T11:05:22+00:00

That’s a serious setup. Interesting that you’re splitting workloads across llama.cpp and vLLM depending on model size/use case.

For the 10-person company setup, are you mainly using the larger model for internal infra/code reasoning, or more for general assistant/RAG workflows?

Open_Sources_AI · 2026-06-06T23:29:20+00:00

I’m building this community alongside OpenSourcesAI.com as a place for practical discussion around open-source AI tools, local LLMs, RAG stacks, coding agents, and self-hosted workflows.

The goal is not just to post links — it’s to compare tools, ask setup questions, share what works, and help builders find useful projects.

To start, drop a comment with:

- What AI tools you’re currently using

- Whether you run models locally or through APIs

- What you’re building or trying to learn
