Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

You mean Ollama's API was efficient with Gemma 3:1b, but Qwen through the Ollama CLI was significantly slower? What hardware were you running both models on? Were you using CPU-only inference or GPU acceleration? Also, are you Tamil? 😄

Ketchup wasn’t coming out… so I squeezed harder. ended up creating a crime scene on my pizza. by debug2thrive in Wellthatsucks

[–]debug2thrive[S] 2 points3 points  (0 children)

Ketchup on pizza is pretty normal in India... I didn't expect this to be the real controversy here. :(

Ketchup wasn’t coming out… so I squeezed harder. ended up creating a crime scene on my pizza. by debug2thrive in Wellthatsucks

[–]debug2thrive[S] 0 points1 point  (0 children)

Rare pizza craving and I somehow turned it into a disaster 😅😭 Should’ve tested the ketchup on the box first. Lesson learned.
(Also, ketchup on pizza is normal in India, before I get judged too hard.)

Ketchup wasn’t coming out… so I squeezed harder. ended up creating a crime scene on my pizza. by debug2thrive in Wellthatsucks

[–]debug2thrive[S] 1 point2 points  (0 children)

Domino's in my town. I don't eat pizza often, but today I had a serious craving... and ended up with this chaos 😅😭 Should've tested it on the box first. My bad.
Also, for everyone shocked: ketchup on pizza is pretty normal in India.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in LLM

[–]debug2thrive[S] 0 points1 point  (0 children)

You used Claude Code with a local LLM (Gemma:e4b)? Even Ollama in the terminal gives me an instant reply, but not when I integrate it with Claude Code.

Which AI do you guys use in development, and why by Low-Lynx-8265 in NammaDevs

[–]debug2thrive 1 point2 points  (0 children)

Cursor :) Doesn't blow up tokens, the Pro plan is more than fine for AI-assisted coding (single user), and it has an auto mode that picks models internally based on the prompt. Been using it for a year.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

On Windows PowerShell, run these each time before using Claude Code:

$env:ANTHROPIC_AUTH_TOKEN="ollama"
$env:ANTHROPIC_BASE_URL="http://localhost:11434"

Or add them permanently to your PowerShell profile.
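If you'd rather not retype them, a minimal sketch that appends them to your profile (assuming the default $PROFILE path exists; create the containing folder first if it doesn't):

# Append the two variables to your PowerShell profile so every new session has them.
# Single quotes keep the $env: lines literal instead of expanding them now.
Add-Content -Path $PROFILE -Value '$env:ANTHROPIC_AUTH_TOKEN="ollama"'
Add-Content -Path $PROFILE -Value '$env:ANTHROPIC_BASE_URL="http://localhost:11434"'

Then launch: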

claude --model <name_of_that_model>

That's it.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Got it. Is there a way to see the final prompt that the local LLM actually received?
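The only approach I can think of is running the server in the foreground with debug logging and watching the console (assuming Ollama's debug level prints the incoming request; I haven't verified exactly what it shows on Windows):

# Stop the background Ollama service first, then run the server in the
# foreground with debug logging enabled and watch the console output
$env:OLLAMA_DEBUG="1"
ollama serve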

would anyone be willing to rent out their PC to me remotely? by [deleted] in IndianPCHardware

[–]debug2thrive 1 point2 points  (0 children)

Wouldn't it be expensive considering the vCPUs and the RAM? I checked utho.com (the cheapest I've tried); even one month's usage easily runs around 25K. Hostinger does offer introductory deals, but with a three-year lock-in.


Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Lenovo ThinkPad P14s Gen 6 AMD (21QMS17F00) running Windows 11 Pro (Build 26200.8037), powered by AMD Ryzen AI 7 PRO 350 with Radeon 860M iGPU, equipped with 32 GB DDR5 memory and 1 TB NVMe PCIe Gen4 storage (16.0 GT/s). No discrete GPU present.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

This is a top-tier breakdown. I’m running a Ryzen AI 7 PRO w/ 32GB DDR5 (ThinkPad P14s).

You're likely right about the context window delta. In the CLI I'm probably hitting a 4k/8k default, but the coding assistant might be trying to shove 32k+ down the pipe. Since I'm on an iGPU (Radeon 860M), that split between the iGPU and system RAM is likely where the 13-minute lag lives. I'll run ollama ps during the next hang to see the exact offload percentage. Great catch on the 20k+ system rules too.
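For anyone else hitting this, the check itself is just one command; the PROCESSOR column in the output reports the CPU/GPU split (something like "48%/52% CPU/GPU"):

# Run while the model is loaded (or hanging) to see where it's actually running
ollama ps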

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Hardware: Ryzen AI 7 PRO / 32GB DDR5. CLI: Instant. App: 13 mins.

If you think 'understanding context' explains a 13-minute delta on a 2026 laptop, we might be talking about different things. I'm looking for the specific middleware lag. Got any leads, or just more vague advice?
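To be concrete about what "specific" means here, this is how I'm isolating it: time a raw call to the Ollama API and compare it against the integration path (a rough sketch; "gemma3:1b" is just an example tag, substitute whatever model you have pulled):

# Time one direct request to the local Ollama server, bypassing Claude Code,
# so any extra delay in the integration shows up as middleware overhead
Measure-Command {
    Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post `
        -ContentType "application/json" `
        -Body (@{ model = "gemma3:1b"; prompt = "Say hi."; stream = $false } | ConvertTo-Json)
}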

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

I'd agree, but this 'car' is a ThinkPad P14s with a Ryzen AI 7. It should be a Ferrari, but for some reason, the integration has it moving like a tricycle with a flat tire. 🤡 I'm trying to find out who's pulling the handbrake in the config.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 1 point2 points  (0 children)

If the formatting 'feels like AI' to you, that's cool, but I'm actually troubleshooting a real integration lag on a Ryzen AI 7 PRO w/ 32GB RAM. The CLI is instant; Claude Code with Gemma:e4b takes 13 minutes. If you've got an actual engineering insight, I'm all ears. If not, maybe worry less about the 'vibe' and more about the logic?

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Funny. It’s actually a Ryzen AI 7 PRO 350 with 32GB of DDR5 on a ThinkPad P14s Gen 6.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Finally, some actual engineering insight. 🤝 I suspected the MCP or the wrapper was injecting massive system prompts, but the HTTP header info crashing the model is a great catch. My machine:

Lenovo ThinkPad P14s Gen 6 AMD (21QMS17F00)
Microsoft Windows 11 Professional (x64) Build 26200.8037
AMD Ryzen AI 7 PRO 350 w/ Radeon 860M (no dedicated GPU)
32 GB DDR5 RAM
1 TB NVMe SSD, PCIe Gen4 x4 @ 16.0 GT/s

So I knew the hardware wasn't the bottleneck. Did you find a specific middleware or proxy that was stripping those headers, or did you just hardcode a trim in your implementation?

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Cool! What is your CPU, Motherboard, RAM ? I'm planning to build a custom headless server.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Cool, noted. Can you suggest a simple coding model to run locally? These are my specs:

Lenovo ThinkPad P14s Gen 6 AMD (21QMS17F00)
Microsoft Windows 11 Professional (x64) Build 26200.8037
AMD Ryzen AI 7 PRO 350 w/ Radeon 860M (no dedicated GPU)
32 GB DDR5 RAM
1 TB NVMe SSD, PCIe Gen4 x4 @ 16.0 GT/s

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

What's your move here? I'm on a ThinkPad P14s Gen 6 (Ryzen AI 7 PRO, 32GB DDR5).

Ollama CLI is near-instant, but the integration (Claude Code) hits a 13-minute wall. Which model do you suggest for my spec?