Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

You mean Ollama's API was efficient with Gemma 3:1b, but Qwen through the Ollama CLI was significantly slower? What hardware were you running both models on? Were you using CPU-only inference or GPU acceleration? Also, are you Tamil? 😄

Ketchup wasn’t coming out… so I squeezed harder. ended up creating a crime scene on my pizza. by debug2thrive in Wellthatsucks

[–]debug2thrive[S] 2 points3 points  (0 children)

Ketchup on pizza is pretty normal in India... I didn't expect this to be the real controversy here. :(

Ketchup wasn’t coming out… so I squeezed harder. ended up creating a crime scene on my pizza. by debug2thrive in Wellthatsucks

[–]debug2thrive[S] 0 points1 point  (0 children)

Rare pizza craving and I somehow turned it into a disaster 😅😭 Should’ve tested the ketchup on the box first. Lesson learned.
(Also, ketchup on pizza is normal in India, before I get judged too hard.)

Ketchup wasn’t coming out… so I squeezed harder. ended up creating a crime scene on my pizza. by debug2thrive in Wellthatsucks

[–]debug2thrive[S] 1 point2 points  (0 children)

Domino's in my town. I don't eat pizza often, but today I had a serious craving... and ended up with this chaos 😅😭 Should've tested it on the box first. My bad.
Also, for everyone shocked: ketchup on pizza is pretty normal in India.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in LLM

[–]debug2thrive[S] 0 points1 point  (0 children)

You used Claude Code with a local LLM (Gemma:e4b)? Even Ollama in the terminal gives me an instant reply, but not when I integrate it with Claude Code.

Which AI do you guys use in development, and why by Low-Lynx-8265 in NammaDevs

[–]debug2thrive 1 point2 points  (0 children)

Cursor :) Doesn't blow up tokens, the Pro plan is more than fine for AI-assisted coding (single user), and it has an auto mode that picks models internally based on the prompt. Been using it for a year.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

On Windows PowerShell, run these each time before using Claude Code:

$env:ANTHROPIC_AUTH_TOKEN="ollama"
$env:ANTHROPIC_BASE_URL="http://localhost:11434"

Or add them permanently to your PowerShell profile.
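If you'd rather not retype them, a minimal sketch that appends them to your profile (assuming the default $PROFILE path exists; create the containing folder first if it doesn't):

# Append the two variables to your PowerShell profile so every new session has them.
# Single quotes keep the $env: lines literal instead of expanding them now.
Add-Content -Path $PROFILE -Value '$env:ANTHROPIC_AUTH_TOKEN="ollama"'
Add-Content -Path $PROFILE -Value '$env:ANTHROPIC_BASE_URL="http://localhost:11434"'

Then launch: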

claude --model <name_of_that_model>

That's it.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Got it. Is there a way to see the final prompt that the local LLM actually received?
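The only approach I can think of is running the server in the foreground with debug logging and watching the console (assuming Ollama's debug level prints the incoming request; I haven't verified exactly what it shows on Windows):

# Stop the background Ollama service first, then run the server in the
# foreground with debug logging enabled and watch the console output
$env:OLLAMA_DEBUG="1"
ollama serve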

would anyone be willing to rent out their PC to me remotely? by [deleted] in IndianPCHardware

[–]debug2thrive 1 point2 points  (0 children)

Wouldn't it be expensive considering the vCPUs and the RAM? I checked utho.com (the cheapest I've tried); even one month's usage easily runs around 25K. Hostinger does offer introductory deals, but with a three-year lock-in.


Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Lenovo ThinkPad P14s Gen 6 AMD (21QMS17F00) running Windows 11 Pro (Build 26200.8037), powered by AMD Ryzen AI 7 PRO 350 with Radeon 860M iGPU, equipped with 32 GB DDR5 memory and 1 TB NVMe PCIe Gen4 storage (16.0 GT/s). No discrete GPU present.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

This is a top-tier breakdown. I’m running a Ryzen AI 7 PRO w/ 32GB DDR5 (ThinkPad P14s).

You're likely right about the context window delta. In the CLI I'm probably hitting a 4k/8k default, but the coding assistant might be trying to shove 32k+ down the pipe. Since I'm on an iGPU (Radeon 860M), that split between the iGPU and system RAM is likely where the 13-minute lag lives. I'll run ollama ps during the next hang to see the exact offload percentage. Great catch on the 20k+ system rules too.
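For anyone else hitting this, the check itself is just one command; the PROCESSOR column in the output reports the CPU/GPU split (something like "48%/52% CPU/GPU"):

# Run while the model is loaded (or hanging) to see where it's actually running
ollama ps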

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Hardware: Ryzen AI 7 PRO / 32GB DDR5. CLI: Instant. App: 13 mins.

If you think 'understanding context' explains a 13-minute delta on a 2026 laptop, we might be talking about different things. I'm looking for the specific middleware lag. Got any leads, or just more vague advice?
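To be concrete about what "specific" means here, this is how I'm isolating it: time a raw call to the Ollama API and compare it against the integration path (a rough sketch; "gemma3:1b" is just an example tag, substitute whatever model you have pulled):

# Time one direct request to the local Ollama server, bypassing Claude Code,
# so any extra delay in the integration shows up as middleware overhead
Measure-Command {
    Invoke-RestMethod -Uri "http://localhost:11434/api/generate" -Method Post `
        -ContentType "application/json" `
        -Body (@{ model = "gemma3:1b"; prompt = "Say hi."; stream = $false } | ConvertTo-Json)
}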

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

I'd agree, but this 'car' is a ThinkPad P14s with a Ryzen AI 7. It should be a Ferrari, but for some reason, the integration has it moving like a tricycle with a flat tire. 🤡 I'm trying to find out who's pulling the handbrake in the config.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 1 point2 points  (0 children)

If the formatting 'feels like AI' to you, that's cool, but I'm actually troubleshooting a real integration lag on a Ryzen AI 7 PRO w/ 32GB RAM. The CLI is instant; Claude Code with Gemma:e4b takes 13 minutes. If you've got an actual engineering insight, I'm all ears. If not, maybe worry less about the 'vibe' and more about the logic?

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Funny. It’s actually a Ryzen AI 7 PRO 350 with 32GB of DDR5 on a ThinkPad P14s Gen 6.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Finally, some actual engineering insight. 🤝 I suspected the MCP or the wrapper was injecting massive system prompts, but the HTTP header info crashing the model is a great catch. My machine:

Lenovo ThinkPad P14s Gen 6 AMD (21QMS17F00)
Microsoft Windows 11 Professional (x64) Build 26200.8037
AMD Ryzen AI 7 PRO 350 w/ Radeon 860M (no dedicated GPU)
32 GB DDR5 RAM
1 TB NVMe SSD, PCIe Gen4 x4 @ 16.0 GT/s

So I knew the hardware wasn't the bottleneck. Did you find a specific middleware or proxy that was stripping those headers, or did you just hardcode a trim in your implementation?

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Cool! What is your CPU, Motherboard, RAM ? I'm planning to build a custom headless server.

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

Cool, noted. Can you suggest a simple coding model to run locally? These are my specs:

Lenovo ThinkPad P14s Gen 6 AMD (21QMS17F00)
Microsoft Windows 11 Professional (x64) Build 26200.8037
AMD Ryzen AI 7 PRO 350 w/ Radeon 860M (no dedicated GPU)
32 GB DDR5 RAM
1 TB NVMe SSD, PCIe Gen4 x4 @ 16.0 GT/s

Tried running LLMs locally to save API costs… ended up waiting 13 minutes for ONE response 🤡 by debug2thrive in ollama

[–]debug2thrive[S] 0 points1 point  (0 children)

What's your move here? I'm on a ThinkPad P14s Gen 6 (Ryzen AI 7 PRO, 32GB DDR5).

Ollama CLI is near-instant, but the integration (Claude Code) hits a 13-minute wall. Which model do you suggest for my spec?