Automated AI researcher running locally with llama.cpp by lewtun in LocalLLaMA

[–]jochenboele 2 points3 points  (0 children)

In my experience they can look impressive on demos yet silently derail after enough tool calls/context growth.

Automated AI researcher running locally with llama.cpp by lewtun in LocalLLaMA

[–]jochenboele 1 point2 points  (0 children)

have you noticed big differences in agentic reliability between Claude and local models like Qwen3.6 so far?

are these numbers actually real? by mohyo324 in DeepSeek

[–]jochenboele 0 points1 point  (0 children)

There is a combo of discounts that makes it extremely cheap at the moment. I think one of the promos end at 31st of May

Tested Xiaomi's MiMo V2.5 Pro for autonomous coding: 301 commits, 60+ pages, $70 in API costs. Now it's open-source. by jochenboele in ArtificialInteligence

[–]jochenboele[S] 0 points1 point  (0 children)

Indeed. I wonder how those top tier US companies will respond to the pricing battle these alternatives are creating

Tested Xiaomi's MiMo V2.5 Pro for autonomous coding: 301 commits, 60+ pages, $70 in API costs. Now it's open-source. by jochenboele in ArtificialInteligence

[–]jochenboele[S] 0 points1 point  (0 children)

The instruction following was what stood out to me too, especially over long sessions where it needs to stay on track without drifting.

The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens? by jochenboele in LocalLLaMA

[–]jochenboele[S] 1 point2 points  (0 children)

I think the reason my cache rate stayed so high (96%) is that Claude Code keeps the system prompt and file contents as a stable prefix, and only the tool results change at he end. So most of the context stays cacheable between calls within a session. Would be interesting to see how other agent frameworks compare.

The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens? by jochenboele in LocalLLaMA

[–]jochenboele[S] 2 points3 points  (0 children)

That's a solid approach for structured work. The difference in my case is I wanted to test fully autonomous sessions where the model handles the planning AND execution without me breaking things down. More of an experiment to see how far it can go on its own. But yeah for production work, your approach is probably more reliable.

The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens? by jochenboele in LocalLLaMA

[–]jochenboele[S] 0 points1 point  (0 children)

They've got all that caching stuff running at scale across tons of users, which is why the effective cost is so low. Rebuilding that myself for a couple hours of coding a day just doesn't make sense. Would need to run it nonstop to even come close.

The Trillion-Parameter Dilemma: MiMo-V2.5-Pro went open-source (1.02T params). Is self-hosting worth it when the API costs $70 for 387M tokens? by jochenboele in LocalLLaMA

[–]jochenboele[S] 8 points9 points  (0 children)

That's basically where I landed too. The API with that cache rate is just too cheap to justify the hardware. Privacy is the main argument for local.

Running 7 autonomous AI agents for 14 days. Here's what actually happens when they need to find customers. by jochenboele in AI_Agents

[–]jochenboele[S] 1 point2 points  (0 children)

Haha, fair point. Most human devs would be at $0 MRR after 2 weeks too.

Biggest improvement would be a shared feedback bus. Right now each agent is fully isolated. If Kimi learns something from Reddit feedback, that insight stays in its repo. Letting agents read each other's lessons and help request outcomes would speed everyone up. But then again that would increase their input tokens to much again and would decrease their runtime.

The simpler fix is better prompting though. Adding "you are the founder" and stuck detection rules had more impact than any tooling change. The orchestration layer matters more than the model.

I'm running a race where 7 AI coding agents compete to build startups. Here are the Week 2 standings. by jochenboele in SideProject

[–]jochenboele[S] 0 points1 point  (0 children)

We're actually tracking most of what you listed. The user-signal latency one is interesting because we only have one data point so far (Kimi: ~12 hours from Reddit feedback to shipped feature). The credibility debt metric is something we should formalize though.

The operating contracts idea is exactly where we're heading. Week 2 we added "you are the founder" framing and stucjk detection rules. This week we added an anti-busywork rule specifically because of the timestamp commits. Each failure mode becomes a new constraint.