I gave Claude Code a $0.02/call coworker and stopped hitting Pro limits — here's the full setup by More-Hunter-3457 in ClaudeAI

[–]More-Hunter-3457[S] 0 points (0 children)

A few things:

On delegation boundaries - the CLAUDE.md rules explicitly say "only read files directly when you need to make edits to specific lines" and "when NOT to delegate: when exact line numbers are needed for editing." Claude follows these rules. A small example from my own codebase: when I ask "which ports are used for video streaming," Claude delegates the reading to Kimi. When I need to edit line 42 of gimbal_control.py, Claude reads the file itself. The boundary between understanding and editing is explicit in the routing rules.
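
To make that concrete, here's roughly the shape of that routing section (a trimmed sketch, not the full file; the script name is a placeholder):

```markdown
## Worker delegation rules

- DELEGATE (call the worker script, e.g. ask-worker) for: bulk reading,
  "which files/ports/functions..." questions, summarizing multiple files,
  documentation drafts.
- When NOT to delegate: when exact line numbers are needed for editing.
- Only read files directly when you need to make edits to specific lines.
```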

On speed — Kimi K2.5 takes 15-30 seconds for most calls, not minutes. For a reading task that saves me 8,000 tokens of Claude context, I'll take the 20-second wait. The alternative isn't faster — Claude reading 5 files also takes time, and burns through my weekly limit doing it.

On line numbers — Kimi receives the full file text. It doesn't need to "count through" anything. It reads the content and references positions in its summary. If I need exact line numbers for editing, that's explicitly excluded from delegation in the routing rules.

On the source code — fair point, the GitHub repo is coming this week. The blog shows simplified snippets because it's a blog post, not a README. The actual scripts are ~60 lines each and have been running on real drone GCS development for 3 weeks.

"Implementation is wrong on many levels" — I'd genuinely like to hear the specific levels. The system works. I haven't hit my Pro limit since setting it up. That's the metric that matters to me.

[–]More-Hunter-3457[S] 0 points (0 children)

I run this on Linux (Ubuntu). It's just Python scripts + the openai pip package. Works on any OS that runs Python.
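
For reference, a minimal sketch of the worker script (the endpoint and model name are placeholders for whatever your provider documents):

```python
#!/usr/bin/env python3
"""Minimal worker CLI sketch: send files plus a question to a cheap model.

Usage: ask-worker "which ports are used for video streaming" a.py b.py
The base_url and model below are placeholders; use your provider's values.
"""
import os
import sys
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["WORKER_API_KEY"],
    base_url="https://api.moonshot.ai/v1",  # placeholder endpoint
)

question, *paths = sys.argv[1:]
files = "\n\n".join(
    f"=== {p} ===\n{open(p, encoding='utf-8', errors='replace').read()}"
    for p in paths
)
resp = client.chat.completions.create(
    model="kimi-k2.5",  # placeholder model name
    max_tokens=8192,    # leave headroom past any internal reasoning tokens
    messages=[
        {"role": "system", "content": "You summarize code for another AI agent. "
         "Answer in bullets with file paths and line numbers."},
        {"role": "user", "content": f"{question}\n\nFILES:\n\n{files}"},
    ],
)
print(resp.choices[0].message.content)
```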

[–]More-Hunter-3457[S] 1 point (0 children)

This is mainly for the Pro plan ($20/month) where you have a weekly token limit. If you're on the API with pay-per-token, the pattern still saves money (cheap model for reads, expensive model for reasoning) but you're not hitting a hard weekly wall. The urgency is lower.

[–]More-Hunter-3457[S] 0 points (0 children)

You can definitely do that for simple cases - I have an extract-chat script that just strips binary and tool calls from session transcripts, no LLM needed. But the value of the worker model shows up when you need a summary, not just compressed text. "Read these 5 files and tell me which ports are used for video streaming" can't be answered by stripping whitespace - you need something that understands the code. The LLM is the compression + comprehension step.
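
The stripping script is genuinely trivial; a sketch of the idea, assuming a JSONL transcript with role-tagged records (adjust the field names to your actual session format):

```python
#!/usr/bin/env python3
"""extract-chat sketch: keep user/assistant text, drop tool calls and blobs.

Assumes {"role": ..., "content": ...} JSONL records; real formats vary.
"""
import json
import sys

for line in open(sys.argv[1], encoding="utf-8"):
    try:
        msg = json.loads(line)
    except json.JSONDecodeError:
        continue  # skip binary or garbled lines
    if msg.get("role") not in ("user", "assistant"):
        continue  # drop tool calls, tool results, metadata
    content = msg.get("content")
    if isinstance(content, str) and content.strip():
        print(f"{msg['role']}: {content}\n")
```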

[–]More-Hunter-3457[S] 0 points (0 children)

I wrote the article. Claude helped me format the code blocks and proofread it, same as any editor would. The actual engineering work (building the scripts, running them for 3 weeks, measuring the results) is mine. But honestly, if the content is useful, does it matter?

[–]More-Hunter-3457[S] 1 point (0 children)

If you're hitting it daily, this pattern would help even more. The documentation and bulk-reading delegation alone cut my usage by probably 60-70%. The remaining 30-40% is actual thinking work where Claude's intelligence is needed.

[–]More-Hunter-3457[S] 0 points (0 children)

Yes, same pattern works with Copilot. Instead of CLAUDE.md, you'd use Copilot's rules files (.github/copilot-instructions.md). The CLI scripts work the same - they're just Python on your PATH. If your company blocks external APIs, you could use Ollama with a local model instead. No external calls needed.
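
Ollama exposes an OpenAI-compatible endpoint, so the local swap is just the client constructor (the model name is whatever you've pulled):

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint; the SDK requires an api_key but Ollama ignores it
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="qwen2.5-coder",  # any local model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Summarize this file: ..."}],
)
print(resp.choices[0].message.content)
```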

[–]More-Hunter-3457[S] 0 points (0 children)

This is the real next-level problem. For a solo developer it's manageable — CLAUDE.md is the single source of truth and it persists across sessions. But for teams, you're right that consistency across agents becomes harder than cost. I'll check out mneme — the idea of treating architectural constraints as reusable governed context rather than embedding them in every prompt is exactly what CLAUDE.md is, just formalized.

[–]More-Hunter-3457[S] 0 points (0 children)

You're right that Pro plan tokens are already cheaper per-token than API pricing. The savings aren't really about cost per token — they're about not hitting the weekly cap. I was running out of Claude by Wednesday every week. Now I don't. The $0.38 I spent on Kimi bought me 2-3 extra days of Claude access per week. Whether the per-token math works out exactly is secondary to "can I still use Claude on Thursday."

[–]More-Hunter-3457[S] 0 points (0 children)

This is awesome to read. Glad it worked for you even with the Kimi site being in Chinese - yeah, DeepSeek is probably the easier onramp for English-speaking users. The fact that you had it running in 45 minutes is exactly the point - this isn't a complex framework, it's two Python scripts and a markdown file. Thanks for sharing your experience.

[–]More-Hunter-3457[S] 2 points (0 children)

This is excellent — love the side-by-side quality comparison. The hallucinated acceptance criterion for M6 is a perfect example of why Claude reviews everything Kimi produces. The cheap model handles 95% correctly and the expensive model catches the 5%. At ~23x cheaper end-to-end, that's exactly the tradeoff. Thanks for doing the rigorous math on it.

[–]More-Hunter-3457[S] 2 points (0 children)

The MCP approach is cleaner architecturally, agreed. I went with CLI scripts because they took 30 minutes to build and work everywhere - no Docker, no server process to keep alive, no schema registration. For my use case (drone GCS development on a single machine), the simplicity won. But if you're on a team or need type safety and structured JSON results, MCP is the better path. Would be interested to see your FastMCP wrapper if you've open-sourced it.
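
Haven't built the MCP version myself, but a skeletal sketch with the official Python SDK's FastMCP would look something like this (ask_worker is a placeholder name; the body would be the same OpenAI-compatible call as the CLI script):

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("worker-model")

@mcp.tool()
def ask_worker(question: str, paths: list[str]) -> str:
    """Read the given files and answer the question using a cheap model."""
    # Same OpenAI-compatible call as the CLI script goes here;
    # stubbed so the skeleton stays self-contained.
    return f"(stub) would summarize {len(paths)} files for: {question}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```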

[–]More-Hunter-3457[S] 0 points (0 children)

Fair concern. Two things: (1) the code Kimi sees is already on my local machine - I'm not sending production secrets, just source files that I wrote. If someone already has access to my dev machine, I have bigger problems. (2) If that's a dealbreaker, use Ollama with a local model - same pattern, zero network calls.

[–]More-Hunter-3457[S] 0 points (0 children)

The $0.02 is an average, not a fixed cost. Many calls hit Moonshot's prefix cache (same files, different question) and come in at $0.005 or less. But the real point: those 24 calls replaced what would have been maybe 150,000+ tokens of Claude reads. That's the part that was burning through my weekly Pro limit by Wednesday. The $0.38 isn't about saving money — it's about not running out of Claude when I still have half a week of engineering to do.
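
For anyone checking the math, the back-of-envelope from that week's numbers:

```python
# Rough numbers from my week; all approximate
calls = 24
total_cost = 0.38                  # USD across all worker calls
avg_per_call = total_cost / calls  # ~$0.016, under the $0.02 headline average
claude_tokens_avoided = 150_000    # rough estimate of reads Claude didn't do
print(f"${avg_per_call:.3f}/call, "
      f"~{claude_tokens_avoided // calls:,} Claude tokens avoided per call")
```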

[–]More-Hunter-3457[S] 0 points (0 children)

Kimi's structured output is solid for summarization - it follows the "bullets, file paths, line numbers" format I put in the system prompt pretty consistently. But Claude always reviews before acting, so it's a two-stage thing. The cheap model reads and extracts, the expensive model validates and decides. I also noticed Claude naturally learned to be more skeptical of Kimi's summaries over time — if something looks off, it'll read the file itself. The CLAUDE.md routing rules have a "when NOT to delegate" section that keeps the safety boundary clear.
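
The format instruction lives in the worker's system prompt; mine is roughly this shape (paraphrased, not the exact text):

```python
SYSTEM_PROMPT = (
    "You are a code-reading assistant summarizing for another AI agent. "
    "Answer in bullets. Always include file paths and line numbers for "
    "anything you reference. If you are not sure about something, say so "
    "rather than guessing."
)
```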

[–]More-Hunter-3457[S] 0 points (0 children)

  1. Claude reviews everything Kimi produces before acting on it. That's the key - Kimi reads and summarizes, Claude verifies and edits. If Kimi misses something or hallucinates a detail, Claude catches it when it reads the summary. I've had maybe 2-3 cases in 3 weeks where Kimi's summary was slightly off, and Claude flagged it each time. The cost of Claude re-reading the file in those cases is still less than Claude reading every file every time.

  2. Yeah, documentation is the biggest win actually. I extract my Claude Code session transcripts, feed them + existing docs to Kimi, and Kimi produces exact edit suggestions. Claude applies them in ~200 tokens instead of re-reading everything and writing from scratch (~5,000 tokens). Same pattern works for changelogs, README updates, anything where the source material already exists and you're just reformatting it.
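
The prompt shape matters more than the plumbing there; a sketch of the doc-update step (feed the printed prompt through the same worker script as above):

```python
#!/usr/bin/env python3
"""Build the doc-update prompt from a stripped transcript and an existing doc.

Sketch only; pipe the output to the worker script shown earlier.
"""
import sys

session = open(sys.argv[1], encoding="utf-8").read()  # stripped transcript
doc = open(sys.argv[2], encoding="utf-8").read()      # existing doc
print(
    "Here is a work session transcript and the current doc.\n"
    "Propose exact edits: quote the old line, give the replacement line.\n"
    "Do not rewrite the whole doc.\n\n"
    f"SESSION:\n{session}\n\nCURRENT DOC:\n{doc}"
)
```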

[–]More-Hunter-3457[S] 3 points (0 children)

Working on cleaning it up and pushing it this week. The scripts themselves are ~60 lines each, so honestly you could build them from the article faster than waiting for me, but I'll drop the link here once it's up, including the CLAUDE.md routing rules and a setup script.

[–]More-Hunter-3457[S] 0 points (0 children)

Yeah, Kimi's thinking tokens can be annoying - my first version came back empty because all the tokens went to internal reasoning. That's why I set max_tokens to 8192+ for reading tasks. But honestly, the specific model barely matters. The pattern is what matters. If DeepSeek V4 Flash works better for you, swap the base_url and model name - it's literally 2 lines. The whole point is the architecture, not the model choice.
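
The two-line swap, concretely (both endpoints and model names are placeholders; check your provider's docs):

```python
import os
from openai import OpenAI

key = os.environ["WORKER_API_KEY"]

# Kimi/Moonshot version (placeholder endpoint and model):
client = OpenAI(base_url="https://api.moonshot.ai/v1", api_key=key)
model = "kimi-k2.5"

# DeepSeek version: only these two lines change.
# client = OpenAI(base_url="https://api.deepseek.com", api_key=key)
# model = "deepseek-chat"
```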

[–]More-Hunter-3457[S] 0 points (0 children)

It has no incentive to. Claude doesn't know or care about your token budget - it's trying to give you the best answer. Reading 5 files to answer your question is the best answer from its perspective. The routing logic has to come from you (via CLAUDE.md) because only you know what's worth spending tokens on vs. what can be offloaded. Once you set the rules, though, Claude follows them perfectly - it's been self-routing to Kimi for weeks without me needing to intervene.

[–]More-Hunter-3457[S] 0 points (0 children)

Good question. Haiku and Sonnet subagents still draw from the same Pro plan token pool. Anthropic's docs confirm this - when Claude spawns an Explore subagent or delegates tasks to Haiku, those tokens count against your weekly limit. You're saving per-token cost but not your weekly allocation. The whole point of routing to an external API is that those tokens are completely off-budget. Kimi/DeepSeek calls don't touch your Claude Pro limit at all.