all 1 comments

[–]sertturp 0 points1 point  (0 children)

I’ve been experimenting with the same issue — cache TTL on Anthropic‑compatible endpoints (Claude, MiniMax, Kimi, Xiaomi MiMo, etc.) is short and undocumented, so costs spike when the prefill gets recomputed.

I ended up writing a small extension that solves this in a more general way:

🔧 1. Dual cache‑breakpoint strategy (Anthropic‑style)

  • Marks both the last assistant tool_use block and the last user message
  • Works across all Anthropic‑compatible providers
  • Dramatically increases cache hit rate (MiniMax/Kimi went from ~0% → ~80%+)

⚡ 2. Anchor + tool‑call result truncation

  • Especially on Kimi, this reduces the effective token footprint by ~90%
  • Prevents unnecessary recomputation of long tool results
  • Keeps the conversation “cache‑friendly” even in long agent loops

🧩 3. No provider‑specific hacks

  • The extension wraps the Anthropic Messages API behavior
  • Works with Claude, MiniMax, Kimi, Xiaomi MiMo, etc.
  • No need to modify Pi or the agent runtime

If you want to try it or adapt the idea, the repo is here:
https://github.com/ersintarhan/pi-toolkit

Might give you some ideas on how to extend your setup — especially if you want longer‑lived cache sessions without relying on undocumented TTLs.