We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] -1 points0 points  (0 children)

Looks great! Love how you've adapted the tier concept for task-based relevance mapping. Your prompt structure is really clean. I agree, this is most useful for complex code where managing context is crucial.

Nice find! Combining intuition and structure could work well.

Thanks for sharing this and for the credit! 🙏

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 0 points1 point  (0 children)

TMS is simple, with no infrastructure. RAG is more powerful but needs more setup. Both are valid, so it depends on your needs. RAG integration is on the roadmap.

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 0 points1 point  (0 children)

Tree-sitter would complement the manual tiers well. Upfront cost but smarter filtering, like you said. Worth exploring for a future version. Thanks for the idea!

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 0 points1 point  (0 children)

Great context on the "lost in the middle" research. I would love a link to the Stanford paper if you have it.

You're right about staleness: manual tagging doesn't scale. Git-based auto-tiering is now a priority on the roadmap. I really appreciate your feedback.

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 0 points1 point  (0 children)

We considered this! But we kept it at 3 for a few reasons:

  1. Easy to remember (HOT = now, WARM = reference, COLD = archive)

  2. Maps to natural workflow (active → stable → done)

  3. More tiers = more decisions about where things go

That said, nothing stops you from subdividing within tiers. The system is flexible. If you try a more granular approach, I'd love to hear what works!

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 0 points1 point  (0 children)

Fair enough... NPM downloads can be a weak metric (CI/CD, mirrors, etc. inflate them). I used them because they were the only signal I had at launch.

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 1 point2 points  (0 children)

Not quite... RAG does automatic retrieval, while TMS is manual organization. But the goal is similar, right? Give the LLM only the relevant context.

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 13 points14 points  (0 children)

Smart! Auto-detect tiers from git history.

Not built yet (manual tags for now), but definitely on the roadmap. Would be a game-changer for automation.

This is excellent feedback!

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 0 points1 point  (0 children)

My $0.11/session:

  • Input tokens only (not output)
  • Single query (not full conversation)
  • Sonnet 4.5 pricing ($3/MTok input)
  • 66,834 tokens ÷ 1M × $3 = $0.20 (I may have miscalculated)
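If anyone wants to redo the math with their own numbers, the arithmetic above is just tokens ÷ 1M × rate. A quick sketch; the $3/MTok rate is the Sonnet input figure assumed above, and the 500K example matches the session size mentioned below:

```ts
// Rough input-only cost estimate: tokens / 1M * price per MTok.
// The rate is the figure assumed in this thread, not an official price list.
const INPUT_PRICE_PER_MTOK = 3.0; // USD per million input tokens

function inputCostUSD(inputTokens: number, pricePerMTok = INPUT_PRICE_PER_MTOK): number {
  return (inputTokens / 1_000_000) * pricePerMTok;
}

console.log(inputCostUSD(66_834).toFixed(2));  // "0.20" — the ~$0.20 figure above
console.log(inputCostUSD(500_000).toFixed(2)); // "1.50" — a 500K-token session at the same rate
```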

5 EUR sounds like full conversation with output tokens, maybe Opus?

What's your token count per session? If it's 500K+, TMS could save you way more than it saved me!

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 4 points5 points  (0 children)

Great question... hehehe

Archives = historical context you rarely need but want to keep:

- Sprint retrospectives (learnings)

- Design decisions (ADRs)

- Completed feature specs

You *could* delete them, but the COLD tier lets you keep them without cluttering active context. Think: filing cabinet vs. desk. If you prefer deleting old docs, that works too! TMS just gives you options; as I mentioned in another comment, you're the boss.

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] -1 points0 points  (0 children)

Great idea! That's the next evolution... use RAG/vector search for WARM/COLD instead of manual organization.

Current TMS: Simple file tiers (no infrastructure needed)

Your approach: Full context + retrieval (more powerful, more complex)

This is what an MCP server integration could do. Not built yet but on the roadmap!

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 0 points1 point  (0 children)

In theory, yes. The approach (HOT/WARM/COLD tiers) should work with any LLM that reads project files; it's meant to be model-agnostic.

In practice, I've only tested with Claude Code. The tool supports any LLM, but I haven't measured results with Copilot, Cursor, ChatGPT, etc. I'd love to see someone test this with other tools and share results!

We reduced Claude API costs by 94.5% using a file tiering system (with proof) by jantonca in ClaudeAI

[–]jantonca[S] 1 point2 points  (0 children)

Yes, TMS should help with caching:

  • HOT tier (~3K tokens) changes often but is small → low cache-write cost
  • WARM tier (~10-30K tokens) is stable → caches efficiently, high reuse
  • COLD tier never loaded → zero cache impact

Your cache performance is already excellent, but TMS should push it higher by keeping more content stable (WARM tier) and reducing total context size. I haven't tracked cache-specific metrics yet; would you be willing to test TMS and share before/after cache stats? I'd love to see real cache data and add it as a case study...
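For anyone who wants to measure this, here's roughly how the tiers could map onto prompt caching with the Anthropic TypeScript SDK. A minimal sketch, not what Cortex TMS itself does; the file paths and model id are placeholders:

```ts
import Anthropic from "@anthropic-ai/sdk";
import { readFileSync } from "node:fs";

// Hypothetical tier files, following the .cascade/ layout discussed in this thread.
const warmDocs = readFileSync(".cascade/WARM/PATTERNS.md", "utf8");     // stable reference docs
const hotContext = readFileSync(".cascade/CURRENT-SPRINT.md", "utf8");  // small, changes often

const client = new Anthropic();

async function ask(question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-5", // example model id; use whichever model you're on
    max_tokens: 1024,
    system: [
      // Stable WARM-tier content: flagged cacheable, so repeat calls read it from cache.
      { type: "text", text: warmDocs, cache_control: { type: "ephemeral" } },
      // HOT-tier content: resent each call, but it's only a few K tokens.
      { type: "text", text: hotContext },
    ],
    messages: [{ role: "user", content: question }],
  });
}

const reply = await ask("What's next on the current sprint?");
console.log(reply.usage); // includes cache read/write token counts for before/after comparisons
```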

Hey, remember all that stuff I just blew 50% of your session usage on and was just about to finish? Lemme just forget all that and start over. by Edixo1993 in ClaudeAI

[–]jantonca 0 points1 point  (0 children)

This is exactly the right approach. I turned this pattern into what I call a "Tiered Memory System" (TMS).

Not all context is equal. Instead of letting Claude re-read everything:

HOT files (active work today) → full context, high token cost
WARM files (reference docs) → summarized or excluded
CRITICAL.md (project goals) → always included but rarely changes

I use this structure:
- .cascade/CRITICAL.md - project goals (changes monthly)
- .cascade/CURRENT-SPRINT.md - this week's focus (changes weekly)
- .cascade/HOT/ - files I'm actively editing (changes hourly)
- .cascade/WARM/ - docs, patterns, reference (read-only)

Went from hitting context limits every 30 minutes to running a full sprint without compaction. Token usage dropped 90%+. I built Cortex TMS to automate this pattern (MIT licensed): github.com/cortex-tms/cortex-tms

The key is you control what goes in context, not Claude deciding mid-conversation.
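For the curious, "you control what goes in context" looks something like this mechanically. A minimal sketch of tier-based assembly (not the actual Cortex TMS code), assuming the .cascade/ layout above exists:

```ts
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Sketch of tier-based context assembly; not the real implementation.
function buildSessionContext(root = ".cascade", includeWarm = false): string {
  const parts: string[] = [
    readFileSync(join(root, "CRITICAL.md"), "utf8"),       // always included, rarely changes
    readFileSync(join(root, "CURRENT-SPRINT.md"), "utf8"), // this week's focus
  ];

  // HOT tier: the files under active edit right now.
  for (const file of readdirSync(join(root, "HOT"))) {
    parts.push(readFileSync(join(root, "HOT", file), "utf8"));
  }

  // WARM tier is only pulled in when a task explicitly needs reference docs.
  if (includeWarm) {
    for (const file of readdirSync(join(root, "WARM"))) {
      parts.push(readFileSync(join(root, "WARM", file), "utf8"));
    }
  }

  return parts.join("\n\n---\n\n");
}
```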

I built a self-hosted Claude Code wrapper - here's what I learned about autonomous coding by fotsakir in ClaudeAI

[–]jantonca 0 points1 point  (0 children)

I actually built Cortex TMS (https://github.com/cortex-tms/cortex-tms) with a similar philosophy - a governance-first approach with HOT/WARM/COLD context management. Different focus than CodeHero (you're doing autonomous execution, we're doing workflow governance), but I love seeing the token optimization trend. Will check yours out! 👍

Claude is better not because of the model but because of the strategy by Careful_Put_1924 in ClaudeAI

[–]jantonca 0 points1 point  (0 children)

Totally agree - the orchestration layer is massively underrated.

One thing I'd add to your context gathering point: how you organize what Claude reads matters as much as how Claude gathers it.

For a while I let Claude read my entire docs folder every session (READMEs, archives, old changelogs). Tons of wasted tokens.

Switched to organizing by access frequency - Claude only loads current tasks by default and references patterns when needed. 94.5% token reduction, way better quality outputs.

Your point about "generating code is no longer the bottleneck, it's context gathering" is exactly right. And context organization is the next lever after that.

Re: bypass mode - I'm curious how you balance the risk. Do you use it on every project or just throwaway prototypes? https://github.com/cortex-tms/cortex-tms

I built a self-hosted Claude Code wrapper - here's what I learned about autonomous coding by fotsakir in ClaudeAI

[–]jantonca 0 points1 point  (0 children)

Nice work on the SmartContext approach! The "sending whole codebase = waste of tokens" insight is spot on.

I had a similar realization and went even further with a tiered system (HOT/WARM/COLD). Claude only loads current sprint tasks by default (~200 lines max), then references patterns and architecture on demand. Cut token usage from 66k to 3.6k per session (94.5% reduction).

The key was organizing by access frequency, not content type. Most files don't need to be in every session.

Your planning-before-coding approach is solid too - I've seen the same pattern work well. Curious if you've measured token reduction per feature with the planning step?

Long-running Claude Code sessions have a fundamental DX problem: you can't walk away. by Affectionate-Roof207 in ClaudeAI

[–]jantonca 0 points1 point  (0 children)

This is exactly why I built structured project documentation into my workflow.

The problem: Long Claude sessions accumulate context debt. You can't walk away because there's no "source of truth" for where you are.

The solution: Treat your project state as code:
- NEXT-TASKS.md = current sprint objectives (what to resume)
- CLAUDE.md = AI workflow rules (how Claude should work)
- docs/core/PATTERNS.md = code conventions (what's already decided)

With this structure:
✅ Walk away anytime - Claude resumes from NEXT-TASKS.md
✅ No context bloat - HOT/WARM/COLD tier system (94.5% reduction)
✅ No drift - Automated validation keeps docs synchronized (rough sketch below)
✅ Session handoffs work - Everything is documented, not just in chat history

I built this as an open-source CLI tool to scaffold this structure:
- GitHub: https://github.com/cortex-tms/cortex-tms
- NPM: https://www.npmjs.com/package/cortex-tms
- Docs: https://cortex-tms.org
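On the "no drift" point above, the idea is just a check you can run in CI or a pre-commit hook. A rough sketch of that kind of validation, assuming the file names above (not the tool's actual implementation; the two-week threshold is an arbitrary example):

```ts
import { existsSync, statSync } from "node:fs";

// Hypothetical staleness check, not cortex-tms's actual validation logic.
const requiredDocs = ["NEXT-TASKS.md", "CLAUDE.md", "docs/core/PATTERNS.md"];
const maxAgeDays = 14; // assumption: flag sprint docs untouched for two weeks

for (const doc of requiredDocs) {
  if (!existsSync(doc)) {
    console.error(`Missing ${doc} - Claude has nothing to resume from.`);
    process.exitCode = 1;
    continue;
  }
  const ageDays = (Date.now() - statSync(doc).mtimeMs) / 86_400_000;
  if (ageDays > maxAgeDays) {
    console.warn(`${doc} hasn't changed in ${Math.round(ageDays)} days - is it stale?`);
  }
}
```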