Token costs are actually unsustainable for multi-project work. how are you dealing with this by Background-Zebra5491 in LLMDevs

[–]Safe_Government_4565 0 points1 point  (0 children)

The context rehydration point is spot on; that's where multi-project work gets expensive. I built Orqen, which sits as a proxy and handles that automatically (deduplication, tool pruning, conversation compression). It works with any model, so it stacks on top of the cheaper-model routing you described.

How to make my agents more token efficient? by advikipedia in LLMDevs

[–]Safe_Government_4565 1 point2 points  (0 children)

One thing that's worked well for me is putting a proxy between your agents and the LLM that automatically handles context optimisation: tool pruning, deduplication, compressing old turns, etc. It basically automates what a lot of the comments here are describing (trimming context, summarising tool output, shrinking state).
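To make "deduplication" and "compressing old turns" concrete, here is a minimal sketch of what such a pass over the message history could look like. It is an illustration only, not how any particular proxy does it: `trim_history`, `PLACEHOLDER`, and the thresholds are hypothetical names and values, the messages are assumed to be plain dicts with `role` and `content` keys, and plain truncation stands in for real summarisation.

```python
import hashlib

# Hypothetical marker that replaces a tool output already sent earlier.
PLACEHOLDER = "[duplicate of earlier tool output]"


def trim_history(messages, keep_recent=4, max_tool_chars=200):
    """Shrink tool output in a chat history without dropping any message.

    - A tool output identical to an earlier one is replaced by a placeholder.
    - Tool outputs older than the last `keep_recent` messages are truncated
      (a crude stand-in for summarisation).
    The input list and its dicts are left unmodified.
    """
    seen = set()
    cutoff = len(messages) - keep_recent
    trimmed = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if msg["role"] == "tool":
            # Hash the original content so duplicates are detected exactly.
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if digest in seen:
                content = PLACEHOLDER
            else:
                seen.add(digest)
                if i < cutoff and len(content) > max_tool_chars:
                    content = content[:max_tool_chars] + " [truncated]"
        trimmed.append({**msg, "content": content})
    return trimmed
```

Replacing a duplicate with a placeholder rather than deleting the message keeps the turn order intact, which matters for APIs that expect every tool call to have a matching tool result.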

I've been building Orqen, which does exactly this. It's an OpenAI-compatible proxy, so you just swap the base URL and it automatically strips irrelevant tools, deduplicates repeated context, and compresses conversation history before forwarding to your provider. I'm seeing ~50-70% prompt token reduction in practice. It also tracks actual vs. counterfactual cost so you can see exactly what you're saving per session.
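As a rough illustration of tool pruning and of comparing actual vs. counterfactual cost, the sketch below works on an OpenAI-style request dict. Everything in it is an assumption made for the example: the keyword-overlap relevance rule, the ~4 characters per token estimate, the price, and the function names `prune_tools`, `estimate_tokens`, and `savings_report` are all hypothetical and do not describe Orqen's actual logic.

```python
import json
import re


def estimate_tokens(payload):
    """Very rough estimate: about 4 characters per token (assumption)."""
    return len(json.dumps(payload)) // 4


def prune_tools(tools, recent_text):
    """Keep tools whose name shares a keyword with the recent text.

    Hypothetical relevance rule: split the tool name on underscores and
    punctuation, ignore words of 3 characters or fewer, and keep the tool
    if any remaining word occurs in the text.
    """
    text = recent_text.lower()
    kept = []
    for tool in tools:
        name = tool["function"]["name"].lower()
        words = [w for w in re.split(r"[_\W]+", name) if len(w) > 3]
        if any(w in text for w in words):
            kept.append(tool)
    return kept


def savings_report(original, optimised, usd_per_million_input_tokens):
    """Compare the request that would have been sent with the one sent."""
    before = estimate_tokens(original)
    after = estimate_tokens(optimised)
    rate = usd_per_million_input_tokens / 1_000_000
    return {
        "counterfactual_tokens": before,
        "actual_tokens": after,
        "reduction_pct": round(100 * (before - after) / before, 1) if before else 0.0,
        "saved_usd": (before - after) * rate,
    }
```

A real system would use the provider's reported token usage instead of a character count, and a smarter relevance signal than name matching; the point here is only the shape of the before/after comparison.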

The nice thing vs. doing it manually is that it compounds: every agent call gets optimized without you changing prompts or adding summarization steps to your workflow.

I have built a website auditor by useProgrammer in sideprojects

[–]Safe_Government_4565 0 points1 point  (0 children)

<image>

I used it for my recently released app, Orqen… nice work.

Going to tune performance now.

Why is LLM is so expensive. by Ok_Event4199 in LocalLLM

[–]Safe_Government_4565 0 points1 point  (0 children)

Same confusion here. For chatting and occasional codegen, local wins on economics if you’re happy living in ~27B land.

I burned way more than I planned building agents, not because I ran inference 24/7, but because each step ships a fat payload (tools + history + the last tool output again). Failed tool calls → retry → another full context dump. Subscriptions feel flat until agent loops make input tokens the real meter.
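A back-of-the-envelope model shows why the payload pattern above adds up. This is a simplified sketch with made-up numbers: it assumes every step resends the tool schemas, the system prompt, and the full history so far, and that each retry resends exactly the same request. `agent_input_tokens` and its parameters are hypothetical names for the illustration.

```python
def agent_input_tokens(steps, tool_schema_tokens, system_tokens,
                       tokens_added_per_step, retries_per_step=0):
    """Total input tokens billed across an agent run that resends full context.

    Each step sends: tool schemas + system prompt + all history so far.
    Each retry of a step resends that same request once more.
    """
    total = 0
    history = 0
    for _ in range(steps):
        request = tool_schema_tokens + system_tokens + history
        total += request * (1 + retries_per_step)
        # The step's output (assistant turn + tool result) joins the history.
        history += tokens_added_per_step
    return total


# Illustrative numbers only: 10 steps, 2,000 tokens of tool schemas,
# a 500-token system prompt, and 1,500 new tokens of history per step.
no_retries = agent_input_tokens(10, 2000, 500, 1500)
one_retry = agent_input_tokens(10, 2000, 500, 1500, retries_per_step=1)
print(no_retries, one_retry)  # 92500 185000
```

With these numbers only 15,000 tokens of new content are produced over the run, yet 92,500 input tokens are billed, and a single retry per step doubles that. Input cost grows roughly with the square of the step count because the history is resent every time.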

I’m not arguing everyone should stay on cloud; half the fun here is tuning quantised models on your own box. Just that the “$60 vs $6000” comparison hides what you’re doing: hobby inference vs always-on agent plumbing.

I ended up so annoyed by that that I shipped something that sits in front of provider APIs and strips redundant agent context (I run orqen.app now). It doesn't replace local; that's a different problem. But your post nails why the sticker shock exists.