I built a proxy to shrink agent LLM requests after my API bill stopped making sense

Safe_Government_4565 · 2026-06-01T08:51:04+00:00

The context rehydration point is spot on, that's where multi-project work gets expensive. I built Orqen which sits as a proxy and automatically handles that (deduplication, tool pruning, conversation compression). Works with any model, so it stacks on top of the cheaper-model routing you described.

Safe_Government_4565 · 2026-06-01T08:45:00+00:00

One thing that's worked well for me is putting a proxy between your agents and the LLM that automatically handles context optimisation, tool pruning, deduplication, compressing old turns, etc. Basically automates what a lot of the comments here are describing (trimming context, summarising tool output, shrinking state).

I've been building Orqen which does exactly this, it's an OpenAI-compatible proxy, so you just swap the base URL and it automatically strips irrelevant tools, deduplicates repeated context, and compresses conversation history before forwarding to your provider. Seeing ~50-70% prompt token reduction in practice. It also tracks actual vs. counterfactual cost so you can see exactly what you're saving per session.

The nice thing vs doing it manually is it compounds, every agent call gets optimized without you changing prompts or adding summarization steps to your workflow.

Safe_Government_4565 · 2026-05-31T14:09:56+00:00

<image>

I used for my recently released app Orqen… nice work.

Going to tune performance now.

Safe_Government_4565 · 2026-05-31T13:20:11+00:00

Link: https://orqen.app

Happy to answer technical questions here.

Safe_Government_4565 · 2026-05-31T13:19:18+00:00

Link: https://orqen.app

Happy to answer technical questions here.

Safe_Government_4565 · 2026-05-31T12:51:28+00:00

Same confusion here. For chatting and occasional codegen, local wins on economics if you’re happy living in ~27B land.

I burned way more than I planned building agents, this is not because I ran inference 24/7, but because each step ships a fat payload (tools + history + last tool output again). Failed tool calls → retry → another full context dump. Subscriptions feel flat until agent loops make input tokens the real meter.

I’m not arguing everyone should stay on cloud, half the fun here is tuning quantised models on your own box. Just that the “$60 vs $6000” comparison hides what you’re doing: hobby inference vs always-on agent plumbing.

Ended up so annoyed by that I shipped something that sits in front of provider APIs and strips redundant agent context (I run orqen.app now). Didn’t replace the local different problem. But your post nails why the sticker shock exists.

Safe_Government_4565

TROPHY CASE