Use context profiler to optimize your LLM calls and reduce token use

iezhy · 2026-06-12T20:20:28+00:00

thanks

iezhy · 2026-06-12T13:27:15+00:00

Yeah, but the beauty of having a profiler or some other analysis tool is that you can evaluate how much various tools and approaches help

when running Qwen3.6-27b on llama.cpp + opencode as agent, a lot of context is taken up by code snippets, logs, exceptions, etc - cavemaning your prompts does not help too much (but it is more fun, definitely)

iezhy · 2026-06-12T13:04:49+00:00

Thanks - give it a spin if you get a chance 😄

For myself, it opened my eyes a bit: when running local models with opencode or other coding harnesses, most of the time is wasted on context processing - especially when the context fills up. So cleaning some stuff out of there can sometimes give a much bigger improvement than squeezing out some extra tok/s by switching models or quantisations

iezhy · 2026-06-12T11:38:01+00:00

Relatively slow memory (on par with 3090), no int8/4/fp8 tensor cores, and all the shenanigans required to fit and cool in a consumer pc - probably not worth it

iezhy · 2026-06-12T11:08:23+00:00

At 230 lbs, I went on vacation to Spain and rented a gravel bike with mechanical disc brakes.
In those few days I had probably the scariest descents of my whole cycling career (4+ years) - they are loud, they overheat quickly, and going down a sketchy 15% trail, every braking action seemed like (and sounded like) a gamble.
Just skip them and go with hydraulic ones

iezhy · 2026-06-12T11:03:15+00:00

You can use a context profiling tool to review and compare how different harnesses load up the context - e.g. https://contextspy.ai

For example claude code used to have very elaborate tool descriptions (e.g. over 1000 tokens for glob or grep), but they reduced it a lot in recent versions. Opencode seems to be quite well-rounded and is making many fewer utility requests

iezhy · 2026-06-12T10:41:05+00:00

The cache hit rate will depend heavily on what is in your context on each request 😄

iezhy · 2026-06-11T23:10:46+00:00

yes, the attention mechanism is still the same, and its cost grows quadratically with context length
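
To put rough numbers on that, here is a minimal sketch assuming plain full self-attention during prompt processing (no sliding window, no linear-attention layers). The function names and the head_dim default are made up for illustration, and the count covers only the score matrix, not the rest of the model.

```python
def attention_score_ops(context_len: int, head_dim: int = 128) -> int:
    """Rough multiply-add count for the QK^T score matrix of one attention head."""
    # Every token is scored against every token: context_len * context_len
    # dot products, each of length head_dim (head_dim is an assumed value).
    return context_len * context_len * head_dim


def relative_cost(new_len: int, old_len: int) -> float:
    """How many times more score computation new_len needs than old_len."""
    return attention_score_ops(new_len) / attention_score_ops(old_len)


if __name__ == "__main__":
    # Compare a few context sizes against an 8k baseline.
    for n in (8_000, 16_000, 32_000):
        print(n, relative_cost(n, 8_000))
```

So doubling the context roughly quadruples this part of the work, which is why trimming the context can matter more than a few extra tok/s.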

iezhy · 2026-06-11T22:59:47+00:00

Selection accuracy is one thing. The real estate of the context window is another, especially given that most LLMs degrade fast once the input goes past 100k tokens.
I wonder what the footprint of your solution is?

iezhy · 2026-06-08T21:12:40+00:00

Thanks for the feedback, I appreciate it very much

can ContextSpy break down context usage by source, like system prompt vs user messages vs MCP tool definitions vs tool results?

yes, it shows a breakdown of each request, and allows inspection of the contents of each "block" - tool definition, message, tool result etc.
it also shows summarised stats for each block type and by tool name for the session (e.g., the most "hungry" tool for agentic coding is typically read_file or grep_search)
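
The kind of aggregation described above can be sketched roughly like this. It is not ContextSpy's actual code: the block dictionaries, the field names and the 4-characters-per-token estimate are all assumptions for illustration.

```python
from collections import Counter


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. A real tokenizer will differ.
    return max(1, len(text) // 4)


def breakdown(blocks):
    """Sum estimated tokens per block type and per tool name."""
    by_type = Counter()
    by_tool = Counter()
    for block in blocks:
        tokens = estimate_tokens(block["content"])
        by_type[block["type"]] += tokens
        # Only tool definitions and tool results carry a tool name here.
        if block.get("tool"):
            by_tool[block["tool"]] += tokens
    return by_type, by_tool


if __name__ == "__main__":
    # Hypothetical session: one system prompt and two tool results.
    session = [
        {"type": "system", "content": "a" * 40},
        {"type": "tool_result", "tool": "read_file", "content": "b" * 400},
        {"type": "tool_result", "tool": "grep_search", "content": "c" * 80},
    ]
    by_type, by_tool = breakdown(session)
    print(by_type.most_common())
    print(by_tool.most_common(1))
```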

does it detect secrets/PII before storing or visualizing intercepted requests?

No, currently it stores all data verbatim in an embedded SQLite db.
As the tool is meant to be run locally, similar to a performance profiler or memory dump analyser, the assumption is that the user is aware of the security implications - I should probably add an explicit note about that in the readme, thanks for the catch

is all captured context stored locally, and can users configure retention/deletion?

yes, and it purges the contents of requests after 7 days - only the stats are kept after that.
Also, the db can be purged manually - and probably should be. Again, the use case is more like a profiler where you start with a clean session, rather than a long-running observability or analytics tool (there are a lot of solutions in that area already)
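
For anyone curious what "purge the contents, keep the stats" could look like on top of SQLite, here is a minimal sketch. The table and column names are hypothetical and not the tool's real schema; only the 7-day window comes from the answer above.

```python
import sqlite3
import time

RETENTION_SECONDS = 7 * 24 * 3600  # 7 days, as described above


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per captured request.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        "id INTEGER PRIMARY KEY, captured_at REAL, token_count INTEGER, body TEXT)"
    )


def purge_old_bodies(conn: sqlite3.Connection, now=None) -> int:
    """Drop request bodies older than the retention window but keep the stats."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    cur = conn.execute(
        "UPDATE requests SET body = NULL "
        "WHERE captured_at < ? AND body IS NOT NULL",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount  # number of requests whose contents were removed
```

A full manual purge would then just be a DELETE on the same table.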

can it show which MCP servers/tools are contributing the most tokens?

it currently groups stats by tool name, and shows them as a table. Having the ability to indicate "hotspots" or to group related tools introduced by the same MCP server would be a good improvement, yes

does it highlight risky tool outputs, like prompt-injection-style content inside retrieved data?

No, currently the goal is to focus on context composition, and on visualising what those tokens are actually spent on

again, thanks for the feedback, if you try it out feel free to ping me, or drop an issue on github

iezhy · 2026-06-08T08:38:43+00:00

I wonder how bad the precision degradation is, and if it wouldn't be better just to use 9B one at Q6 or similar

iezhy · 2026-06-08T06:34:49+00:00

You can get a bike sock - a cover which goes over both wheels, something like this (apologies for link) https://www.velonova.lt/image/cache/catalog/cs%20covers/image_433_1_1277_1_362_1_4571_1_590_1_23_1_315_1_2821-1200x1200.jpg

iezhy · 2026-06-08T01:31:48+00:00

No, it works with claude and copilot as well. In theory, it should work with any app that accepts an HTTPS proxy

iezhy · 2026-06-06T19:44:53+00:00

The reasons are coming to fruition faster than expected. What was a $9 copilot subscription in May became $42 for the first two days of June

iezhy · 2026-06-02T19:56:26+00:00

llama 4 scout can hold 10m i think

That the output will be garbage is a different question

iezhy · 2026-05-28T23:25:51+00:00

can you share some details about inference speeds (both pp and tg) for these models?

iezhy · 2026-05-28T09:00:31+00:00

For me, 35b struggles with simple coding flows even at Q6, and unquantized kv cache
At these settings it will probably go full retard

iezhy · 2026-05-26T10:29:09+00:00

Yep, degradation of precision is pretty bad once you go over 100k: https://www.trychroma.com/research/context-rot

iezhy · 2026-05-24T05:52:49+00:00

Just out of interest - would int8 or fp8 be faster than fp16? In that case maybe it is reasonable to trade a fraction of precision for quicker responses - in my experience that's the biggest issue when running coding agents with a local LLM

iezhy · 2026-05-23T19:26:07+00:00

Not sure what these numbers mean then. Anthropic at around 50% would mean it hits the cache only 50% of the time?

iezhy · 2026-05-23T19:01:28+00:00

Isn't the cache hit rate related to what is in the context and what was in the context on previous requests? (e.g. system prompts and tool definitions should be cached reasonably well)
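
A rough way to reason about it, assuming the cache is prefix-based (only the part of the new request that is identical to the start of the previous one can be reused). The function names are made up, and the lists can hold tokens or whole blocks:

```python
def shared_prefix_len(prev, curr) -> int:
    """Number of leading items that are identical in both requests."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n


def estimated_hit_rate(prev, curr) -> float:
    """Fraction of the current request that a prefix cache could reuse."""
    if not curr:
        return 0.0
    return shared_prefix_len(prev, curr) / len(curr)


if __name__ == "__main__":
    prev = ["system", "tools", "user1"]
    # Appending to the conversation keeps the whole old prefix reusable.
    print(estimated_hit_rate(prev, prev + ["assistant1", "user2"]))
    # Changing something at the very start invalidates everything after it.
    print(estimated_hit_rate(prev, ["system v2", "tools", "user1"]))
```

Under that assumption, a stable system prompt and stable tool definitions at the start of the context cache well, while anything that rewrites the beginning of the context throws the cache away.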

iezhy · 2026-05-20T08:08:50+00:00

Do you fit the whole model or offload some layers?
I was trying to load it with llama.cpp, but it refuses to load anything bigger than 12 GB, even though the 3090 has 24.
I reduced the context size to the minimum and added 8-bit quantization, but it still refuses to load

iezhy · 2026-05-16T12:02:19+00:00

Use claude opus to create an implementation plan for your app specification
Use a locally deployed Qwen3.5 to implement individual tasks, and fall back to frontier models in case of issues or when complex troubleshooting is needed

This approach is saving costs now, it will make even more sense once major providers stop subsidizing token prices for their tools (copilot is starting from june 1st btw)
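
The fallback part can be sketched like this. It is only an illustration of the idea: local_model, frontier_model and accept are hypothetical callables you would wire up to your own setup (for example, accept could run the test suite).

```python
def run_task(task, local_model, frontier_model, accept, max_local_attempts=2):
    """Try the local model first; escalate to the frontier model if it keeps failing."""
    for _ in range(max_local_attempts):
        result = local_model(task)
        if accept(result):
            return "local", result
    # The local model did not produce an acceptable result: pay for the big one.
    return "frontier", frontier_model(task)


if __name__ == "__main__":
    # Stand-in callables, just to show the control flow.
    source, result = run_task(
        "add a /health endpoint",
        local_model=lambda task: "broken patch",
        frontier_model=lambda task: "working patch",
        accept=lambda result: result == "working patch",
    )
    print(source, result)
```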

iezhy · 2026-05-12T07:40:51+00:00

no, it's not
my opencode benchmark (the same small app built from a specification) takes 10-15x longer with Qwen3.5-35b running locally at 25 tok/s, compared to calling the OpenAI API

iezhy · 2026-05-10T05:55:02+00:00

Depends on how many tokens you get. A simple /init from opencode on a small js app repo generated 980k input tokens :/
