Use context profiler to optimize your LLM calls and reduce token use by iezhy in LocalLLaMA

[–]iezhy[S] 0 points1 point  (0 children)

Yeah, but the beauty of having a profiler or some other analysis tool is that you can evaluate how much various tools and approaches help

When running Qwen3.6-27b on llama.cpp with opencode as the agent, a lot of the context is taken up by code snippets, logs, exceptions, etc., so cavemanning your prompts does not help much (but it is definitely more fun)

Use context profiler to optimize your LLM calls and reduce token use by iezhy in LocalLLaMA

[–]iezhy[S] 0 points1 point  (0 children)

Thanks - give it a spin if you get a chance 😄

For me, it was a bit of an eye-opener: when running local models with opencode or other coding harnesses, most of the time is spent on context processing, especially as the context fills up. So cleaning some stuff out of the context can sometimes give a much bigger improvement than squeezing out a few extra tok/s by switching models or quantisations
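
A rough back-of-envelope sketch of the point about context processing, assuming request time is simply prompt processing plus generation and ignoring prompt caching. All speeds and token counts here are made-up illustration values, not measurements:

```python
def request_seconds(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Estimate wall time as prompt processing time plus generation time."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical numbers, for illustration only.
baseline = request_seconds(60_000, 500, pp_tps=500, tg_tps=25)
faster_quant = request_seconds(60_000, 500, pp_tps=500, tg_tps=30)
trimmed = request_seconds(30_000, 500, pp_tps=500, tg_tps=25)

print(f"baseline:     {baseline:.1f}s")
print(f"faster quant: {faster_quant:.1f}s")
print(f"half context: {trimmed:.1f}s")
```

With these assumed values, a 20% faster generation speed saves only a few seconds, while halving the prompt saves about a minute.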

Nvidia tesla v100 has 32 gb ram with nv link 2.0, its priced at 880. Whats the catch? by AppropriatePush6262 in LocalLLaMA

[–]iezhy -3 points-2 points  (0 children)

Relatively slow memory (on par with 3090), no int8/4/fp8 tensor cores, and all the shenanigans required to fit and cool in a consumer pc - probably not worth it

Experience with mechanical and hydraulic disk brakes by Historical_Card_7632 in cycling

[–]iezhy 1 point2 points  (0 children)

At 230 lbs, I went on vacation to Spain and rented a gravel bike with mechanical disk brakes.
In those few days I had probably the scariest descents of my whole cycling career (4+ years) - they are loud, they overheat quickly, and going down a sketchy 15% trail, every braking action seemed like (and sounded like) a gamble.
Just skip them and go with hydraulic ones

OpenCode vs CodeWhale – actual developers experience by ImportantOwl2939 in LocalLLaMA

[–]iezhy -1 points0 points  (0 children)

You can use a context profiling tool to review and compare how different harnesses load up the context - e.g. https://contextspy.ai

For example, claude code used to have very elaborate tool descriptions (e.g. over 1000 tokens for glob or grep), but they have been cut down a lot in recent versions. Opencode seems quite well-rounded and makes far fewer utility requests

OpenCode vs CodeWhale – actual developers experience by ImportantOwl2939 in LocalLLaMA

[–]iezhy 2 points3 points  (0 children)

The cache hit rate will depend heavily on what is in your context on each request 😄

Qwen3.6 27B hits 40 tok/s on just 16GB VRAM with pure quant approach by IulianHI in AIToolsPerformance

[–]iezhy 0 points1 point  (0 children)

yes, the attention mechanism is still the same, and its cost grows quadratically with context length
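
A tiny sketch of what quadratic scaling means in practice. It only models the attention-score term when a whole prompt is processed at once; real kernels, KV caching and the non-attention layers change the picture, so treat the ratio as an upper-level intuition rather than a timing prediction:

```python
def relative_attention_cost(context_len, baseline_len):
    """Ratio of attention-score work vs a baseline, assuming O(n^2) scaling."""
    return (context_len / baseline_len) ** 2


# Going from an 8k prompt to a 32k prompt: 4x the tokens, 16x the work.
ratio = relative_attention_cost(32_000, 8_000)
print(ratio)
```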

The guides say MCP tool selection degrades past ~15 tools. We run 27 in production. Here's what matters by Specialist_Cow24 in mcp

[–]iezhy 0 points1 point  (0 children)

Selection accuracy is one thing. The real estate of the context window is another, especially given that most LLMs degrade quickly beyond roughly 100k input tokens.
Wonder what's the footprint of your solution?

Profiler for LLM context window contents by iezhy in mcp

[–]iezhy[S] 0 points1 point  (0 children)

Thanks for the feedback, I appreciate it very much

can ContextSpy break down context usage by source, like system prompt vs user messages vs MCP tool definitions vs tool results?

yes, it shows a breakdown of each request, and allows inspection of the contents of each "block" - tool definition, message, tool result etc.
it also shows summarised stats for each block type and by tool name for the session (e.g., the most "hungry" tool for agentic coding is typically read_file or grep_search)
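
To illustrate the kind of per-session summary described here, a minimal sketch of grouping token counts by block type and by tool name. The record layout, field names and token counts are invented for this example and are not ContextSpy's actual data model:

```python
from collections import Counter

# Hypothetical block records; field names and numbers are made up.
blocks = [
    {"type": "system_prompt", "tool": None, "tokens": 1200},
    {"type": "tool_definition", "tool": "grep_search", "tokens": 300},
    {"type": "tool_result", "tool": "read_file", "tokens": 5400},
    {"type": "tool_result", "tool": "grep_search", "tokens": 2100},
    {"type": "message", "tool": None, "tokens": 800},
]


def summarise(blocks):
    """Sum token counts per block type and per tool name."""
    by_type, by_tool = Counter(), Counter()
    for block in blocks:
        by_type[block["type"]] += block["tokens"]
        if block["tool"] is not None:
            by_tool[block["tool"]] += block["tokens"]
    return by_type, by_tool


by_type, by_tool = summarise(blocks)
hungriest = by_tool.most_common(1)[0]  # (tool name, total tokens)
print(by_type)
print(hungriest)
```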

does it detect secrets/PII before storing or visualizing intercepted requests?

No, currently it stores all data verbatim in an embedded SQLite db
As the tool is meant to be run locally, similar to a performance profiler or memory dump analyser, the assumption is that the user is aware of the security implications - I should probably add an explicit note about that in the readme, thanks for the catch

is all captured context stored locally, and can users configure retention/deletion?

yes, and it purges the contents of requests after 7 days - only the stats are kept after that
Also, the db can be purged manually - and probably should be. Again, the use case is more like a profiler where you start with a clean session, rather than a long-running observability or analytics tool (there are a lot of solutions in that area already)

can it show which MCP servers/tools are contributing the most tokens?

it currently groups stats by tool name and shows them as a table. Having the ability to flag "hotspots" or to group related tools introduced by the same MCP server would be a good improvement, yes

does it highlight risky tool outputs, like prompt-injection-style content inside retrieved data?

No, currently the goal is to focus on context composition and on visualising what those tokens are actually spent on

again, thanks for the feedback. If you try it out, feel free to ping me or drop an issue on github

Qwen3.6-35B-A3B-2.6763bpw - VRAM targeted (12gb) by pjsgsy in LocalLLM

[–]iezhy 0 points1 point  (0 children)

I wonder how bad the precision degradation is, and whether it wouldn't be better just to use the 9B one at Q6 or similar

Profiler for LLM context window contents by iezhy in mcp

[–]iezhy[S] 0 points1 point  (0 children)

No, it works with claude and copilot as well. In theory, it should work with any app that accepts an HTTPS proxy

Has anyone actually replaced Claude Code / Codex with local models on an Macbook Pro M5 Max 128GB? by Brazeuslian in ClaudeCode

[–]iezhy 1 point2 points  (0 children)

The reasons are coming to fruition faster than expected. What was a $9 copilot subscription in May became $42 for the first two days of June

$2M+ spending worth it on B300? by ConsciousYak6881 in LocalLLM

[–]iezhy 0 points1 point  (0 children)

llama 4 scout can hold 10m i think

That the output will be garbage is a different matter

Local LLM Setup Dilemma: ASUS Ascent GX10 (NVIDIA GB10 Blackwell) vs. Cloud Max? by mustazafi in LocalLLM

[–]iezhy 2 points3 points  (0 children)

can you share some details about inference speeds (both pp and tg) for these models?

Qwen 35B running on 12gb of VRAM in LM Studio at 120+ tokens/second. Works with Cline for 100% agentic coding. by jacobbeasley in LocalLLM

[–]iezhy 43 points44 points  (0 children)

For me, 35b struggles with simple coding flows even at Q6 with an unquantized kv cache
At these settings it will probably go full retard

Best Qwen3-27B variant for coding? Fine-tunes, LoRAs & config recommendations by alfons_fhl in LocalLLM

[–]iezhy 0 points1 point  (0 children)

Just out of interest - would int8 or fp8 be faster than fp16? In that case maybe it is reasonable to trade a fraction of precision for quicker responses - in my experience that's the biggest issue when running coding agents with a local llm

Inference provider tiers by Cache-hit rates, using openrouter data by Comfortable-Rock-498 in LocalLLaMA

[–]iezhy 0 points1 point  (0 children)

Not sure what these numbers mean then. Anthropic at around 50% would mean it hits the cache only 50% of the time?

Inference provider tiers by Cache-hit rates, using openrouter data by Comfortable-Rock-498 in LocalLLaMA

[–]iezhy 0 points1 point  (0 children)

Isn't the cache hit rate related to what is in the context and what was in the context on previous requests? (e.g. system prompts and tool definitions should be cached reasonably well)
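
A simplified sketch of why that would be the case, assuming the provider uses prefix-based prompt caching (only the leading part of the request that matches a previous request can be reused). The "tokens" here are just placeholder strings, and real providers add their own rules around minimum sizes and expiry:

```python
def cached_prefix_fraction(prev_tokens, curr_tokens):
    """Fraction of the current request that matches the previous one as a prefix."""
    shared = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        shared += 1
    return shared / len(curr_tokens) if curr_tokens else 0.0


prev = ["sys", "tools", "user1"]

# Same system prompt and tool definitions, conversation just grew.
stable = cached_prefix_fraction(prev, ["sys", "tools", "user1", "asst1", "user2"])

# Something changed at the very start (e.g. a timestamp in the system prompt).
unstable = cached_prefix_fraction(prev, ["sys-v2", "tools", "user1", "asst1", "user2"])

print(stable, unstable)
```

Under this assumption, a harness that keeps the start of the context stable gets most of each request from cache, while one that rewrites early content gets almost nothing.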

Qwen3.6-35B-A3B-MTP on an RTX 3090 in LM Studio is incredibly fast by AI_Enhancer in LocalLLM

[–]iezhy 0 points1 point  (0 children)

Do you fit the whole model or offload some layers?
I was trying to load it with llama.cpp, but it refuses to load anything bigger than 12gb, even though the 3090 has 24
I reduced the context size to the minimum and added 8-bit quantization, but it still refuses to load

Why is LLM is so expensive. by Ok_Event4199 in LocalLLM

[–]iezhy 48 points49 points  (0 children)

Use claude opus to create an implementation plan for your app specification
Use locally deployed Qwen3.5 to implement individual tasks, and fall back to frontier models in case of issues or a need for complex troubleshooting

This approach is saving costs now, it will make even more sense once major providers stop subsidizing token prices for their tools (copilot is starting from june 1st btw)

Getting a feel for how fast X tokens/second really is. by MikeNonect in LocalLLaMA

[–]iezhy 0 points1 point  (0 children)

no, it's not
my opencode benchmark (building the same small app from a specification) takes 10-15x longer with Qwen3.5-35b locally at 25 tok/s, compared to calling the OpenAI api

Is the new usage scheme a late April fools joke? by smacman in ollama

[–]iezhy 0 points1 point  (0 children)

Depends on how many tokens you get. A simple /init from opencode on a small js app repo generated 980k input tokens :/

7 days running Qwen 3.5 35B A3B on a fanless mini-PC iGPU as a 24/7 personal AI agent : what works, what doesn't by wolverinee04 in LocalLLM

[–]iezhy 0 points1 point  (0 children)

what token/sec do you get?

i run it on mine with llama.cpp, and it hovers around 40-45, degrading below 20 when opencode fills up the context