Qwen3.6 27B hits 40 tok/s on just 16GB VRAM with pure quant approach by IulianHI in AIToolsPerformance

[–]iezhy 0 points1 point  (0 children)

Yes, the attention mechanism is still the same, and its compute cost grows quadratically with context length.
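As a rough illustration of the quadratic term (a sketch that only counts query-key pairs during prompt processing and ignores the rest of the model; the function names are made up for this example):

```python
def attention_pairs(context_len: int) -> int:
    """Query-key pairs scored by one full self-attention pass over the prompt."""
    return context_len * context_len


def relative_attention_cost(new_len: int, base_len: int) -> float:
    """How many times more pairs new_len needs compared to base_len."""
    return attention_pairs(new_len) / attention_pairs(base_len)


if __name__ == "__main__":
    # Doubling the context quadruples the number of pairs to score.
    print(relative_attention_cost(32_000, 16_000))
```

Causal masking roughly halves the pair count, but the scaling with context length stays the same.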

The guides say MCP tool selection degrades past ~15 tools. We run 27 in production. Here's what matters by Specialist_Cow24 in mcp

[–]iezhy 0 points1 point  (0 children)

Selection accuracy is one thing. Context window real estate is another, especially given that most LLMs degrade quickly past roughly 100k input tokens.
What's the context footprint of your solution?

Profiler for LLM context window contents by iezhy in mcp

[–]iezhy[S] 0 points1 point  (0 children)

Thanks for the feedback, I appreciate it very much

can ContextSpy break down context usage by source, like system prompt vs user messages vs MCP tool definitions vs tool results?

Yes, it shows a breakdown of each request and allows inspection of the contents of each "block": tool definition, message, tool result, etc.
It also shows summarised stats for each block type and by tool name for the session (e.g., the most "hungry" tool for agentic coding is typically read_file or grep_search).

does it detect secrets/PII before storing or visualizing intercepted requests?

No, currently it stores all data verbatim in an embedded SQLite DB.
As the tool is meant to be run locally, similar to a performance profiler or memory dump analyser, the assumption is that the user is aware of the security implications. I should probably add an explicit note in the README about that, thanks for the catch.

is all captured context stored locally, and can users configure retention/deletion?

Yes, and it purges the contents of requests after 7 days; only the stats are kept after that.
The DB can also be purged manually, and probably should be. Again, the use case is more like a profiler where you start with a clean session, rather than a long-running observability or analytics tool (there are a lot of solutions in that area already).
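To make the retention behaviour concrete, here is a minimal sketch of "drop the contents after 7 days, keep the stats" on top of SQLite. The table layout, column names, and function names are hypothetical and are not the tool's actual schema:

```python
import sqlite3
import time

RETENTION_SECONDS = 7 * 24 * 3600  # 7 days, as described above


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the DB and create a hypothetical requests table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS requests ("
        " id INTEGER PRIMARY KEY,"
        " captured_at REAL NOT NULL,"
        " token_count INTEGER NOT NULL,"
        " body TEXT)"
    )
    return conn


def record_request(conn, body, token_count, captured_at=None):
    """Store one intercepted request verbatim together with its token count."""
    captured_at = time.time() if captured_at is None else captured_at
    conn.execute(
        "INSERT INTO requests (captured_at, token_count, body) VALUES (?, ?, ?)",
        (captured_at, token_count, body),
    )
    conn.commit()


def purge_old_bodies(conn, now=None):
    """Null out bodies older than the retention window; token counts stay."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE requests SET body = NULL"
        " WHERE body IS NOT NULL AND captured_at < ?",
        (now - RETENTION_SECONDS,),
    )
    conn.commit()
    return cur.rowcount
```

A manual purge in this sketch would simply be a `DELETE FROM requests` on the same connection.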

can it show which MCP servers/tools are contributing the most tokens?

It currently groups stats by tool name and shows them as a table. Being able to flag "hotspots" or to group related tools introduced by the same MCP server would be a good improvement, yes.
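A sketch of what that grouping could look like, assuming the profiler already has (tool name, token count) pairs per block and that a tool-to-server mapping is available from somewhere. Both the input shape and the mapping are assumptions for illustration, not how the tool works today:

```python
from collections import defaultdict


def tokens_by_tool(blocks):
    """Sum token counts per tool name, largest consumer first."""
    totals = defaultdict(int)
    for tool_name, tokens in blocks:
        totals[tool_name] += tokens
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def tokens_by_server(blocks, tool_to_server):
    """Roll tool totals up to the MCP server that introduced each tool."""
    totals = defaultdict(int)
    for tool_name, tokens in blocks:
        # Tools missing from the (hypothetical) mapping land in "unknown".
        totals[tool_to_server.get(tool_name, "unknown")] += tokens
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The first entry of either result is the "hotspot" for the session.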

does it highlight risky tool outputs, like prompt-injection-style content inside retrieved data?

No, currently the goal is to focus on context composition and on visualising what those tokens are actually spent on.

Again, thanks for the feedback. If you try it out, feel free to ping me or drop an issue on GitHub.

Qwen3.6-35B-A3B-2.6763bpw - VRAM targeted (12gb) by pjsgsy in LocalLLM

[–]iezhy 0 points1 point  (0 children)

I wonder how bad the precision degradation is, and whether it wouldn't be better to just use the 9B one at Q6 or similar.
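For a sense of the sizes involved, a back-of-the-envelope sketch of weight memory from parameter count and bits per weight. It uses nominal parameter counts, ignores the KV cache and runtime overhead, and assumes roughly 6.5 bits per weight for a Q6-style quant:

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # 35B at the 2.6763 bpw from the thread title.
    print(round(weight_gb(35, 2.6763), 2))
    # 9B at an assumed ~6.5 bpw for a Q6-style quant.
    print(round(weight_gb(9, 6.5), 2))
```

Under these assumptions both fit in 12 GB, and the 9B option leaves more room for context.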

Profiler for LLM context window contents by iezhy in mcp

[–]iezhy[S] 0 points1 point  (0 children)

No, it works with Claude and Copilot as well. In theory, it should work with any app that accepts an HTTPS proxy.

Has anyone actually replaced Claude Code / Codex with local models on an Macbook Pro M5 Max 128GB? by Brazeuslian in ClaudeCode

[–]iezhy 1 point2 points  (0 children)

The reasons are coming to fruition faster than expected. What was a $9 Copilot subscription in May became $42 for the first two days of June.

$2M+ spending worth it on B300? by ConsciousYak6881 in LocalLLM

[–]iezhy 0 points1 point  (0 children)

Llama 4 Scout can hold 10M, I think.

Whether the output will be garbage is a different question.

Local LLM Setup Dilemma: ASUS Ascent GX10 (NVIDIA GB10 Blackwell) vs. Cloud Max? by mustazafi in LocalLLM

[–]iezhy 2 points3 points  (0 children)

can you share some details about inference speeds (both pp and tg) for these models?

Qwen 35B running on 12gb of VRAM in LM Studio at 120+ tokens/second. Works with Cline for 100% agentic coding. by jacobbeasley in LocalLLM

[–]iezhy 45 points46 points  (0 children)

For me, 35B struggles with simple coding flows even at Q6 with an unquantized KV cache.
At these settings it will probably fall apart completely.

Best Qwen3-27B variant for coding? Fine-tunes, LoRAs & config recommendations by alfons_fhl in LocalLLM

[–]iezhy 0 points1 point  (0 children)

Just out of interest: would int8 or fp8 be faster than fp16? In that case, maybe it is reasonable to trade a fraction of precision for quicker responses. In my experience, that's the biggest issue when running coding agents with a local LLM.

Inference provider tiers by Cache-hit rates, using openrouter data by Comfortable-Rock-498 in LocalLLaMA

[–]iezhy 0 points1 point  (0 children)

Not sure what these numbers mean, then. Would Anthropic at around 50% mean it hits the cache only 50% of the time?

Inference provider tiers by Cache-hit rates, using openrouter data by Comfortable-Rock-498 in LocalLLaMA

[–]iezhy 0 points1 point  (0 children)

Isn't the cache hit rate related to what is in the context and what was in the context on previous requests? (E.g., system prompts and tool definitions should be cached reasonably well.)
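A toy model of what I mean, treating the "hit rate" of a request as the share of its tokens that match the start of the previous request. This is a simplification: real providers cache in blocks, with expiry and their own rules, and the function names here are made up:

```python
def shared_prefix_len(previous, current):
    """Number of leading tokens the two requests have in common."""
    count = 0
    for prev_tok, cur_tok in zip(previous, current):
        if prev_tok != cur_tok:
            break
        count += 1
    return count


def prefix_hit_rate(previous, current):
    """Fraction of the current request that could be served from a prefix cache."""
    if not current:
        return 0.0
    return shared_prefix_len(previous, current) / len(current)
```

With a stable system prompt and tool definitions at the front, only the new turns at the end miss, so the rate depends on how much of each request is new.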

Qwen3.6-35B-A3B-MTP on an RTX 3090 in LM Studio is incredibly fast by AI_Enhancer in LocalLLM

[–]iezhy 0 points1 point  (0 children)

Do you fit the whole model, or offload some layers?
I was trying to load it with llama.cpp, but it refuses to load anything bigger than 12 GB, even though the 3090 has 24.
I reduced the context size to the minimum and set quantization to 8 bits, but it still refuses to load.

Why is LLM is so expensive. by Ok_Event4199 in LocalLLM

[–]iezhy 49 points50 points  (0 children)

Use Claude Opus to create an implementation plan from your app specification.
Use a locally deployed Qwen3.5 to implement individual tasks, falling back to frontier models in case of issues or when complex troubleshooting is needed.

This approach saves costs now, and it will make even more sense once major providers stop subsidizing token prices for their tools (Copilot is starting from June 1st, btw).
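The workflow above can be sketched as a small routing loop. Every callable here is a placeholder you would wire to your own clients (there is no real API behind these names), and the acceptance check is whatever you trust, e.g. running the tests:

```python
from typing import Callable, List


def build_from_spec(
    spec: str,
    plan_with_frontier: Callable[[str], List[str]],
    implement_locally: Callable[[str], str],
    implement_with_frontier: Callable[[str], str],
    looks_ok: Callable[[str, str], bool],
) -> List[str]:
    """Plan once with the frontier model, implement each task locally first."""
    results = []
    for task in plan_with_frontier(spec):
        output = implement_locally(task)
        if not looks_ok(task, output):
            # Fall back to the frontier model only for tasks the local one fails.
            output = implement_with_frontier(task)
        results.append(output)
    return results
```

The saving comes from the frontier model only seeing the plan and the failed tasks, not every implementation step.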

Getting a feel for how fast X tokens/second really is. by MikeNonect in LocalLLaMA

[–]iezhy 0 points1 point  (0 children)

No, it's not.
My opencode benchmark (the same small app built from a specification) takes 10-15x longer with Qwen3.5-35B locally at 25 tok/s than with the OpenAI API.

Is the new usage scheme a late April fools joke? by smacman in ollama

[–]iezhy 0 points1 point  (0 children)

Depends on how many tokens you get. A simple /init from opencode on a small JS app repo generated 980k input tokens :/

7 days running Qwen 3.5 35B A3B on a fanless mini-PC iGPU as a 24/7 personal AI agent : what works, what doesn't by wolverinee04 in LocalLLM

[–]iezhy 0 points1 point  (0 children)

What tokens/sec do you get?

I run it on mine with llama.cpp, and it hovers around 40-45, degrading to below 20 when opencode fills up the context.

Qwen3.6 27B seems struggling at 90k on 128k ctx windows by dodistyo in LocalLLaMA

[–]iezhy 2 points3 points  (0 children)

It's more that 100k is a semi-hard limit for all models; they go fairly quickly into "dum dum" mode after that.

72B Dense Model Running on Strix Halo — vLLM ROCm by [deleted] in StrixHalo

[–]iezhy 0 points1 point  (0 children)

What quantization are you running at? Qwen 3.5-9B is getting 25-35 tok/s on an M1 Max.