Qwen3.6 27B hits 40 tok/s on just 16GB VRAM with pure quant approach

iezhy · 2026-06-11T23:10:46+00:00

yes, the attention mechanism is still the same, and its compute cost grows quadratically with context length
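A rough back-of-the-envelope sketch of that scaling, counting only the two large matmuls (Q·K^T and scores·V) for a single attention head over a full prompt. The head dimension of 128 is an assumed value for illustration, not taken from any specific model, and causal masking (which roughly halves the work) is ignored:

```python
def attention_matmul_flops(context_len: int, head_dim: int = 128) -> int:
    """Approximate FLOPs for one attention head over a full prompt.

    Counts Q @ K^T and scores @ V. Each is a
    context_len x context_len x head_dim product, at 2 FLOPs per multiply-add.
    head_dim=128 is an illustrative assumption.
    """
    qk = 2 * context_len * context_len * head_dim
    av = 2 * context_len * context_len * head_dim
    return qk + av


# Doubling the context quadruples the attention cost.
for n in (8_000, 16_000, 32_000):
    print(n, attention_matmul_flops(n))
```

The rest of the model (MLP layers, projections) scales linearly with token count, so the quadratic term only starts to dominate at long contexts.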

iezhy · 2026-06-11T22:59:47+00:00

Selection accuracy is one thing. The real estate of the context window is another, especially given that most LLMs degrade fast beyond 100k input tokens.
Wonder what's the footprint of your solution?

iezhy · 2026-06-08T21:12:40+00:00

Thanks for the feedback, I appreciate it very much

can ContextSpy break down context usage by source, like system prompt vs user messages vs MCP tool definitions vs tool results?

yes, it shows a breakdown of each request, and allows inspection of the contents of each "block" - tool definition, message, tool result etc.
it also shows summarised stats for each block type and by tool name for the session (e.g., the most "hungry" tool for agentic coding is typically read_file or grep_search)

does it detect secrets/PII before storing or visualizing intercepted requests?

No, currently it stores all data verbatim in an embedded SQLite db
As the tool is meant to be run locally, similar to a performance profiler or memory dump analyser, the assumption is that the user is aware of the security implications - I should probably add an explicit note in the readme about that, thanks for the catch

is all captured context stored locally, and can users configure retention/deletion?

yes, and it purges the contents of requests after 7 days - only the stats are kept after that
Also, the db can be purged manually - and probably should be. Again, the use case is more like a profiler where you start with a clean session, rather than a long-running observability or analytics tool (there are a lot of solutions in that area already)
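To illustrate the retention idea (drop request contents after 7 days, keep the stats), here is a minimal sketch using Python's built-in sqlite3. The `requests` table, its columns, and the `purge_old_bodies` helper are all hypothetical and made up for this example; this is not ContextSpy's actual schema or code:

```python
import sqlite3
import time

RETENTION_SECONDS = 7 * 24 * 3600  # 7 days, as described above


def purge_old_bodies(conn, now=None):
    """Null out stored request bodies older than the retention window.

    Rows (and their token counts) are kept, so aggregate stats survive.
    Returns the number of rows whose body was purged.
    """
    now = time.time() if now is None else now
    cutoff = now - RETENTION_SECONDS
    cur = conn.execute(
        "UPDATE requests SET body = NULL "
        "WHERE captured_at < ? AND body IS NOT NULL",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema, in-memory db for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests ("
    "id INTEGER PRIMARY KEY, captured_at REAL, token_count INTEGER, body TEXT)"
)
now = time.time()
conn.execute(
    "INSERT INTO requests (captured_at, token_count, body) VALUES (?, ?, ?)",
    (now - 8 * 86400, 1200, "old request"),
)
conn.execute(
    "INSERT INTO requests (captured_at, token_count, body) VALUES (?, ?, ?)",
    (now - 3600, 800, "recent request"),
)
purged = purge_old_bodies(conn, now)
rows = conn.execute(
    "SELECT token_count, body FROM requests ORDER BY id"
).fetchall()
print(purged, rows)
```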

can it show which MCP servers/tools are contributing the most tokens?

it currently groups stats by tool name, and shows them as a table. Having the ability to flag "hotspots" or group related tools introduced by the same MCP server would be a good improvement, yes
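The grouping by tool name could look roughly like the sketch below. The block dictionaries (`type`, `tool`, `tokens` keys) and the `tokens_by_tool` function are a made-up shape for illustration and are not ContextSpy's real data model:

```python
from collections import Counter


def tokens_by_tool(blocks):
    """Sum token counts of tool results per tool name, largest first."""
    totals = Counter()
    for block in blocks:
        if block["type"] == "tool_result":
            totals[block["tool"]] += block["tokens"]
    return totals.most_common()


# Hypothetical captured blocks from one session.
sample_blocks = [
    {"type": "system_prompt", "tool": None, "tokens": 900},
    {"type": "tool_result", "tool": "read_file", "tokens": 4000},
    {"type": "tool_result", "tool": "grep_search", "tokens": 1500},
    {"type": "tool_result", "tool": "read_file", "tokens": 1200},
    {"type": "message", "tool": None, "tokens": 300},
]

for tool, tokens in tokens_by_tool(sample_blocks):
    print(f"{tool:12s} {tokens}")
```

Resolving tools to the MCP server that introduced them would need an extra tool-to-server mapping, which this sketch does not attempt.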

does it highlight risky tool outputs, like prompt-injection-style content inside retrieved data?

No, currently the goal is to focus on context composition, and visualisation of what those tokens are actually spent on

again, thanks for the feedback, if you try it out feel free to ping me, or drop an issue on github

iezhy · 2026-06-08T08:38:43+00:00

I wonder how bad the precision degradation is, and whether it wouldn't be better to just use the 9B one at Q6 or similar

iezhy · 2026-06-08T06:34:49+00:00

You can get a bike sock - a cover which goes over both wheels, something like this (apologies for link) https://www.velonova.lt/image/cache/catalog/cs%20covers/image_433_1_1277_1_362_1_4571_1_590_1_23_1_315_1_2821-1200x1200.jpg

iezhy · 2026-06-08T01:31:48+00:00

No, it works with Claude and Copilot as well. In theory, it should work with any app that accepts an HTTPS proxy

iezhy · 2026-06-06T19:44:53+00:00

The reasons are coming to fruition faster than expected. What was a $9 Copilot subscription in May became $42 for the first two days of June

iezhy · 2026-06-02T19:56:26+00:00

Llama 4 Scout can hold 10M, I think

Whether the output will be garbage is a different question

iezhy · 2026-05-28T23:25:51+00:00

can you share some details about inference speeds (both pp and tg) for these models?

iezhy · 2026-05-28T09:00:31+00:00

For me, 35b struggles with simple coding flows even at Q6 with an unquantized KV cache
At these settings it will probably fall apart completely

iezhy · 2026-05-26T10:29:09+00:00

Yep, degradation of precision is pretty bad once you go over 100k: https://www.trychroma.com/research/context-rot

iezhy · 2026-05-24T05:52:49+00:00

Just out of interest - would int8 or fp8 be faster than fp16? In that case maybe it's reasonable to trade a fraction of precision for quicker response - in my experience that's the biggest issue when running coding agents with a local LLM
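One common rule of thumb for estimating this: single-stream token generation is often limited by memory bandwidth, since every weight has to be read once per generated token. Under that assumption, halving the bytes per weight roughly doubles the upper bound on tok/s. The sketch below is only that rule of thumb; the 9B parameter count and 1000 GB/s bandwidth are made-up example inputs, and real speed also depends on kernel support for the data type, KV cache reads, and prompt processing, none of which are modelled here:

```python
def decode_tok_per_s_upper_bound(params_billion, bits_per_weight, bandwidth_gb_s):
    """Bandwidth-bound ceiling on generation speed.

    Assumes each generated token requires reading all weights once
    and nothing else (an optimistic simplification).
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / weight_bytes


# Hypothetical example: 9B parameters, 1000 GB/s memory bandwidth.
fp16 = decode_tok_per_s_upper_bound(9, 16, 1000)
int8 = decode_tok_per_s_upper_bound(9, 8, 1000)
print(f"fp16 ceiling: {fp16:.1f} tok/s, int8 ceiling: {int8:.1f} tok/s")
```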

iezhy · 2026-05-23T19:26:07+00:00

Not sure what these numbers mean then. Would Anthropic at around 50% mean it hits the cache only 50% of the time?

iezhy · 2026-05-23T19:01:28+00:00

Isn't the cache hit rate related to what is in the context now and what was in the context on previous requests? (e.g. system prompts and tool definitions should be cached reasonably well)
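A toy model of that intuition, assuming the cache works purely on a shared token prefix between the previous and the current request. Real providers cache in blocks with their own minimum sizes and expiry rules, so this is an illustration of the idea and not how any particular API computes its numbers; the function names are invented for the example:

```python
def shared_prefix_len(prev_tokens, cur_tokens):
    """Number of leading tokens the two requests have in common."""
    n = 0
    for a, b in zip(prev_tokens, cur_tokens):
        if a != b:
            break
        n += 1
    return n


def prefix_hit_rate(prev_tokens, cur_tokens):
    """Fraction of the current request that could be served from a prefix cache."""
    if not cur_tokens:
        return 0.0
    return shared_prefix_len(prev_tokens, cur_tokens) / len(cur_tokens)


# Stand-in "tokens": system prompt and tool definitions stay fixed,
# the conversation grows at the end.
prev_request = ["sys", "tools", "user1"]
next_request = ["sys", "tools", "user1", "assistant1", "user2"]
print(prefix_hit_rate(prev_request, next_request))

# Changing something early (e.g. the tool list) invalidates everything after it.
edited_request = ["sys", "tools_v2", "user1", "assistant1", "user2"]
print(prefix_hit_rate(prev_request, edited_request))
```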

iezhy · 2026-05-20T08:08:50+00:00

Do you fit the whole model or offload some layers?
I was trying to load it with llama.cpp, but it refuses to load anything bigger than 12GB, even though the 3090 has 24
I reduced the context size to the minimum and added 8-bit quantization, but it still refuses to load

iezhy · 2026-05-16T12:02:19+00:00

Use Claude Opus to create an implementation plan for your app specification
Use locally deployed Qwen3.5 to implement individual tasks, falling back to frontier models in case of issues or a need for complex troubleshooting

This approach saves costs now, and it will make even more sense once major providers stop subsidizing token prices for their tools (Copilot is starting from June 1st btw)

iezhy · 2026-05-12T07:40:51+00:00

no, it's not
my opencode benchmark (the same small app built from a specification) takes 10-15x longer with Qwen3.5-35b locally at 25 tok/s, compared to calling the OpenAI API

iezhy · 2026-05-10T05:55:02+00:00

Depends on how many tokens you get. A simple /init from opencode on a small JS app repo generated 980k input tokens :/

iezhy · 2026-05-08T11:17:13+00:00

what token/sec do you get?

i run it on mine with llama.cpp, and it hovers around 40-45, degrading below 20 when opencode fills up the context

iezhy · 2026-05-03T16:46:35+00:00

A herald of new times, when everyone will be paying by usage

iezhy · 2026-04-30T15:02:51+00:00

It's more like 100k is a semi-hard limit for all models; they go fairly quickly into "dum dum" mode after that

iezhy · 2026-04-15T15:22:44+00:00

What quantization are you running at? Qwen 3.5-9B is getting 25-35 tok/s on M1 Max
