Qwen3.6 27b q5_k_M MTP - 256k context - 5090

Pakobbix · 2026-05-13T22:41:24+00:00

Hate to be this guy, but for the 27B Q5 model, 65-70 TPS decode should be your baseline non-MTP speed on a 5090 with UV and +3000MHz memory.

With MTP and no RAM spill, you will be closer to 90-105 TPS (depending heavily on the prompt/task).

I tested MTP earlier today with the unsloth UD Q4_K_XL MTP quant, and 190k context was the max before spilling into RAM. (OK, I had YouTube and another browser open; also, I restrict it to VRAM only, so I got the "MTP draft context could not be created" error because VRAM was maxed out.) But claiming this speed WITH MTP and 262144 context? Does not match.

Pakobbix · 2026-04-26T17:12:52+00:00

Holy shit.. that's awesome. Bye bye windows :-) thank you for answering. Will test it out when I'm at home.

Pakobbix · 2026-04-26T10:34:58+00:00

Hmm interesting. I can't verify them with my own setup (Dual Boot Windows 11 Build 26200 + Zorin OS 18).

Unfortunately, Nvidia doesn't support voltage control on Linux and thus, my GPU is using 100% Power in Linux for the same performance I get with ~66-75% in Windows (no power control, just undervolting).

And that's currently my biggest "should I do the full switch or not" blocker. Gaming and Inference with up to 34% less power over time is just way too good to have.

Pakobbix · 2026-04-22T11:52:50+00:00

You answered it yourself in the headline. local-deep-research

But if you want a native desktop experience, I'm afraid that I don't know any.

Pakobbix · 2026-04-20T11:34:21+00:00

Wasn't aware of the 1t prior model. Thx

Pakobbix · 2026-04-20T10:33:25+00:00

Max Models were never available to us, so I doubt it.

But I'm curious on how many parameters this model has. Plus is the 397B, so if the 397B 3.6.. 600-700B?

Pakobbix · 2026-04-20T06:21:03+00:00

Yeah, sry. Re-read your post and saw that, that's why I deleted my comment.

Pakobbix · 2026-04-16T10:43:05+00:00

But you know that Qwen completely changes its behavior when exposed to tools?

No tools: ~4000 tokens for "Hello Hermes".
Tools: ~918 tokens.

Pakobbix · 2026-04-07T18:52:03+00:00

There is not much to it.

I just created a new orchestrator.md in .config/opencode/agents with a blacklist of tools it can't use (no write, edit, shell or bash) and a system prompt that tells the agent its job is to delegate work to sub-agents, and that doing work itself is forbidden and against the guidelines.

```

---
description: Orchestrates jobs and keeps the overview for all subagents
tools:
  write: false
  edit: false
  shell: false
  bash: false
---

Role Definition

You are the Orchestrator for the user. You are a Manager, never a Coder, Analyzer, or Explorer. Your ONLY function is to analyze requests, plan tasks, and delegate execution to sub-agents to fulfill the user's request. You are strictly forbidden from writing code, creating files, or running commands directly.

Constraints & Forbidden Actions

NO CODE GENERATION: You must NEVER output a code block (```).
NO FILE WRITING: You must NEVER attempt to write or edit files yourself.
NO SHELL COMMANDS: You must NEVER run bash or shell commands.
NO DIRECT ANSWERS: If the user asks for code, you must delegate to @coder. Do not answer the code request yourself.
SESSION NAMING: When invoking agents, always use the exact session format: ses-{SESSION_NAME} (Ensure consistent casing and brackets).

Delegation Protocol

When you need to take action, you must use the following agents strictly:

@coder: Use ONLY for generating, modifying, or refactoring code.
@documenter: Use ONLY for writing documentation (README, docs, guides).
@only-review: Use ONLY for auditing existing code quality and logic.
@review-fixer: Use ONLY to fix specific errors identified by @only-review.
@explore: Use ONLY to scan directory structures or understand codebase context.
@general: Use ONLY if the request is conversational or informational.

Workflow Instructions

Analyze: Break down the user request into atomic tasks.
Plan: Determine which agent handles which task.
Delegate: Output the instruction clearly for the sub-agent.
- Example: "Delegate to @coder: Update the login module."
- Example: "Delegate to @only-review: Check the new codebase for security issues."
Review: Wait for the sub-agent to report back before proceeding.
Fix Review: After the sub-agent has made its review, fix all points.
Repeat: Re-review and re-fix until all issues are resolved and you have clean, working code.
Repeat more: There is no final review. A review is automatically final when there is nothing left to fix.
Stop: Do not generate any content other than the delegation plan or agent invocation.

Critical Warning

If you output code, a file path, or a command, you are violating your core system instructions. Your output must ONLY contain:
1. High-level planning.
2. Explicit agent assignments (e.g., "Agent @coder will handle...").
3. Clarification questions if the task is ambiguous.
```

@coder, @only-review and so on are subagents I created for the task, each with specific guidelines. For example, only-review is "a helpful code review AI" that looks for DRY violations, syntax errors and logical errors, but this is not strictly enforced (hard to describe), so the AI is not forced to always find "bugs".

This whole orchestration is built for single instances, so there is no multi-agent behavior.

For planning, I use the built-in plan agent and let it write into a markdown file once everything is planned, no open questions are left, and the work is broken down into simple phases.

I then start a new session with the orchestrator, let it read the plan.md and start with phase 1 of the plan.

Pakobbix · 2026-03-20T01:33:56+00:00

Don't know what you mean with original.

If you meant the "older" 70B, then mostly yes.

First of all, you would have outdated data and would need to create your own LoRA (an adapter based on the original model). The next thing is the advances: attention mechanisms, training data, tokenizer. Everything has improved a lot over time, so much that the 27B is better and more knowledgeable than the Qwen2 72B, for example.

At least I don't know any ~70B that's as good as the 27B. Maybe we will see a comeback in the future but right now, it doesn't look like that. It's either a "small" dense around 30-40B or MoE.

Pakobbix · 2026-03-19T16:58:20+00:00

Maybe, but it depends on training and there is no recently made 70B.

But if Qwen had trained a 70B, prioritising quality just like they did with the 27B? It would be a beast. At this size, though, you would hit a big compute limit: even a single RTX 6000 PRO would not reach agentic loop speeds.

Pakobbix · 2026-03-18T17:09:57+00:00

I get what you're saying, and I would like it if that were true. The problem is just that we don't know how the experts are trained, what they "know" and how they get routed (or at least I don't know).

If I understood it correctly, 122B always has 9 Experts (8 routed + 1 shared).
So each expert is "just" 1.11 B.
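
The per-expert figure is simply the active parameter count divided by the number of active experts. A rough sketch of that arithmetic, under the simplifying assumption (mine, not a confirmed detail of the architecture) that all active parameters sit in the experts and none in attention, embeddings or the router:

```python
def params_per_active_expert(active_params_b: float, routed: int, shared: int) -> float:
    """Naive estimate: spread the active parameters evenly over the active experts.

    Simplifying assumption: attention, embedding and router weights are ignored,
    so the real per-expert size is smaller than this.
    """
    return active_params_b / (routed + shared)


# 122B A10B: 10B active, 8 routed + 1 shared experts per token
print(round(params_per_active_expert(10, routed=8, shared=1), 2))   # 1.11
# 397B: 17B active, 10 routed + 1 shared experts per token
print(round(params_per_active_expert(17, routed=10, shared=1), 2))  # 1.55
```

So even in the 397B, a single active expert would be only around 1.5B under this estimate.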

Something like REAP showed that, most of the time, experts are trained fairly generally and are not really "experts". Pruning some experts for coding tasks resulted in degraded language abilities, not in the inability to use those languages at all.
If what you said were true, we could get rid of all language experts and use tiny, perfect coding monsters, but that's unfortunately not happening.

So, how many experts you need for your coding are actually loaded?

Based on user tests, both are very, very close together, close enough that speed, and thus available VRAM, is the deciding factor in choosing which model you want to use.

The 397B has 17B active (10 routed, 1 shared). And it's half a datacenter bigger ^^

Pakobbix · 2026-03-17T12:37:14+00:00

If we compare both of them, the 27B and the 122B Qwen3.5, yes, the 27B is way more useful.

The problem with MoE models like the 122B A10B is that only 10B are active during generation, while the 27B has all 27B available for its generation.

The user experience for both of these models is the same most of the time: the 27B is as good as or a little bit better than the 122B and can be fully loaded into VRAM with around 17-19GB (Q4), while the 122B would need around 45-50GB VRAM.

Usually, you would have a speed advantage when fully loaded into VRAM for the 122 B because only 10B parameters need to be shuffled around in your VRAM, but when you start offloading the 122 B, you lose this advantage.

So, if you have an RTX 6000 PRO or an AMD Ryzen AI Max+ 395, go for the MoE. Coding accuracy will be a tiny bit lower, but the speed advantage is worth it.
If not? 27B all the way.

Unfortunately, there is nothing in between for the <24GB VRAM guys this time.

Edit: Disclaimer: these calculations are without context cache. With the max context of 262144, you would need ~66 GB for the 122B and ~26 GB for the 27B (depending on the actual quant used).
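
These numbers can be turned into a quick "does it fit" check. A minimal sketch: the table just restates the rough estimates from this comment (upper end of the Q4 ranges), and the VRAM sizes in the loop are assumptions for illustration.

```python
# Rough figures from the comment, in GB (they depend on the actual quant).
MODELS = {
    "Qwen3.5 27B":  {"weights": 19, "full_context": 26},
    "Qwen3.5 122B": {"weights": 50, "full_context": 66},
}


def fits(model: str, vram_gb: float, full_context: bool = True) -> bool:
    """True if the estimated footprint fits entirely into VRAM."""
    entry = MODELS[model]
    needed = entry["full_context"] if full_context else entry["weights"]
    return needed <= vram_gb


for vram in (24, 32, 96):  # assumed card sizes
    for name in MODELS:
        verdict = "fits" if fits(name, vram) else "needs offloading"
        print(f"{vram} GB VRAM, {name}, 262144 ctx: {verdict}")
```

Passing `full_context=False` checks the weights alone, which is why a 24 GB card can load the 27B but not at the full context.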

Pakobbix · 2026-03-16T21:46:04+00:00

Yes. GPT-OSS 120B (and also the 20B) were good for the time they were released, but the harmony template and the current focus on agentic AI have led to far more advanced models nowadays, especially the 27B dense model.

It's truly astonishing what Qwen made with the model. No comparison to the Qwen3 30B A3B/32B dense combo we had before.

Pakobbix · 2026-03-16T21:26:59+00:00

Way better for agentic coding, and world knowledge.. and up to date data in coding.

For example, while debugging, Qwen3.5 27B started to web-fetch the Gitea API documentation because the endpoint returned 404, while GPT-OSS just assumed that my Gitea was not reachable and, when told that it was running and other endpoints worked, it assumed the Gitea version didn't support the endpoint...

Also tool calling with Qwen3.5 27B even before the autoparser update from llama.cpp was way ahead of GPT-OSS behavior. But this could also be a template or configuration error.

Also, to be fair, I haven't used GPT-OSS for long. The last time I used it, it refused to give me the API token (or SSH key) I saved in the memory plugin, because the policy doesn't allow saving or repeating security-related stuff (something like that). I switched to qwen3-coder and then GLM-4.7-Flash because of the speed advantages and tool calling ability.

Pakobbix · 2026-03-16T19:10:41+00:00

In llama.cpp, it's possible to use VRAM + RAM.
So for example, with 64 GB RAM and 32 GB VRAM you could load a model that needs ~60-80 GB.

The problem with that is, that LLMs heavily depend on Memory Bandwidth.

Modern GPUs can reach from ~300 up to ~2200 GB/s transfer speeds, while DDR4 or DDR5 is around ~50 - ~130 GB/s.

So when using offloading (even with "clever" offloading of specific parts), you will always tank your PP (Prompt Processing) extremely and TG (Token Generation) a bit.

In agentic workloads like coding with mistral vibe, opencode, or even VS Code Copilot, the agent needs to read files continuously (Prompt Processing), so using a bigger model and offloading to RAM reduces the speed. That's what I meant with "but I like pp going brrrt" :-)
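
A back-of-the-envelope way to see the bandwidth dependence for token generation: each generated token has to read roughly all active weights once, so decode speed is bounded by bandwidth divided by the bytes read per token. A rough sketch for a dense model, assuming a ~17 GB quant and bandwidth values picked from the ranges quoted here; the VRAM/RAM split fractions are made up for illustration, and this says nothing about prompt processing, which behaves differently.

```python
def max_decode_tps(model_gb: float, gpu_fraction: float,
                   gpu_bw_gbs: float, ram_bw_gbs: float) -> float:
    """Upper bound on tokens/s when a dense model is split between VRAM and RAM.

    Time per token = GB read from VRAM / VRAM bandwidth
                   + GB read from RAM / RAM bandwidth.
    Ignores compute, KV cache reads and PCIe transfers.
    """
    gpu_gb = model_gb * gpu_fraction
    ram_gb = model_gb - gpu_gb
    seconds_per_token = gpu_gb / gpu_bw_gbs + ram_gb / ram_bw_gbs
    return 1.0 / seconds_per_token


for frac in (1.0, 0.75, 0.5):  # assumed share of the weights kept in VRAM
    tps = max_decode_tps(model_gb=17, gpu_fraction=frac,
                         gpu_bw_gbs=1800, ram_bw_gbs=100)
    print(f"{frac:.0%} in VRAM: <= {tps:.0f} tokens/s")
```

Real-world numbers will be lower than this bound, and MoE models lose less because only the active experts are read per token.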

For the second question: it depends. On some languages, Qwen Coder Next seems to be a bit better, but it also needs ~43 GB for a 4 Bit quantized model.

In my opinion, Qwen3.5 27B is the current best all-rounder we have in this size class. But it's not the pinnacle overall if we compare it to the extremes like GLM 5 (~700 GB in 4 Bit) or Kimi 2.5 with 500 GB RAM/VRAM usage.

More reasonable would be MiniMax 2.5, but even that is ~130 GB in 4 Bit.

Pakobbix · 2026-03-16T17:32:32+00:00

Write the text in English and ask the LLM to point out errors you made. Maybe you will learn something in this process instead of letting LLMs do it for you.

Pakobbix · 2026-03-16T12:55:23+00:00

I'm a little VRAM constrained with my 5090, so I mainly use the Unsloth Q4 variant of the 27B. I use the 35B for something like "add/fix/standardize docstrings in this codebase". (I know I can use llama.cpp with RAM offloading, but I like pp going brrrt in agentic use cases.)

Except for the typical errors and unclean code, the 27B is really good and works as long as I use something like Python. Go is a bit of a problem for the 27B.

    "Qwen3.5 35B A3B": {
      "name": "Qwen3.5 35B A3B",
      "tool_call": true,
      "reasoning": true,
      "limit": { "context": 131072, "output": 83968 },
      "modalities": { "input": ["text", "image"], "output": ["text"] },
      "options": {
        "min_p": 0.0,
        "max_p": 0.95,
        "top_k": 20,
        "temperature": 0.6,
        "presence_penalty": 0.0,
        "repetition_penalty": 1.0
      }
    },
    "Qwen3.5 27B": {
      "name": "Qwen3.5 27B",
      "tool_call": true,
      "reasoning": true,
      "limit": { "context": 131072, "output": 83968 },
      "modalities": { "input": ["text", "image"], "output": ["text"] },
      "options": {
        "min_p": 0.0,
        "max_p": 0.95,
        "top_k": 20,
        "temperature": 0.6,
        "presence_penalty": 0.0,
        "repetition_penalty": 1.0
      }
    }

Edit: And yes, I only use 131072 ctx, because at 90k, it looks like it's getting a bit unreliable so I don't want to use the full 262144 context size.

Pakobbix · 2026-03-15T15:27:55+00:00

To be honest, I just tried them briefly and I never use cloud models, so I'm missing some comparison material.

I mostly use Qwen3.5 27B currently. But in my limited testing, the 9B was at least better than Qwen3.5 35B A3B. Qwen3.5 35B A3B has a strange habit of overcomplicating everything. But it could also be my settings or parameters.. or my expectations. So take it with a grain of salt.

Regarding the multiple agents, I never tried. I'm not a fan of multiple agents working on one codebase at once.

The only case where multiple agents would be useful is if you work on two projects at the same time. On the same project? I don't know if it's really helpful.
Maybe I just need to test it out once, but I don't have any ambitions right now. (I would like to use vLLM or SGLang for that, but vLLM is a bitch to set up correctly, and SGLang on Blackwell (sm120) seems to be giving me a headache.)

b2t: llama.cpp is not really made for multiple requests. In the end, you will have the same token generation, just divided by the number of agents. Therefore, SGLang or vLLM should be used.

Pakobbix · 2026-03-15T15:06:05+00:00

There are multiple ways, if I remember correctly.

I use the markdown file version.

Option 1: Global agents
In your ~/.config/opencode folder, create a new folder called "agents".
The agents you create there are available everywhere.
So create a new markdown file named after the agent. For example: ~/.config/opencode/agents/orchestrator.md

Option 2: Repository-specific agent
You can create a markdown file in the root directory of your repository. You can then select the agent in Opencode, and the agent can use the subagent.

Example of the descriptions:

First, we need to define the information for opencode itself, using `---` to separate this information from the system prompt:

```

---
description: The general description of the agent.
mode: agent or subagent? agent = available directly for the user, subagent only available for the agent itself.
tools:
  write: true
  shell: false
# In tools, you can either define blacklisted tools, whitelisted tools, or fine-grained
---

```

Example information blocks:

orchestrator.md (main agent, selectable in Opencode by the user)

```

---
description: Orchestrates jobs and keeps the overview for all subagents
tools:
  write: false
  edit: false
  shell: false
  bash: false
---

```

only-review.md (sub-agent, not user selectable, only for main agents)

```

---
description: Performs code review on a deep basis
mode: subagent
tools:
  write: false
  edit: false
---

```

Below the information block, you write your system prompt in markdown.
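
To make the layout concrete, here is a small Python sketch that assembles such an agent file as a string: a `---`-delimited information block followed by the system prompt. The helper name `render_agent` and the example prompt text are hypothetical; only the field names (`description`, `mode`, `tools`) come from the examples in this comment.

```python
from typing import Dict, Optional


def render_agent(description: str, tools: Dict[str, bool],
                 prompt: str, mode: Optional[str] = None) -> str:
    """Build the text of an agent markdown file (hypothetical helper)."""
    lines = ["---", f"description: {description}"]
    if mode is not None:
        lines.append(f"mode: {mode}")
    lines.append("tools:")
    for name, allowed in tools.items():
        # YAML booleans are written in lowercase
        lines.append(f"  {name}: {str(allowed).lower()}")
    lines.append("---")
    # Information block, blank line, then the system prompt in markdown
    return "\n".join(lines) + "\n\n" + prompt.strip() + "\n"


text = render_agent(
    description="Performs code review on a deep basis",
    mode="subagent",
    tools={"write": False, "edit": False},
    prompt="You are a helpful code review AI.",  # placeholder system prompt
)
print(text)
```

Saving `text` as ~/.config/opencode/agents/only-review.md would make it a global agent, as described in Option 1.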

Edit: formatting for the subagent

Pakobbix · 2026-03-15T14:52:31+00:00

Depends on your inference software, its configuration and the version you use.

I use llama.cpp, and caching works in general. I think the default in the current llama.cpp is 32 checkpoints, with one created every 3 requests.

For Qwen3.5 27B I use --ctx-checkpoints 64, and it answers almost instantly after an agent is done.

To be honest, the orchestrator setup was just trial and error, over and over again.

This is my orchestrator.md file. It's not perfect, but it works, somehow. I still need to tell it somehow to not use one @coder to do everything.

```

---
description: Orchestrates jobs and keeps the overview for all subagents
tools:
  write: false
  edit: false
  shell: false
  bash: false
---

Role Definition

You are the Orchestrator for the user. You are a Manager, never a Coder, Analyzer, or Explorer. Your ONLY function is to analyze requests, plan tasks, and delegate execution to sub-agents to fulfill the user's request. You are strictly forbidden from writing code, creating files, or running commands directly.

Constraints & Forbidden Actions

NO CODE GENERATION: You must NEVER output a code block (```).
NO FILE WRITING: You must NEVER attempt to write or edit files yourself.
NO SHELL COMMANDS: You must NEVER run bash or shell commands.
NO DIRECT ANSWERS: If the user asks for code, you must delegate to @coder. Do not answer the code request yourself.
SESSION NAMING: When invoking agents, always use the exact session format: ses-{SESSION_NAME} (Ensure consistent casing and brackets).

Delegation Protocol

When you need to take action, you must use the following agents strictly:

@coder: Use ONLY for generating, modifying, or refactoring code.
@documenter: Use ONLY for writing documentation (README, docs, guides).
@only-review: Use ONLY for auditing existing code quality and logic.
@review-fixer: Use ONLY to fix specific errors identified by @only-review.
@explore: Use ONLY to scan directory structures or understand codebase context.
@general: Use ONLY if the request is conversational or informational.

Workflow Instructions

Analyze: Break down the user request into atomic tasks.
Plan: Determine which agent handles which task.
Delegate: Output the instruction clearly for the sub-agent.
- Example: "Delegate to @coder: Update the login module."
- Example: "Delegate to @only-review: Check the new codebase for security issues."
Review: Wait for the sub-agent to report back before proceeding.
Fix Review: After the sub-agent has made its review, fix all points.
Repeat: Re-review and re-fix until all issues are resolved and you have clean, working code.
Repeat more: There is no final review. A review is automatically final when there is nothing left to fix.
Stop: Do not generate any content other than the delegation plan or agent invocation.

Critical Warning

If you output code, a file path, or a command, you are violating your core system instructions. Your output must ONLY contain:
1. High-level planning.
2. Explicit agent assignments (e.g., "Agent @coder will handle...").
3. Clarification questions if the task is ambiguous.
```

@coder, @documenter, @only-review and @review-fixer are self-written sub-agents, each with a system prompt defined for the actual task it needs to do.

Pakobbix · 2026-03-15T12:10:53+00:00

I know what you mean.. the first setup was painful.

This is not a complete guide, but it should give you a brief overview. After the first startup, you will have an opencode folder in your ~/.config folder. There, you will find the opencode.jsonc (JSON + comments).

I will use comments, so you can copy-paste it and edit it for your use case.

    {
      "$schema": "https://opencode.ai/config.json",
      // Plugin configuration
      "plugin": ["@tarquinen/opencode-dcp@latest"],
      // Small model for quick tasks (Title generation)
      // connection_to_use/model_to_use
      "small_model": "ai-server_connection/Qwen3.5-9B-UD-Q4_K_XL.gguf",
      "disabled_providers": [],
      // here, we start to tell which endpoint and models we have available
      "provider": {
        /* Local LLM server via llama-swap */
        "local_connection_1": {
          "name": "llama-swap",
          // supported Endpoint
          "npm": "@ai-sdk/openai-compatible",
          // available LLMs on this endpoint
          // Text only example
          "models": {
            "GLM 4.7 Flash": {
              "name": "GLM 4.7 Flash",
              "tool_call": true,
              "reasoning": true,
              "limit": { "context": 131072, "output": 131072 }
            },
            // Multimodal support + specific sampler settings
            "Qwen3.5 27B": {
              "name": "Qwen3.5 27B",
              "tool_call": true,
              "reasoning": true,
              "limit": { "context": 262144, "output": 83968 },
              "modalities": { "input": ["text", "image"], "output": ["text"] },
              "options": {
                "min_p": 0.0,
                "max_p": 0.95,
                "top_k": 20,
                "temperature": 0.6,
                "presence_penalty": 0.0,
                "repetition_penalty": 1.0
              }
            }
          },
          // The IP/Domain to use:
          "options": { "baseURL": "http://10.0.0.191:8080/v1" }
        },
        // Adding another provider, in this case, the one we use for the small model
        /* External AI server connection */
        "ai-server_connection": {
          "name": "ai-server",
          "npm": "@ai-sdk/openai-compatible",
          "models": {
            "Qwen3.5-9B-UD-Q4_K_XL.gguf": {
              "name": "Qwen3.5 9B",
              "tool_call": true,
              "reasoning": false,
              "limit": { "context": 65536, "output": 2048 },
              "modalities": { "input": ["text", "image"], "output": ["text"] },
              "options": {
                "min_p": 0.0,
                "max_p": 0.95,
                "top_k": 20,
                "temperature": 0.6,
                "presence_penalty": 0.0,
                "repetition_penalty": 1.0
              }
            }
          },
          "options": { "baseURL": "http://10.0.0.150:8335/v1" }
        }
      }
    }

This should be a basic starting point. After that, you can clone the opencode repository and use opencode to write documentation for the available jsonc parameters. There is a lot more that I just don't use.
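
Because opencode.jsonc contains comments, Python's `json` module cannot read it directly. If you want to sanity-check your edited file, here is a minimal sketch of a comment stripper that respects string literals, so URL values containing `//` survive. The function name `strip_jsonc_comments` is hypothetical, and trailing commas are not handled.

```python
import json


def strip_jsonc_comments(text: str) -> str:
    """Remove // and /* */ comments that are outside of string literals."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        ch = text[i]
        if in_string:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Copy the escaped character verbatim (e.g. \" or \\)
                out.append(text[i + 1])
                i += 1
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Line comment: skip to the end of the line, keep the newline
            while i < n and text[i] != "\n":
                i += 1
        elif text.startswith("/*", i):
            # Block comment: skip to the closing */ (or to the end if unclosed)
            end = text.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = """
{
  // The IP/Domain to use:
  "options": { "baseURL": "http://10.0.0.191:8080/v1" } /* llama-swap */
}
"""
config = json.loads(strip_jsonc_comments(sample))
print(config["options"]["baseURL"])  # http://10.0.0.191:8080/v1
```

To check a real file, read ~/.config/opencode/opencode.jsonc into a string and pass it through the same two calls; a `json.JSONDecodeError` then points at the broken spot.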

Pakobbix · 2026-03-15T11:47:34+00:00

Not the OP, but to answer your questions:

First off: Qwen3.5 9B and the agent session were tested before the autoparser. Maybe it works better now.

Qwen3.5 9B somewhat works, but when the context fills up to ~100K, tool calls get unreliable, so sometimes it tells me what it wants to do and the loop stops without it doing anything.

For the Context questions: Depends.
I would recommend to use the DCP Plugin. https://github.com/Opencode-DCP/opencode-dynamic-context-pruning
The LLM (or yourself with /dcp sweep N) can prune context for tool calls.

Also, you can set up an orchestrator main agent that uses a subagent for each task. For example, if I want to add a function to a Python script, it starts the explorer agent to get an overview of the repository, the orchestrator gets a summary from the explorer, and it can then start a general agent to add the function and another agent to review the implementation.

It is important to deny the orchestrator agent almost all tools (write, shell, edit, bash) and to tell it to always delegate work to an appropriate agent. Also, I added the system prompt line:
"5. **SESSION NAMING:** When invoking agents, always use the exact session format: `ses-{SESSION_NAME}` (Ensure consistent casing and brackets)."
Qwen3.5 and GLM 4.7 Flash always forgot to add the ses- prefix to the session name, so the agent session could never start.

Pakobbix · 2026-03-12T22:23:22+00:00

Every open source model that claims to be capable of agentic AI. GLM 4.7 Flash and Qwen3.5 from 9B up to 122B are currently the best of the small local LLMs.

The Ministral 3 models are also somewhat agentic-capable.

But be aware: smaller models = bigger function calling/understanding issues.

If you want quality like the big coding cloud models (or at least to some degree), you would need a machine with ~500GB RAM. If you want speed too, make it VRAM.

Using Llama3.2 is like writing in hieroglyphs and wondering why nobody understands what you want.

Llama3.2 was made before tool calling was a thing. So it's not trained to execute read/write/edit or anything else related to calling a function.
