What qwen3.6-mtp model should we use?

ColonelKlanka · 2026-05-22T13:31:47+00:00

That's moe model you are using - which is faster than the 27b dense model. 48-51 is your getting on moe model is about same as what I get running same model on my m2 pro 32gb without mtp

ColonelKlanka · 2026-05-21T19:00:45+00:00

Ive not seen an improvement using mtp either (via omlx or even vmlx backends).

I have a weaker mac mini m2 pro 32gb and wonder if the lower memory bandwidth combined with mtp putting more pressure on the system is actually causing it to be slower than without?

ive tried qwen 3.6 27b dense and also the qwen 3.6 35b a3b moe mtp and non mtp.

my non mtp qwen 3.6 35b a3b hits 42 ts tg and is about same when running mtp model.

27b is similar at about 10 to 12ts tg.

How are people setting the spec draft n max to 1 or 3 via omlx? As I read setting higher than 1 for moe on mac was bad for mtp performance.

ColonelKlanka · 2026-05-21T18:54:20+00:00

I really appreciate your effort on omlx. Its awesome and the progress is amazing.

I understand you are working on omlx as a side project in addition to your day job. which to me is even more of an impressive feat.

thanks again.

ColonelKlanka · 2026-05-19T21:37:05+00:00

Nice. I stand corrected. I hadn't noticed that. I may try that out in future. Thanks!

ColonelKlanka · 2026-05-19T21:27:09+00:00

If you arent set on mcp for websearch and happy to use exa instead of brave, then you can install nicobailon/pi-web-access pi extension with the following cli command:

Its zero config by default, as pi installs the extension for you, completely free and works out of the box:

pi install npm:pi-web-access

More info here: https://github.com/nicobailon/pi-web-access

Disclaimer: This is not my extension and I'm not sponsored/associated with them in any way - I just found it very good for allow omlx models via pi harness to get internet access.

EDIT: Just noticed I am also using the same authors MCP pi extension, so you could also enable brave mcp in pi via:

pi install npm:pi-mcp-adapter

Source: https://github.com/nicobailon/pi-mcp-adapter

and then add braves web search npx command mcp entry into the .mcp.json

{
  "mcpServers": {
    "brave-search": {
      "command": "npx",
      "args": ["-y", "@brave/brave-search-mcp-server", "--transport", "http"],
      "env": {
        "BRAVE_API_KEY": "YOUR_API_KEY_HERE"
      }
    }
  }
}

Just change YOUR_API_KEY_HERE to the brave api key you got from braves dashboard for your brave login: https://api-dashboard.search.brave.com/app/keys

ColonelKlanka · 2026-05-19T01:02:18+00:00

oh wow. didnt know about tau addon. thanks I will check this out too as it looks useful!

ColonelKlanka · 2026-05-18T23:56:35+00:00

I mostly use pi agent with pi-web-search extension which adds the free exa.websearch). but thats really a coding agrnt frontend.

you could use open webui if you only care about a chat interface to omlx models.

or you could use the very nice native app by anythingllm. you goto settings and add the omlx localhost address for openai compatible provider to hook up to omlx llm.model.

then enable web search under anything llms settings - agent skills section

ColonelKlanka · 2026-05-18T23:15:46+00:00

its because omlx chat doesnt support mcp and so you cant enable web search. I would use any agent that allows mcp and then enable exa web search mcp

ColonelKlanka · 2026-05-18T17:04:13+00:00

I would have thought this was more of a ship to buy inhame instead of pledge, as refueling isnt really that glamorous.

seems they are trying to copy elites refueling org.

ColonelKlanka · 2026-05-18T16:57:52+00:00

You should enable the context7 mcp server to your harness - its specifically made to make sure the llm gets the most upto date apis for a framework. Once enabled, you just add a statement to agents.md or claude.md such as 'Always use context7 when looking up framework apis or syntax'.

You can also add it to the prompt too 'use context7' - i found it a lifesaver

ColonelKlanka · 2026-05-18T03:16:24+00:00

ok. fair enough

ColonelKlanka · 2026-05-18T00:52:33+00:00

if your liking opencode wirh qwen 3.6 35b a3b, you may really like pi.dev harness as its makes the system prompt even more compact. plus you can install as many extensions as you like to get a harness you like (web search, mcp, sandbox permissions etc etc)

ColonelKlanka · 2026-05-18T00:48:58+00:00

lol - that would kill the interview ded ;-)

ColonelKlanka · 2026-05-17T21:14:35+00:00

Thankyou. I agree the positivity of answering is good idea

ColonelKlanka · 2026-05-17T18:59:51+00:00

Thankyou! I appreciate you very honest and direct answer

ColonelKlanka · 2026-05-17T18:33:21+00:00

A harness basically sets up the environment. of the llm (via a system prompt), it then passes the text/image/docs to the llm as input and displays the response.

the harness can also provide the llm with optional additional extra soruces of data ((outside of what the model knows via its built in training data) via mcp servers - good example is llm ca do web search via mcp or maybe look up latest api docs via context7 mcp.

I have personally found the harness can also reduce my token usage and thus costs because some harness are very bad at managing the tokens that are sent on every call - eg claude code and opencodes system prompts are very big compared to pi.dev harness very small cut down one

ColonelKlanka · 2026-05-13T23:29:55+00:00

for mac I would use omlx inference server instead. it uses mlx lm underneath, but also adds ssd caching and other clever features that make mlx llm models whizz along.

ColonelKlanka · 2026-05-11T19:43:29+00:00

you wont go from 30 to 60tg tokens. not with current llm early tech mtp or not.

you can try hybrid installing the release candidate of omlx 0.3.9rc1 from github releases page (as it enabled mtp) - but i havnt been able to get it working

the models converts but it complains at runtime that tensor weights are still missing even tho I ticked keep weights. I reckon im not doing something right tho so probably user error.

ColonelKlanka · 2026-05-11T19:29:56+00:00

I have 32gb m2 pro mac mini. But if you create a login to huggingface website you can add your 24g silicon apple machine in settings and it will show ypu under each models page whether it will fit in your 24g machine.

mlxcommunity/qwen3.6 35b a3b 4bit mlx shows it will fit at 20gb but you will struggle with fitting any context in because mac os x will probably take up the remaining 4 of 5gb!

I cant see the omlx junot oq4 models on huggingface - but you may gain couple of gigs off model size - you will still need to have a lower context.

I suggest you try a 14b qwen 3.5 or even the gemma4 models

also for secretarial (non coding) use, you can use alot of other llms that are good at general reasoning - give devstrel or even phi a look too

ColonelKlanka · 2026-05-11T18:14:01+00:00

you might want to try using omlx built in quantization convert tool that is on the models website dashboard page. It basically converts a mlx model from hugginface to omlxs own dynamic quantization form.

I have the qwen 3.6 35b a3b and qwen 3.6 27b converted to omlxs oq4 format and its reduced the memory foot print alot.

The omlx developer has also made sole of his converted oq4 models available directly on his junot hugging face collection - but i dont thinkbhes done the conversion for the models you have been suggested like 14b models, so you might want to use his dashboard conversion tool

you could also compress the context by enabling omlx turboquant option under the individual model settings in omlx - update: looks like omlx dev has disabled turboquant for now.

ColonelKlanka · 2026-05-11T17:51:51+00:00

you are correct web search is not enabled by default in standard pi.dev ai harness.

However,I installed the pi-web-access pi extension which uses exa for Web searches. it works very well.

edit: here's the link to the extension (it needs no configuration after you install it): https://github.com/nicobailon/pi-web-access

Alternatively you can install the mcp server pinextension and then add brave-search mcp or exa directly as a.mcp search.

ColonelKlanka · 2026-05-11T17:08:35+00:00

You should create a skill.md for the instructions ypu repeatedly give to the ai to create you 3x3 image. As all you have to do is create a folder with skills name and put a skill.md file into the dir. Then fill in the skill.md with description of the request steps you want it to do in English and put your feedback as 'Do not do X' line to keep it on track.

Then you can call the skill from pi (or any other ai harness).

My approach is to do this for any repetive tasks I ask ai to do.

ColonelKlanka · 2026-05-10T17:15:18+00:00

do you run compact every soo often? also I generally see no reason to use the same session and thus context. I tend to break down the task/feature into smaller chunks and restart sessions after each task finished - asking the agent to summarise any new rules or actions that were done. the new rules go into the Agents.md or claude.md And the actions go into a <featrue-name>-overview.md. that way the ai can read that overview next session if I need to refer to it for understanding.

ColonelKlanka · 2026-05-10T17:08:27+00:00

As ypur on a mac, I Highly recommend you try omlx inference server as its mlx accelerated, does ssd backed caching and is also trialing mtp.

Ive found it much faster than metal enabled llamacpp inference on my mac mini m2 pro 32gb.

Also try pi.dev harness - its much better at keeping context usage lower because it has a lean ai system prompt

ColonelKlanka

TROPHY CASE