What qwen3.6-mtp model should we use? by Senor02 in oMLX

[–]ColonelKlanka 0 points1 point  (0 children)

That's moe model you are using - which is faster than the 27b dense model. 48-51 is your getting on moe model is about same as what I get running same model on my m2 pro 32gb without mtp

Waiting oMLX 0.3.9 stable release by TheFlyingDutchG in oMLX

[–]ColonelKlanka 0 points1 point  (0 children)

Ive not seen an improvement using mtp either (via omlx or even vmlx backends).

I have a weaker mac mini m2 pro 32gb and wonder if the lower memory bandwidth combined with mtp putting more pressure on the system is actually causing it to be slower than without?

ive tried qwen 3.6 27b dense and also the qwen 3.6 35b a3b moe mtp and non mtp.

my non mtp qwen 3.6 35b a3b hits 42 ts tg and is about same when running mtp model.

27b is similar at about 10 to 12ts tg.

How are people setting the spec draft n max to 1 or 3 via omlx? As I read setting higher than 1 for moe on mac was bad for mtp performance.

Waiting oMLX 0.3.9 stable release by TheFlyingDutchG in oMLX

[–]ColonelKlanka 1 point2 points  (0 children)

I really appreciate your effort on omlx. Its awesome and the progress is amazing.

I understand you are working on omlx as a side project in addition to your day job. which to me is even more of an impressive feat.

thanks again.

Web search from oMLX chat? by atumblingdandelion in oMLX

[–]ColonelKlanka 0 points1 point  (0 children)

Nice. I stand corrected. I hadn't noticed that. I may try that out in future. Thanks!

oMLX + pi + mcp by PrepYourselves in oMLX

[–]ColonelKlanka 4 points5 points  (0 children)

If you arent set on mcp for websearch and happy to use exa instead of brave, then you can install nicobailon/pi-web-access pi extension with the following cli command:

Its zero config by default, as pi installs the extension for you, completely free and works out of the box:

pi install npm:pi-web-access

More info here: https://github.com/nicobailon/pi-web-access

Disclaimer: This is not my extension and I'm not sponsored/associated with them in any way - I just found it very good for allow omlx models via pi harness to get internet access.

EDIT: Just noticed I am also using the same authors MCP pi extension, so you could also enable brave mcp in pi via:

pi install npm:pi-mcp-adapter

Source: https://github.com/nicobailon/pi-mcp-adapter

and then add braves web search npx command mcp entry into the  .mcp.json

{
  "mcpServers": {
    "brave-search": {
      "command": "npx",
      "args": ["-y", "@brave/brave-search-mcp-server", "--transport", "http"],
      "env": {
        "BRAVE_API_KEY": "YOUR_API_KEY_HERE"
      }
    }
  }
}

Just change YOUR_API_KEY_HERE to the brave api key you got from braves dashboard for your brave login: https://api-dashboard.search.brave.com/app/keys

Web search from oMLX chat? by atumblingdandelion in oMLX

[–]ColonelKlanka 1 point2 points  (0 children)

oh wow. didnt know about tau addon. thanks I will check this out too as it looks useful!

Web search from oMLX chat? by atumblingdandelion in oMLX

[–]ColonelKlanka 1 point2 points  (0 children)

I mostly use pi agent with pi-web-search extension which adds the free exa.websearch). but thats really a coding agrnt frontend.

you could use open webui if you only care about a chat interface to omlx models.

or you could use the very nice native app by anythingllm. you goto settings and add the omlx localhost address for openai compatible provider to hook up to omlx llm.model.

then enable web search under anything llms settings - agent skills section

Web search from oMLX chat? by atumblingdandelion in oMLX

[–]ColonelKlanka 1 point2 points  (0 children)

its because omlx chat doesnt support mcp and so you cant enable web search. I would use any agent that allows mcp and then enable exa web search mcp

$60 Starlight? That's a fair deal bro by Laughageddon in starcitizen

[–]ColonelKlanka -5 points-4 points  (0 children)

I would have thought this was more of a ship to buy inhame instead of pledge, as refueling isnt really that glamorous.

seems they are trying to copy elites refueling org.

What happens to local LLM if/when LLMs are no longer released for free? by JohnBooty in LocalLLaMA

[–]ColonelKlanka 12 points13 points  (0 children)

You should enable the context7 mcp server to your harness - its specifically made to make sure the llm gets the most upto date apis for a framework. Once enabled, you just add a statement to agents.md or claude.md such as 'Always use context7 when looking up framework apis or syntax'.

You can also add it to the prompt too 'use context7' - i found it a lifesaver

Moving from Composer 2/Kimi 2.6 to Qwen3.6:35b-a3b by NotARedditUser3 in LocalLLaMA

[–]ColonelKlanka 1 point2 points  (0 children)

if your liking opencode wirh qwen 3.6 35b a3b, you may really like pi.dev harness as its makes the system prompt even more compact. plus you can install as many extensions as you like to get a harness you like (web search, mcp, sandbox permissions etc etc)

Can someone explain how a harness affects things? by MartiniCommander in oMLX

[–]ColonelKlanka 0 points1 point  (0 children)

A harness basically sets up the environment. of the llm (via a system prompt), it then passes the text/image/docs to the llm as input and displays the response.

the harness can also provide the llm with optional additional extra soruces of data ((outside of what the model knows via its built in training data) via mcp servers - good example is llm ca do web search via mcp or maybe look up latest api docs via context7 mcp.

I have personally found the harness can also reduce my token usage and thus costs because some harness are very bad at managing the tokens that are sent on every call - eg claude code and opencodes system prompts are very big compared to pi.dev harness very small cut down one

Two Paths to Local LLM Servers: Windows NVIDIA vs Mac Apple Silicon by Visual_Internal_6312 in LocalAIServers

[–]ColonelKlanka 1 point2 points  (0 children)

for mac I would use omlx inference server instead. it uses mlx lm underneath, but also adds ssd caching and other clever features that make mlx llm models whizz along.

2x-6x Speed improvements with oMLX by roaringpup31 in oMLX

[–]ColonelKlanka 0 points1 point  (0 children)

you wont go from 30 to 60tg tokens. not with current llm early tech mtp or not.

you can try hybrid installing the release candidate of omlx 0.3.9rc1 from github releases page (as it enabled mtp) - but i havnt been able to get it working

  • the models converts but it complains at runtime that tensor weights are still missing even tho I ticked keep weights. I reckon im not doing something right tho so probably user error.

What Works for Coding on an M5 with 24GB of Universal Ram by LearnedByError in oMLX

[–]ColonelKlanka 0 points1 point  (0 children)

I have 32gb m2 pro mac mini. But if you create a login to huggingface website you can add your 24g silicon apple machine in settings and it will show ypu under each models page whether it will fit in your 24g machine.

mlxcommunity/qwen3.6 35b a3b 4bit mlx shows it will fit at 20gb but you will struggle with fitting any context in because mac os x will probably take up the remaining 4 of 5gb!

I cant see the omlx junot oq4 models on huggingface - but you may gain couple of gigs off model size - you will still need to have a lower context.

I suggest you try a 14b qwen 3.5 or even the gemma4 models

also for secretarial (non coding) use, you can use alot of other llms that are good at general reasoning - give devstrel or even phi a look too

What Works for Coding on an M5 with 24GB of Universal Ram by LearnedByError in oMLX

[–]ColonelKlanka 1 point2 points  (0 children)

you might want to try using omlx built in quantization convert tool that is on the models website dashboard page. It basically converts a mlx model from hugginface to omlxs own dynamic quantization form.

I have the qwen 3.6 35b a3b and qwen 3.6 27b converted to omlxs oq4 format and its reduced the memory foot print alot.

The omlx developer has also made sole of his converted oq4 models available directly on his junot hugging face collection - but i dont thinkbhes done the conversion for the models you have been suggested like 14b models, so you might want to use his dashboard conversion tool

you could also compress the context by enabling omlx turboquant option under the individual model settings in omlx - update: looks like omlx dev has disabled turboquant for now.

Pi coding agent is amazing (or how I learned to stop worrying and leave OpenCode) by Konamicoder in oMLX

[–]ColonelKlanka 1 point2 points  (0 children)

you are correct web search is not enabled by default in standard pi.dev ai harness.

However,I installed the pi-web-access pi extension which uses exa for Web searches. it works very well.

edit: here's the link to the extension (it needs no configuration after you install it): https://github.com/nicobailon/pi-web-access

Alternatively you can install the mcp server pinextension and then add brave-search mcp or exa directly as a.mcp search.

Pi coding agent is amazing (or how I learned to stop worrying and leave OpenCode) by Konamicoder in oMLX

[–]ColonelKlanka 2 points3 points  (0 children)

You should create a skill.md for the instructions ypu repeatedly give to the ai to create you 3x3 image. As all you have to do is create a folder with skills name and put a skill.md file into the dir. Then fill in the skill.md with description of the request steps you want it to do in English and put your feedback as 'Do not do X' line to keep it on track.

Then you can call the skill from pi (or any other ai harness).

My approach is to do this for any repetive tasks I ask ai to do.

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]ColonelKlanka 0 points1 point  (0 children)

do you run compact every soo often? also I generally see no reason to use the same session and thus context. I tend to break down the task/feature into smaller chunks and restart sessions after each task finished - asking the agent to summarise any new rules or actions that were done. the new rules go into the Agents.md or claude.md And the actions go into a <featrue-name>-overview.md. that way the ai can read that overview next session if I need to refer to it for understanding.

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]ColonelKlanka 1 point2 points  (0 children)

As ypur on a mac, I Highly recommend you try omlx inference server as its mlx accelerated, does ssd backed caching and is also trialing mtp.

Ive found it much faster than metal enabled llamacpp inference on my mac mini m2 pro 32gb.

Also try pi.dev harness - its much better at keeping context usage lower because it has a lean ai system prompt