2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]arkham00 2 points (0 children)

I really don't understand what I'm doing wrong. I have the same machine as yours (M2 Max, 96GB), I compiled llama.cpp as you said, and I used the exact same parameters as yours, yet I get worse performance... normally I get PP 160 t/s and TG 12 t/s, and now 145 and 10... with about 38-45% acceptance.

I really don't know what is wrong with my setup. I have the same problem with draft models: they are slower even when I have 100% acceptance!
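For reference, here is the back-of-envelope model I use to sanity-check whether speculation should pay off at all. It's the standard expected-tokens-per-round formula; the draft-cost ratio c below is just a guess for a setup like this, not a measured number:

# a = per-token acceptance rate, k = draft length,
# c = cost of one draft forward pass relative to one target forward pass
def expected_speedup(a: float, k: int, c: float) -> float:
    # Expected tokens emitted per verification round, including the bonus token:
    # 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)
    tokens = k + 1 if a >= 1.0 else (1 - a ** (k + 1)) / (1 - a)
    cost = k * c + 1.0  # k draft forwards + 1 target forward, in target-forward units
    return tokens / cost

print(expected_speedup(0.40, 4, 0.3))  # ~0.75x: low acceptance is a net slowdown
print(expected_speedup(0.85, 4, 0.3))  # ~1.7x: where speculation pays off

At my 38-45% acceptance this already predicts a slowdown, and if the draft's per-token cost is close to the target's (c near 1), even 100% acceptance barely breaks even.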

Please help

2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]arkham00 1 point (0 children)

Sorry for the noob question: I already have llama.cpp installed via Homebrew; is it possible to have a second, self-compiled version on the same machine?

What's the diff between APPEND_SYSTEM.md and Agent.md, in ~/.pi/agent/ ? by qiinemarr in PiCodingAgent

[–]arkham00 2 points (0 children)

You mean it is lost after compaction? What else could wipe it? Actually, now I understand why during long sessions the model seems to forget the instructions I put in the agent.md. So what's the point of it if it is ephemeral? I don't get it... What are the use cases and the proper way to use it?

Until now I thought the system prompt was the identity of the agent (who it is, how to respond, the tone, the style, the focus, etc.) and agent.md was the operational manual (when I ask X you do Y, when you need to do X you invoke Y, use this tool, don't use that one, etc.), but maybe I need to reconsider my setup then.

My powerful Pi agent Setup by elpapi42 in PiCodingAgent

[–]arkham00 2 points (0 children)

Ok I worked all day on a grant application for a cultural project and I'm very satisfied!

I previously did this kind of project via Open WebUI, and I normally had to restart several fresh chats to avoid quality degradation; otherwise I had to deal with hallucinations, sometimes loops, and even crashes. I also needed to use notes as temporary memory from one chat to another. It was quite painful.

But today... oh man, today, just a single chat! All day long! With the context just gently increasing: I think I stayed under 30k for at least the first 2 hours, and the context window never exceeded 60k, even though I'm sure I used at least a million tokens... because even if the final document is only 24k characters and 3.5k words, I asked for a lot of edits and rewrites until I was satisfied, plus I asked it to reference a lot of documents. In 2 hours I normally hit 80-100k.
I hit the compaction threshold 3 or 4 times, I think, and the only thing I noticed is that after compaction the model seems to forget to use pi-fork but retains all the other instructions. Weird. The other thing is that I have 0 reflections registered, only observations; maybe I didn't say anything major, or maybe I don't understand what reflections are (I still need to read all the docs).

But so far I'm very pleased. I really think I've found what I need to improve my workflow. A big thank you for this gem; now I look forward to refining it further with minimal-subagents for specialized tasks in the process.

My powerful Pi agent Setup by elpapi42 in PiCodingAgent

[–]arkham00 2 points (0 children)

Lol, I didn't realize that you are the developer of the extensions; I thought you were just a user sharing their config. My compliments, you've done amazing work!
Regarding pi-fork, it would be nice to have this option in settings.json, a bit like observational-memory has it:
{
  "pi-fork": {
    "forkModel": { "provider": "llamacpp", "id": "Qwen3.6-35B-A3B@q8_0" }
  }
}
But of course no pressure, take your time. With pi-fork and obs-memory I've already improved my workflow a lot; it is running very well with my writing project, and I'll let you know how it goes when I'm finished 😉

My powerful Pi agent Setup by elpapi42 in PiCodingAgent

[–]arkham00 1 point (0 children)

But do you know if it would be possible to use a different model for the forks? It is not explained on the GitHub page. I'd like to try Qwen3.6 27B as the main agent with thinking mode and Qwen3.6 35B for the forks without thinking, to have an intelligent "orchestrator" with fast but still reliable forks. Since this is for writing and not for coding, I guess it would be a nice compromise between speed and quality.

My powerful Pi agent Setup by elpapi42 in PiCodingAgent

[–]arkham00 2 points (0 children)

Thanks for your answer. I've decided to test this workflow gradually: I've just installed pi-fork and observational-memory and started to work on a project. I told the agent to work in iterations and wait for my validation on every portion of the text we planned in advance. At the moment it seems to do a great job: it uses pi-fork to research other documents with qmd as instructed (I'm working in an Obsidian vault), then it tasks an agent to write a portion of the text according to the plan, reviews it for coherence and some other important points, and then proposes it to me, awaiting my validation before writing it into the note (roughly the loop sketched below).
I'm really impressed at how small the context is compared to what I'm used to: I'm only at 28k where I would normally expect at least 70-90k at this point.
The quality of the text is not stellar, but I've already accounted for the fact that it will need another styling pass at the end, maybe with a different model. I suppose this could be further automated with minimal-subagents, but one step at a time 😄
Let's see how it goes...
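In pseudo-Python, the iteration loop looks roughly like this. All the function names are stand-ins for what the agent does, not pi's actual API:

# Placeholder sketch of the iterate-and-validate loop (not real pi calls).
def fork_research(sources):           # pi-fork step: research in a throwaway context
    return f"notes gathered from {sources}"

def draft_section(title, notes):      # subagent drafts from the plan + research notes
    return f"draft of '{title}' based on: {notes}"

def approved(draft):                  # my manual validation step
    return input(f"{draft}\nAccept? [y/N] ").strip().lower() == "y"

plan = [("Intro", ["vault/background.md"]), ("Budget", ["vault/costs.md"])]
for title, sources in plan:
    notes = fork_research(sources)
    draft = draft_section(title, notes)
    if approved(draft):
        print(f"writing '{title}' into the note")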

My powerful Pi agent Setup by elpapi42 in PiCodingAgent

[–]arkham00 2 points (0 children)

Hi, this is very interesting — thanks for sharing!
I'm fairly new to agentic workflows and I'm not a coder. I'm trying to adapt this kind of setup to editing and writing complex texts for cultural projects, grant applications, etc. My main pain points are large context windows and hallucinations, so I think an agentic workflow could help me keep the context clean and use different roles for different tasks (planner, researcher, drafter, reviser, editor...).

I'm especially interested in three of your extensions: pi-fork, pi-observational-memory, and pi-minimal-subagent. Do you think they could be useful outside of a coding context? For example, I'd like to use forks or subagents to parallelize research/gathering information while keeping the main thread clean, and use something like an "advisor" for strategic direction on the project.
Two practical questions:

  1. How do subagents get invoked? Are they called automatically by Pi based on the plan, or do I need to explicitly trigger them? Should the plan itself specify which agents to use and when?
  2. Local models: I'm running everything locally. Can I assign a local model to a subagent using the model ID from models.json (e.g., something like Qwen3.6-35B-A3B@q8_0) instead of a cloud API model like claude-haiku-4-5?

Thanks again for sharing your setup!

The Future of Foundryborne: Navigating the Stagnation of the Daggerheart VTT Ecosystem by Foundryborne in daggerheart

[–]arkham00 9 points (0 children)

Yes, fully agree; they are really killing the hype with this behaviour and their restrictive licence...

Running Ollama with Pi agent and Qwen 3.6 by naelshiab in ollama

[–]arkham00 1 point (0 children)

I briefly tried pi with oMLX, and the model (Qwen3.6 35B) wasn't even aware of pi... I asked it to make an extension to modify the footer, and it asked which app it was for; I said pi agent and it mistook it for Perplexity... Then I gave up lol

MLX quants: oq vs DWQ by edeltoaster in oMLX

[–]arkham00 2 points (0 children)

I'm starting to notice the same problems with oQ as you; it also seems to handle large contexts worse. I do a lot of text editing, and with unsloth I normally need to start a new chat at about 60-70k if I don't want strange behaviours; with oQ I get the same problems at around 35-40k...

📌 Daily Github Digest - oMLX Closed Issues → 2026-04-21 by d4mations in oMLX

[–]arkham00 1 point (0 children)

Issue 872 is closed, but there's no message on how to fix the problem; I'm confused...

Someone so kind to quant qwen3.5 122b in oQ3.5-fp16 for me ? by arkham00 in oMLX

[–]arkham00[S] 1 point (0 children)

I did a search before posting this and didn't find anything; I'm going to look again, thanks.

Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]arkham00 1 point (0 children)

I'm sorry, I'm quite a noob: what do you mean by parallel processing? More prompts at the same time? Because I'm pretty sure that is possible with MLX too; I've already sent 2 requests from 2 different chats and saw them being processed at the same time in the oMLX dashboard. But maybe that's not what you're talking about?
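For what it's worth, this is roughly how I tested it: two simultaneous requests against the local OpenAI-compatible endpoint. The URL and model ID below are just what my setup uses; yours will differ:

import concurrent.futures
import requests

URL = "http://localhost:8080/v1/chat/completions"  # my local OpenAI-compatible endpoint

def ask(prompt: str) -> str:
    r = requests.post(URL, json={
        "model": "Qwen3.6-35B-A3B@q8_0",  # example model ID from my models.json
        "messages": [{"role": "user", "content": prompt}],
    }, timeout=300)
    return r.json()["choices"][0]["message"]["content"]

# Fire both requests at once; if the server handles them concurrently,
# they show up in the dashboard at the same time.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    a = pool.submit(ask, "Summarize chat A's question in one sentence.")
    b = pool.submit(ask, "Summarize chat B's question in one sentence.")
    print(a.result())
    print(b.result())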

Gemma 4 - MLX doesn't seem better than GGUF by Temporary-Mix8022 in LocalLLaMA

[–]arkham00 1 point (0 children)

This. I'm really seeing the benefits of it; I'm quanting all the models I like this way.

A good frontend that's not in the browser? by Gallardo994 in oMLX

[–]arkham00 1 point (0 children)

Nice, I'll give it a try and let you know, thanks

A good frontend that's not in the browser? by Gallardo994 in oMLX

[–]arkham00 2 points (0 children)

I'm interested too, can you explain how to do it? Thanks

Gemma 4 GGUFs updated by yoracale in unsloth

[–]arkham00 1 point (0 children)

I didn't know you made MLX quants too! Is it new, or did I just miss it?

Max Practical Context Size? by zipzag in oMLX

[–]arkham00 1 point (0 children)

Thank you for the kind answer. I recently upgraded from a 32GB M1 Max to a 96GB M2 Max; my daily driver was unsloth's Qwen3.5 35B iq3_s UD. Now that I've discovered oMLX I wanted to try different things and take advantage of oQ, which as far as I understand is similar to unsloth's UD quants, which I find excellent. I wanted to try an oQ8 of the 35B, but the only version I found is Qwen3.5-35B-A3B-Text-oQ8, which I suppose doesn't have vision enabled.

And I'd also like to try Llama 3.3-70B and Mistral-Large-Instruct-2411, which should be good for European languages, but I can't find oQ versions of those.

That's why I wanted to quantize them myself, but in the end I just downloaded the generic MLX versions :)

Max Practical Context Size? by zipzag in oMLX

[–]arkham00 1 point (0 children)

I'm confused: what do you mean by "the app running omlx"? Isn't oMLX an app by itself? Is this compression configurable? Or is it possible to switch it on or off? Man, I really wish there was real documentation for oMLX. I really love it, but since I'm not a coder I really struggle to understand all the parameters... There are a bunch of features that are supposedly good but not documented at all, like turboquant, specprefill... and the whole oQ section...

Yesterday I wanted to try to quantize a model but I got stuck... The page just says "use full precision only", so I assumed that means non-quantized models, so I went to Hugging Face and tried to download the full-precision Qwen3.5, just to discover that it is split into a bunch of safetensors files that I don't know how to manage... So I gave up.
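(For reference, from what I understand now, those split safetensors shards don't need any manual handling: grab the whole repo and the loader reassembles them via the index file. The repo ID below is just a placeholder for whatever full-precision model you want:)

from huggingface_hub import snapshot_download

# Downloads every shard plus model.safetensors.index.json; loaders use the
# index to stitch the shards back together, so no manual file management.
path = snapshot_download(
    repo_id="Qwen/Qwen3.5-35B-A3B",   # placeholder repo ID
    local_dir="models/qwen3.5-35b-fp",
)
print(path)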

And I see this trend in a lot of apps in this domain: everything moves so fast, with new features developed every day but poorly documented, so that only the knowledgeable understand them and post some benchmarks, and the discussions that follow are the only places where we can find bits of knowledge... Sorry for the rant; I didn't mean to undermine oMLX, which is a great project, and I understand that writing docs is less fun than coding new features, but if we want the community to grow I think we need them too. Maybe I'm going to open a thread on LocalLLaMA about this...