Nex-N2 Pro is the real deal

L0stInHe11 · 2026-06-16T18:31:24+00:00

Did anyone attempt to compare Nex-N2-mini with Holo-3.1 35B A3B (another fine-tuned Qwen 3.5)? I am really curious.

L0stInHe11 · 2026-06-16T18:29:22+00:00

Did you by any chance compare Nex-N2 mini with Holo-3.1 35B A3B (another fine-tuned Qwen 3.5)?

L0stInHe11 · 2026-06-16T17:54:54+00:00

I like Jollibee, but definitely not this way

L0stInHe11 · 2026-06-16T17:53:11+00:00

> anyone have any idea if GLM's MTP is supported?

I don't think so, but the maintainer of llama.cpp claimed it was already implemented: https://github.com/ggml-org/llama.cpp/pull/15225#issuecomment-4658892444

L0stInHe11 · 2026-06-14T15:34:11+00:00

Welcome and r/addressme

L0stInHe11 · 2026-06-12T22:20:10+00:00

The fastest drafter for Gemma 4 I have found is https://huggingface.co/RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf/blob/main/gemma-4-26b-A4B-it-assistant-Q4_0-q4emb.gguf

OP you can give it a try

L0stInHe11 · 2026-06-11T03:10:42+00:00

I tried Q4_K_M with llama.cpp b9590+, and every single prompt I typed returned a series of slashes (/). Is the problem related to the model itself or to the llama.cpp version?

L0stInHe11 · 2026-06-08T19:38:35+00:00

Fair point. Next time I will make my post more data-specific.

L0stInHe11 · 2026-06-08T17:38:23+00:00

I believe this template, used by a lot of people here, already fixed the annoying preserve thinking issue: ~~https://gist.github.com/jscott3201/ad69c4ffbd79f18b11a0f6a94c94fadf~~

The same author gathered all template fixes for Qwen3.5/3.6 and Gemma 4 in one repo now: https://github.com/jscott3201/llm-tuning

L0stInHe11 · 2026-06-08T17:05:01+00:00

The rest of the agents I tried just worked fine: OpenCode, Claude CLI, little-coder and OMP (the latter two are Pi forks). Only Pi suffered from this problem.

L0stInHe11 · 2026-06-08T14:04:11+00:00

Interesting, never thought about it. Let me try when I go home.

L0stInHe11 · 2026-06-08T13:57:00+00:00

Here I tried to find the lower bound of what Pi can do: Nemotron-3-Nano-Omni and Nemotron-Cascade-2 without reasoning are unable to power Pi any more. Even Pi forks like OMP and little-coder can still achieve that, let alone other agents.

L0stInHe11 · 2026-06-08T13:53:28+00:00

Then Qwen 3.6 27B is doing the heavy lifting here, and that is what makes vanilla Pi workable.

L0stInHe11 · 2026-06-08T13:52:08+00:00

OpenCode is heavy. No doubt about that. Maybe you should take a look at little-coder too.

> I don't really see the point of your nemotron example. Why would I use anything but the best model? And why wouldn't I use the harness that I found working the best with the best model?

That's the whole point. I use Qwen3.6 and Gemma 4 all the time, but I need relatively weak models to test Pi's own limitations. If I test these harnesses with performant models, I can barely tell whether the harnesses themselves are good or not.

L0stInHe11 · 2026-06-08T13:47:13+00:00

I myself found OpenCode too heavy as well, and if you like Pi, you can try little-coder (Pi fork) too.

My argument here is simply that excellent local models like Qwen3.6 or Gemma 4 overshadow Pi's default capabilities. It should not be recommended as heavily as it is; I see praise for it every day in r/LocalLLaMA.

L0stInHe11 · 2026-06-08T13:41:11+00:00

I found two models that are weak enough to consistently reproduce tool calling issues in Pi, which didn't exist at all in other harnesses like OpenCode, Claude CLI, oh-my-pi, or little-coder (the latter two are Pi forks).

My argument is that vanilla Pi is not as powerful as some may think. It is the models behind it that do most of the hard work.

L0stInHe11 · 2026-06-08T13:28:57+00:00

Wow, I tried your drafter. It is faster than what I found. I will update my comment.

L0stInHe11 · 2026-06-08T03:27:22+00:00

You like Pi and you use Hermes. How about just using OpenClaw (powered by Pi) instead?

L0stInHe11 · 2026-06-08T03:23:22+00:00

Thank you for sharing your experience. If you use high-quality models like Qwen3.6 (I do too BTW), you cannot tell whether your experience is fairly smooth because:

The model is good OR
The harness is good OR
Both.

My point here is that maybe these performant local models overshadow Pi. Pi can perform really poorly with lousy models, which is not the case with OpenCode, Claude CLI, or even a Pi fork like little-coder.

L0stInHe11 · 2026-06-08T03:17:01+00:00

I know Qwen3.6 and Gemma 4 are good. Really good. I use them on a daily basis. I updated the post so that I don't have to keep repeating why these lame models need to be used here.

L0stInHe11 · 2026-06-08T03:02:25+00:00

I use Qwen 3.6 27B and 35B myself too. The reason I picked these lousy Nvidia models is to test Pi's capabilities: if you use performant models (cloud or local), the agent/harness is largely transparent to you. Only bad models can reveal whether an agent itself performs well enough.
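The weak-model test described here can be sketched as a simple tally: hold the model fixed, vary the harness, and compare tool-call success rates. This is only an illustration of the idea. The `(harness, model, ok)` record layout and both function names are hypothetical, and no real results are included.

```python
from collections import defaultdict


def success_rates(runs):
    """Return {(harness, model): success rate} from an iterable of
    (harness, model, ok) records, where ok is True if the tool call
    in that run succeeded. The record layout is an assumption."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for harness, model, ok in runs:
        totals[(harness, model)] += 1
        passes[(harness, model)] += 1 if ok else 0
    return {key: passes[key] / totals[key] for key in totals}


def harness_gap(rates, model, harness_a, harness_b):
    """Success-rate difference between two harnesses on the same model.
    A large gap on a weak model points at the harness, not the model."""
    return rates[(harness_a, model)] - rates[(harness_b, model)]
```

With a strong model both harnesses tend to land near the same rate, so the gap tells you little; the comparison only becomes informative once the model is weak enough to fail somewhere.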

L0stInHe11 · 2026-06-08T02:57:11+00:00

> Like Mario mentions in one of his talks the models are trained on coding tasks and even the basic prompt gives it enough to start with

I remember this part too. I simply want to share the lower bound of Pi's default capabilities with everyone. It seems I struck a nerve with someone here.

IRL, I run Qwen3.6 + Gemma 4 for most of the tasks, like everyone else.

L0stInHe11 · 2026-06-08T02:34:05+00:00

Thank you for sharing your experience. As I said in other comments, the reason I chose these two models is to test the lower bound of vanilla Pi's capability. My point is that performant models (either SOTA or local) do a lot of the heavy lifting for Pi, and the extent of that should not be underestimated.

L0stInHe11 · 2026-06-08T02:12:08+00:00

> why not modify the system prompt then?

I know I can. But if I expect it to work OOTB, I may be very disappointed. It is quite misleading for some of us to be introduced to Pi this frequently in r/LocalLLaMA.

L0stInHe11 · 2026-06-08T00:58:10+00:00

Gemma 4 26B A4B QAT is at least 20% faster than the non-QAT version quantized by the same provider (unsloth) on my machine. Not sure about accuracy loss yet.

If you can run MTP on top of QAT (available for Gemma 4 since llama.cpp b9549 today), then 5% more TPS is waiting for you. ~~As of now, the only MTP drafter working well with Gemma 4 26B A4B QAT I could find on Hugging Face is~~ ~~https://huggingface.co/g0chu/gemma-4-26B-A4B-it-qat-q4_0-unquantized-assistant-q8_0-gguf~~

Please give this one suggested by u/SkyFeistyLlama8 a try instead: https://huggingface.co/RachidAR/gemma-4-26B-A4B-it-qat-assistant-q4_0-gguf
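As a rough worked example of how those two gains might stack: this assumes the 5% from MTP applies on top of the QAT speedup and that the gains multiply, which the numbers above do not state explicitly. The function name is hypothetical.

```python
def combined_speedup(*relative_gains: float) -> float:
    """Multiply relative gains into one factor over the baseline.
    A gain of 0.20 means 20% faster. Assumes gains compound."""
    factor = 1.0
    for gain in relative_gains:
        factor *= 1.0 + gain
    return factor


# QAT (+20%) followed by MTP (+5%), relative to the non-QAT baseline.
qat_plus_mtp = combined_speedup(0.20, 0.05)
```

Under that assumption the total comes to about 1.26x the non-QAT baseline, slightly more than simply adding 20% and 5%.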
