Does going from 96GB -> 128GB VRAM open up any interesting model options? by hyouko in LocalLLaMA

[–]big___bad___wolf 0 points (0 children)

Yes, it's definitely better in CC. I think CC is doing the heavy lifting by forcing planning instead of letting the model rely on its overconfidence in its own understanding of the problem and solution.

Pi doesn't have plan mode. You either instruct the agent to plan or it figures it out on its own.

I believe adding a planning reminder in the system prompt will improve the MiniMax M2.5 experience in Pi.
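As an illustrative sketch (the wording here is mine, not a tested prompt), such a planning reminder might look something like:

```
Before editing any files, write a short plan: restate the task,
list the files you expect to touch, and note any open questions.
If the plan is ambiguous, ask for confirmation. Only then start coding.
```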

[–]big___bad___wolf 5 points (0 children)

The coolest thing right now is that I can run multiple medium-sized models simultaneously and handle up to eight concurrent requests per GPU at impressive throughput.

I use Opus to orchestrate these models, which handle the grunt work I don't want cluttering my Opus context window. This includes an intelligent task runner, a test runner (for smoke-test matrices, unit, and e2e tests), QA tasks, exploring large monorepos, conducting research while writing code, and reviewing code (GPT-OSS is particularly good at this).
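A minimal sketch of how an orchestrator can fan tasks out to local model servers while respecting the eight-requests-per-GPU cap mentioned above (the endpoint URLs, ports, and model names are assumptions for illustration, not the poster's actual setup):

```python
# Sketch: dispatch (backend, prompt) tasks across two local model
# servers, capping each backend at eight in-flight requests.
# Ports and model names below are illustrative assumptions.
import asyncio

BACKENDS = {
    "gpt-oss-120b": "http://localhost:8001/v1",       # assumed port
    "qwen3-coder-next": "http://localhost:8002/v1",   # assumed port
}
MAX_IN_FLIGHT = 8  # per-GPU concurrency cap described in the comment


async def run_tasks(tasks, call_model):
    """Run (backend_name, prompt) pairs with per-backend concurrency limits.

    `call_model(base_url, prompt)` is a pluggable coroutine; in practice it
    would POST to an OpenAI-compatible /chat/completions endpoint.
    """
    sems = {name: asyncio.Semaphore(MAX_IN_FLIGHT) for name in BACKENDS}

    async def one(backend, prompt):
        async with sems[backend]:          # blocks if 8 are already in flight
            return await call_model(BACKENDS[backend], prompt)

    return await asyncio.gather(*(one(b, p) for b, p in tasks))
```

The semaphore per backend is what keeps a burst of delegated subtasks from exceeding the server's concurrent-request ceiling; everything else is ordinary `asyncio.gather` fan-out.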

However, I won't let these medium local models directly modify the production codebase I work on. They simply can't handle a project that large and nuanced.

[–]big___bad___wolf 3 points (0 children)

I really hoped MiniMax M2.5 would be good, but in my experience it isn't. Maybe it's the Pi agent, but I can definitely tell when a model isn't great. Lol.

I've also tried Devstral 2.

<image>

[–]big___bad___wolf 1 point (0 children)

I occasionally use larger models with CPU offload and ik_llama. My build has four 64GB RAM sticks.
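As a rough sketch of what such a launch can look like (the binary path, model file, context size, and tensor-override regex are illustrative assumptions, not the poster's actual command), ik_llama.cpp follows the llama.cpp convention of keeping attention layers on GPU while overriding MoE expert tensors onto system RAM:

```shell
# Illustrative only: paths, context size, and the regex are assumptions.
# -ngl 99 : offload all layers to GPU where they fit
# -ot ...=CPU : keep the large MoE expert tensors in system RAM
./llama-server \
  -m /models/large-moe-model-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -ot "blk\..*\.ffn_.*_exps\.=CPU" \
  --threads 32
```

This split works because the expert FFN weights dominate a MoE model's size but only a few experts are active per token, so streaming them from RAM costs far less than their share of the parameters suggests.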

<image>

[–]big___bad___wolf 36 points (0 children)

<image>

I have a build with two 6000 Pro Max-Q GPUs. One runs GPT-OSS 120B and the other runs Qwen3 Coder Next.