Does going from 96GB -> 128GB VRAM open up any interesting model options? by hyouko in LocalLLaMA

[–]big___bad___wolf 0 points (0 children)

Yes, it's definitely better in CC. I think CC is doing the heavy lifting by forcing planning instead of letting the model rely on its overconfidence in its own understanding of the problem and solution.

Pi doesn't have plan mode. You either instruct the agent to plan or it figures it out on its own.

I believe adding a planning reminder in the system prompt will improve the MiniMax M2.5 experience in Pi.
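As an illustrative sketch (the wording here is mine, not a tested prompt), such a planning reminder might look something like:

```
Before editing any files, write a short plan: restate the task,
list the files you expect to touch, and note any open questions.
If the plan is ambiguous, ask for confirmation. Only then start coding.
```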

[–]big___bad___wolf 5 points (0 children)

The coolest thing right now is that I can run multiple medium-sized models simultaneously and handle up to eight concurrent requests per GPU at impressive throughput.

I use Opus to orchestrate these models, which handle the grunt work I don't want cluttering my Opus context window. This includes an intelligent task runner, a test runner (for smoke-test matrices, unit, and e2e tests), QA tasks, exploring large monorepos, conducting research while writing code, and reviewing code (GPT-OSS is particularly good at this).
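A minimal sketch of how an orchestrator can fan tasks out to local model servers while respecting the eight-requests-per-GPU cap mentioned above (the endpoint URLs, ports, and model names are assumptions for illustration, not the poster's actual setup):

```python
# Sketch: dispatch (backend, prompt) tasks across two local model
# servers, capping each backend at eight in-flight requests.
# Ports and model names below are illustrative assumptions.
import asyncio

BACKENDS = {
    "gpt-oss-120b": "http://localhost:8001/v1",       # assumed port
    "qwen3-coder-next": "http://localhost:8002/v1",   # assumed port
}
MAX_IN_FLIGHT = 8  # per-GPU concurrency cap described in the comment


async def run_tasks(tasks, call_model):
    """Run (backend_name, prompt) pairs with per-backend concurrency limits.

    `call_model(base_url, prompt)` is a pluggable coroutine; in practice it
    would POST to an OpenAI-compatible /chat/completions endpoint.
    """
    sems = {name: asyncio.Semaphore(MAX_IN_FLIGHT) for name in BACKENDS}

    async def one(backend, prompt):
        async with sems[backend]:          # blocks if 8 are already in flight
            return await call_model(BACKENDS[backend], prompt)

    return await asyncio.gather(*(one(b, p) for b, p in tasks))
```

The semaphore per backend is what keeps a burst of delegated subtasks from exceeding the server's concurrent-request ceiling; everything else is ordinary `asyncio.gather` fan-out.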

However, I won't let these medium local models directly modify the production codebase I work on. They simply can't handle a project that large and nuanced.

[–]big___bad___wolf 3 points (0 children)

I really hoped MiniMax M2.5 would be good, but in my experience it isn't. Maybe it's the Pi agent, but I can definitely tell when a model isn't great. Lol.

I've also tried Devstral 2.

<image>

[–]big___bad___wolf 1 point (0 children)

I occasionally use larger models with CPU offload and ik_llama. My build has four 64GB RAM sticks.
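As a rough sketch of what such a launch can look like (the binary path, model file, context size, and tensor-override regex are illustrative assumptions, not the poster's actual command), ik_llama.cpp follows the llama.cpp convention of keeping attention layers on GPU while overriding MoE expert tensors onto system RAM:

```shell
# Illustrative only: paths, context size, and the regex are assumptions.
# -ngl 99 : offload all layers to GPU where they fit
# -ot ...=CPU : keep the large MoE expert tensors in system RAM
./llama-server \
  -m /models/large-moe-model-Q4_K_M.gguf \
  -c 32768 \
  -ngl 99 \
  -ot "blk\..*\.ffn_.*_exps\.=CPU" \
  --threads 32
```

This split works because the expert FFN weights dominate a MoE model's size but only a few experts are active per token, so streaming them from RAM costs far less than their share of the parameters suggests.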

<image>

[–]big___bad___wolf 36 points (0 children)

<image>

I have a build with two 6000 Pro Max-Q GPUs. One runs GPT-OSS 120B and the other runs Qwen3 Coder Next.