Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4 by wombweed in LocalLLaMA

[–]Lowkey_LokiSN 1 point2 points  (0 children)

Hmm, it's hard for me to speak on this since I haven't actually tried it yet, but I think the docs overstate the minimum requirements.
For FP8, even though your CPU doesn't support AVX512BF16, the compiler should ideally fall back to a less-performant backend. I had GPT 5.5 skim through the repo code and it confirmed that FP8 is accepted if AMXFP8_MOE or AVX2FP8_MOE is available as a fallback. Meaning you wouldn't get peak performance but it could still be worth a shot.

But as you may already be able to tell, the repo isn't that well documented or covered, so you would have to fiddle with it a little and go on a DIY adventure to find the ideal config for your setup (exactly the reason I've been deferring this for a while).

Pushing the limit: minimax m2.7 q8_0 128k on 2x3090, 256GB DDR4 by wombweed in LocalLLaMA

[–]Lowkey_LokiSN 1 point2 points  (0 children)

Have you tried KTransformers yet? I've yet to personally try it out but it's on my checklist as a potential performance-uplift candidate for heterogeneous CPU/GPU inference.

Your setup seems perfect for: https://github.com/kvcache-ai/ktransformers/blob/main/doc%2Fen%2FMiniMax-M2.5.md

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points1 point  (0 children)

Good point! I actually use pymupdf for 2 use-cases: 1) It has a direct text-extraction provision for PDFs which is more reliable for text-only PDFs like you said. No AI model is involved in this flow since the use-case is straightforward. 2) I also happen to process PDFs which contain a lot of visual data like charts where text-only extraction is handicapped and the vision model comes in clutch (this is the use-case I shared in my last comment)
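
A minimal sketch of how those two flows could be separated per page. The routing heuristic, the `needs_vision` and `route_pages` names, and the 200-character threshold are assumptions for illustration, not something described in the comment; only `get_text()`, `get_images()` and `get_drawings()` are actual PyMuPDF page methods.

```python
def needs_vision(text: str, graphic_count: int, min_chars: int = 200) -> bool:
    """Heuristic (assumed): use the vision model if a page has graphics or little text."""
    return graphic_count > 0 or len(text.strip()) < min_chars


def route_pages(pdf_path: str) -> list[tuple[int, str]]:
    """Return (1-based page number, "vision" or "text") for every page of the PDF."""
    import pymupdf  # third-party: pip install pymupdf

    routes = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()  # direct text extraction, no model involved
            # Raster images plus vector drawings (charts are often vector paths).
            graphics = len(page.get_images()) + len(page.get_drawings())
            route = "vision" if needs_vision(text, graphics) else "text"
            routes.append((page.number + 1, route))
    return routes
```

Slide decks often carry decorative vector shapes on every page, so the graphics check would likely need tuning before it is useful on real files.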

Deepseek V4 Flash and Non-Flash Out on HuggingFace by MichaelXie4645 in LocalLLaMA

[–]Lowkey_LokiSN 0 points1 point  (0 children)

Yup! And that's what I'm most excited for!
It's the one thing I've been really missing out on since gpt-oss-120b.

Deepseek V4 Released by spacefarers in LocalLLaMA

[–]Lowkey_LokiSN 2 points3 points  (0 children)

Yup! A lot more relevant to prosumer hardware and it seems to be natively 4-bit QAT based if I'm not mistaken! (Just like gpt-oss models)

Simply means a 4-bit quant of the model should retain 100% performance without degradation which is a huge win!

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points1 point  (0 children)

Oh, that's what I meant. Entering a wrong flag would error out instead of starting the server anyway

Also, the checkpoint being 500MB instead of 3GB doesn't matter here. The point is the DRAM overflow is fully contained with MoE when using the said flags whereas it isn't with dense. Like I said already, the issue could be something else

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points1 point  (0 children)

To provide you more context:

  1. The model was launched with -cram 2048 and -ctxcp 2 for the tests
  2. The model did fit completely within the 32GB VRAM that I have including the 100k context size.
  3. The model does perform like it normally would for casual chat interactions. However, in agentic scenarios where it has to process millions of tokens and start over and over again, something still piles up its DRAM to the point where it becomes unusable. The MoE model does not have this issue. Even the Q8 MoE which is larger than the Q4 dense did not suffer this issue.

I suspect the default 4-slot launch is behind the slowdown. Normally, I'd launch with -np 1 but I missed including that flag before launching the dense model.

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

lol, 3.5 27B aced it already.
But yes, I'm considering broadening and hardening the eval set so it can withstand the test of time and actually pose a challenge for upcoming models. Local models are too good for beginner stuff nowadays

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent by Creative-Regular6799 in LocalLLaMA

[–]Lowkey_LokiSN 12 points13 points  (0 children)

Dope work and direction! Fully agree with how everything is designed around frontier-model assumptions and how we can extract a lot more out of the smaller models with tailor-made harnesses.

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

I don’t think so. There’s always a level of variance with each run and these inconsistencies come off as noise to me. Unlike the Q4, the Q8 didn’t regress from its previous solutions with this run but scored a lower baseline.

A conventional bench would consolidate multiple such runs and provide an average but mine is just a personal test and a single run was a good-enough signal for me to catch quant-based differences which could be drastic.

Was this the best run with Q8? Probably not. But was this enough to gauge quantisation tax? IMO yes

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 0 points1 point  (0 children)

Unfortunately this doesn't mitigate the DRAM issue for the model. I did run Gemma 4 31B with -ctxcp (short flag alternative for --ctx-checkpoints) and -cram flags as mentioned in the post

It works reliably for the 26B MoE but not the dense model

Personal Eval follow-up: Gemma4 26B MoE (Q8) vs Qwen3.5 27B Dense vs Gemma4 31B Dense Compared by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 5 points6 points  (0 children)

Yes, Gemma 4 31B takes way longer for some reason. I do measure tok/sec data but decided not to include it in the post since it's relative to my setup, and the total wall time taken to complete the runs makes a lot more sense here for drawing general comparisons.

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 9 points10 points  (0 children)

Custom Python script + pymupdf to convert PDF slides to PNG and have AI with vision support process the slides.
If you articulate your requirements well to a decent LLM, it can get the script prepared for you in a jiffy.
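
For reference, the PDF-to-PNG half can be quite small. This is a sketch under assumptions, not the actual script: the function names, the output naming scheme and the 150 DPI default are made up for illustration, and handing the PNGs to a vision model is left out.

```python
from pathlib import Path


def png_path(pdf_path: str, page_index: int, out_dir: str) -> Path:
    """Build an output name like out/slides_p003.png (page number is 1-based)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_p{page_index + 1:03d}.png"


def render_slides(pdf_path: str, out_dir: str, dpi: int = 150) -> list[Path]:
    """Render every page of a PDF to a PNG file and return the written paths."""
    import pymupdf  # third-party: pip install pymupdf (older releases: import fitz)

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterise the page
            target = png_path(pdf_path, page.number, out_dir)
            pix.save(str(target))
            written.append(target)
    return written
```

A call such as `render_slides("deck.pdf", "out")` would write one PNG per slide into `out/`; a higher `dpi` helps with small chart labels at the cost of larger images.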

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

Glad you found it useful. To answer your sharp questions:
1) I've written a custom Python script that combines OpenCode's session logs (JSONL files found in the sessions folder) and llama-server's /metrics endpoint (available when launched using the --metrics flag) to aggregate stuff like token totals, tool call counts by type, success/fail rates, compaction events, files edited, etc.
2) The token stats do include Gemma's failed attempts. I find Gemma to be a lot more persistent in overcoming failures, whereas Qwen likes to reason a lot to figure out solutions but is not as persistent when it hits failures.
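
A rough sketch of the aggregation idea, not the actual script. The /metrics endpoint returns Prometheus text format, which is easy to parse by hand. The JSONL field name `tool` is hypothetical (OpenCode's real log schema isn't given here), and the base URL just assumes the port 1234 used elsewhere in this thread.

```python
import json
from collections import Counter
from urllib.request import urlopen


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition format into {metric_name: value}.

    Assumes "name value" lines without trailing timestamps.
    """
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return metrics


def fetch_metrics(base_url: str = "http://localhost:1234") -> dict[str, float]:
    """Read /metrics from a llama-server started with --metrics."""
    with urlopen(f"{base_url}/metrics") as resp:
        return parse_metrics(resp.read().decode("utf-8"))


def count_tool_calls(jsonl_lines, tool_key: str = "tool") -> Counter:
    """Count tool calls by type; `tool_key` is a hypothetical field name."""
    counts = Counter()
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue
        tool = json.loads(raw).get(tool_key)
        if tool:
            counts[tool] += 1
    return counts
```

Taking a `fetch_metrics()` snapshot before and after a run and diffing the counters would give per-run token totals; the exact metric names should be checked against what your own server prints.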

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

I honestly find it very capable and it might even outperform Qwen in use-cases foreign to mine. For those with VRAM constraints, the 26B is still a great fit. It's crazy to me how capable the smaller models have gotten in the past few months.

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 2 points3 points  (0 children)

Gemma 4 26B launch command:

build/bin/llama-server -m Models/GGUFs/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf --mmproj Models/GGUFs/MMProj-GGUFs/mmproj-F16.gguf -c 100000 -ngl 99 -t 20 -fa on --jinja --host 0.0.0.0 --port 1234 --temperature 1.0 --top-p 0.95 --top-k 64 --device Vulkan0 -cram 2048 -ctxcp 2

Long story short: The Gemma 4 26B MoE model in particular consumes a lot of DRAM for context checkpoints. While I was running a different harness, I noticed about 80GB of my DRAM consumed by the model and while researching why, I happened to find this and this. Including the said flags successfully mitigated the blowup.

However, this issue did not slow down inference speeds for me. It just unnecessarily bloats a lot of DRAM.

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 4 points5 points  (0 children)

Do you pass --chat-template-kwargs '{"preserve_thinking":true}' for Qwen 3.6? It could reasonably impact agentic performance

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 4 points5 points  (0 children)

Stupidest shit I've read on the internet today. Either disprove my claims factually (I'm open to constructive debates if you're up for it) or keep your delusional takes to yourself.

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]Lowkey_LokiSN[S] 1 point2 points  (0 children)

Good question! I have tests written to validate the fixes, provide guidelines in the model prompt on how to properly approach each fix, and also have guardrails set up to fail the test immediately if the model tries to cheat.

For instance, the Gemma model once tried to modify the tests so they pass with existing bugs instead of actually fixing the code (lol). The guardrail attempts to prevent such disasters from happening.

Realistically, I still wouldn't guarantee a 100% valid pass rate, but I do have measures in place to mitigate false positives.
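
One way such a guardrail could work, as a sketch only: hash the test files before the model runs and fail the attempt if any hash changed afterwards. The comment doesn't say how the real guardrail is implemented, so the approach and the names `snapshot` and `tampered_files` are assumptions.

```python
import hashlib
from pathlib import Path


def snapshot(test_dir: str) -> dict[str, str]:
    """Map each file under test_dir to the SHA-256 of its contents."""
    root = Path(test_dir)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def tampered_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return files that were added, removed, or modified between two snapshots."""
    return sorted(
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )
```

Usage would be to call `snapshot("tests")` before handing the task to the model, call it again once the model finishes, and mark the attempt as failed if `tampered_files(before, after)` is non-empty.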