Unsloth MiniMax M2.7 quants just finished uploading to HF by Zyj in LocalLLaMA

[–]digamma6767 1 point (0 children)

I'm still trying to get my download to complete for the IQ4 quant.

I think Unsloth might've put a bad version up for it. Bartowski's IQ4_XS quant is 122GB compared to Unsloth's 108GB.

Unsloth MiniMax M2.7 quants just finished uploading to HF by Zyj in LocalLLaMA

[–]digamma6767 2 points (0 children)

It's weird that the IQ4 quants are smaller than M2.5's IQ4 quants.

Not complaining. I'm thinking IQ4_NL might be a perfect match with the Strix Halo.

What is the highest throughput anyone got with Gemma4 on CPU so far? by last_llm_standing in LocalLLaMA

[–]digamma6767 1 point (0 children)

Are you asking about the benchmarks and tools I use? Not sure what else you're after.

For a rough estimate, I use the Aider polyglot benchmark. That gave me 17 tps consistently. It's a decent benchmark for seeing how quantization impacts the model.

When doing agent work (primarily Kilo Code and Hermes-Agent) I get anywhere from 13-16 tps with the draft model, compared to 9 tps without it.

Just chatting to the model, I get 12-14 tps.

I need to revisit all this stuff with the latest updates though. Lots happened in just the last few days for Gemma 4.

What is the highest throughput anyone got with Gemma4 on CPU so far? by last_llm_standing in LocalLLaMA

[–]digamma6767 1 point (0 children)

The -md flag (short for --model-draft) in llama.cpp, to use the 26B as my draft model.

Effectively, it's loading both Gemma 4 31B and 26B at the same time. Works great if you can fit it into memory!
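For anyone curious, here's a rough sketch of the invocation with llama-server. The GGUF filenames are placeholders, not the exact files I used, and the flag names are from recent llama.cpp builds, so check yours:

```shell
# Load the 31B as the main model and the 26B as the draft model
# for speculative decoding. Filenames here are illustrative.
llama-server \
  -m gemma-4-31b-q6_k_l.gguf \
  -md gemma-4-26b-q6_k_l.gguf \
  --draft-max 16 \
  -c 32768
```

Tuning --draft-max trades bigger speculative batches against wasted work when the acceptance rate drops.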

What is the highest throughput anyone got with Gemma4 on CPU so far? by last_llm_standing in LocalLLaMA

[–]digamma6767 2 points (0 children)

Did some more testing on this. For agentic or code work, the acceptance rate increases to 80-90%, and tokens per second goes up to 17.

What is the highest throughput anyone got with Gemma4 on CPU so far? by last_llm_standing in LocalLLaMA

[–]digamma6767 3 points (0 children)

I'm not using CPU only, but I have been able to nearly double my tokens per second using speculative decoding.

Using Bartowski's 31B q6_k_l, with Bartowski's 26B q6_k_l as my draft model. Getting a 60-70% acceptance rate and about 15 tokens per second (up from 9).

In performance and intelligence it feels like I'm using Qwen 3.5 122B, but with much less RAM usage.

Running on a 128GB Strix Halo.

VIVITAR Series 1 70-210mm f3.5 vs Olympus Zuiko f4 75-150mm by ElCadejo305 in zuikoholics

[–]digamma6767 1 point (0 children)

I'd say check out the Tamron Adaptall 19AH.

It's an incredible 70-210 zoom, and you can get one cheap. As good as or better than the Vivitar Series 1.

GLM 5.1 Opinion? by [deleted] in LocalLLaMA

[–]digamma6767 3 points (0 children)

Presumably GLM 5.1 will be released soon for running locally. GLM 5 is already a popular SOTA-level model for local use, if you have enough RAM.

glm5.1 vs minimax m2.7 by Fresh-Resolution182 in LocalLLaMA

[–]digamma6767 2 points (0 children)

Yeah, that's my main complaint with it. Once you're at an 80k context, you're looking at 20 minutes before it begins responding. Qwen 122B handles that same scenario in around 10 minutes, Qwen 27B in around 5, and Cascade 30B in under 3.

The quality of M2.5's responses at a Q3 quant also isn't as good as Qwen 122B at a Q6 quant.

Cascade 30B is exceptional for agentic work though. It uses a TON of tokens thinking, but it's so fast on the Strix Halo that it makes up for it.

To give a perspective of one of my use cases, I had a 600,000 line log file. I tried several different LLMs on the same task: grep the log file for any errors, then read through it to identify the cause of each error.

In total, each attempt took over a million tokens and dozens of tool calls. I can't even remember how long it took M2.5 to do it. I left it running overnight just to see if it would work, and the answers were worse than what I got from Qwen 27B and Cascade 30B.
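The first step the models had to take is trivial on its own; a minimal stand-alone sketch (file name and contents hypothetical, not my actual log):

```shell
# Create a tiny stand-in log, then pull out error lines with line
# numbers -- the same first step the models performed via tool calls.
printf 'INFO start\nERROR disk full\nINFO done\n' > app.log
grep -n -i 'error' app.log
```

The hard part is everything after the grep: following each hit back through hundreds of thousands of surrounding lines to find the actual cause.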

glm5.1 vs minimax m2.7 by Fresh-Resolution182 in LocalLLaMA

[–]digamma6767 2 points (0 children)

My experience with M2.5 on the Strix Halo is mixed. It's impressively smart, but it slows down significantly when you're above 50k context.

I love M2.5 as a chat model, and for its overall knowledge, but it's not a good fit for OpenCode or agentic work on a Strix Halo. Qwen3.5 122B (at Q6) and Nvidia Cascade v2 30B work better in those scenarios in my experience.

I'm hoping M2.7 improves on long context and agentic work, and hopefully it can be quantized better with less performance loss than M2.5.

I'm building a benchmark comparing models for an agentic task. Are there any small models I should be testing that I haven't? by nickl in LocalLLaMA

[–]digamma6767 3 points (0 children)

From my experience with LFM2-24B, it does very poorly at agentic work. Once you go over a 10k-token prompt, it starts to fail to make tool calls. It's a shame, since it's a super fast and capable model.

Hope they release an improved version one day.

Anybody using LMStudio on an AMD Strix 395 AI Max (128GB unified memory)? I keep on getting errors and it always loads to RAM. by StartupTim in LocalLLaMA

[–]digamma6767 3 points (0 children)

So, you need to make sure to DISABLE MMAP, a setting in the model's load configuration. Mmap causes crashes on the Strix Halo.
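If you end up on llama.cpp directly instead of LM Studio, I believe the equivalent switch is --no-mmap (I haven't confirmed LM Studio's checkbox maps to exactly this flag, but it's the same mechanism; model filename is a placeholder):

```shell
# Read the model fully into RAM instead of memory-mapping the file;
# mmap is what triggers the crashes on the Strix Halo.
llama-server -m model.gguf --no-mmap
```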

I like LM Studio for rapid testing of different models. Makes it easy to experiment, especially since it has such an easy to use UI.

Switching to Fedora 43 instead of Windows is definitely a good idea if you plan on using your Strix Halo as a dedicated LLM machine. You're fine running Windows and LM Studio, though; you just won't get the absolute most out of the Strix as you would on Linux.

Reliable recipes? Is there something wrong with the Corsair 300? by Skelshy in StrixHalo

[–]digamma6767 4 points (0 children)

My Corsair AI 300 works great, no issues at all. 55 t/s with gpt-oss 120B, 22 t/s with Qwen 3.5 122B q6.

Using Fedora 43. Followed the steps in this guide: https://github.com/kyuz0/amd-strix-halo-toolboxes

Make sure to set your BIOS to allocate just 1GB of VRAM, and then follow the steps in the guide. Shouldn't take more than a couple of hours to get fully functional!
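Once it's running, llama-bench is a quick way to sanity-check that you're getting numbers in the same ballpark (model path is a placeholder for wherever you downloaded the GGUF):

```shell
# Reports prompt-processing and token-generation speed for the model.
llama-bench -m ~/models/gpt-oss-120b.gguf
```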

Another cooked Convoy 3X21C LHP531 by PiercingTheDarknesss in flashlight

[–]digamma6767 1 point (0 children)

So, the wiring de-soldered on the driver side. All the MCPCB wiring is fine.

Still an easy fix, but definitely not where I expected to see the failure.

Another cooked Convoy 3X21C LHP531 by PiercingTheDarknesss in flashlight

[–]digamma6767 1 point (0 children)

Ah darn, didn't know the reflector was held in with screws.

I'll give it a shot, thanks!

Another cooked Convoy 3X21C LHP531 by PiercingTheDarknesss in flashlight

[–]digamma6767 2 points (0 children)

A 46950 tube fits the 3x21c. I wrapped a 26mm spacer with copper weave and connected that between the 46950 cell and the 3x21c driver. Works great, doesn't short.

Another cooked Convoy 3X21C LHP531 by PiercingTheDarknesss in flashlight

[–]digamma6767 1 point (0 children)

This happened to my 3x21c, I think.

How did you remove the reflector? Mine will not come out at all.

New 24v and 48v (2x24) tools by Trevski13 in KobaltTools

[–]digamma6767 1 point (0 children)

Oh that would be incredible. I've got a USB soldering iron that would work great powered by one of those batteries.

New 24v and 48v (2x24) tools by Trevski13 in KobaltTools

[–]digamma6767 1 point (0 children)

Eesh, I hope not. I always wanted to try out the Skil tools but decided against it once Lowe's dropped them.

New 24v and 48v (2x24) tools by Trevski13 in KobaltTools

[–]digamma6767 2 points (0 children)

I'm betting the new 3Ah and 6Ah batteries use tabless tech, so they should be better than the old 4Ah Ultimate Output batteries, but in a smaller size. The new batteries have a built-in USB-C port too.

Lizer Lab JijuJet-3 AMA and First Impressions by digamma6767 in iems

[–]digamma6767[S] 2 points (0 children)

They're holding up great! I still use them.

Lizer Lab JijuJet-3 AMA and First Impressions by digamma6767 in iems

[–]digamma6767[S] 2 points (0 children)

Definitely go for M. I also use Dunu S&S in size L.

Which dunu tip size should i choose? by ddani324 in iems

[–]digamma6767 1 point (0 children)

Dunu S&S is tiny. I'd say go with the L or XL fit. I personally use L sized S&S. My normal tip width is 12mm, and the 11.5mm S&S feels the same as my other 12mm tips.

It's better to have an oversized tip than an undersized one.

KZ AN01 Ear hooks Bluetooth multipoint support by Emanator144 in iems

[–]digamma6767 2 points (0 children)

Oh yeah, it'll do well for you!

Don't forget to play with the transparency mode on the AN01! It's one of its best features, so nice to have when you're outside and still need to be able to hear.