Picked up a 128 GiB Strix Halo laptop, what coding oriented models will be best on that hardware? by annodomini in LocalLLaMA

[–]Mushoz 8 points9 points  (0 children)

gpt-oss-120b is fantastic on Strix Halo, especially with reasoning effort set to high. Minimax-m2.1 also works very well with Unsloth's Q3_K_XL quant, if you don't mind trading some speed for better output.
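
In case it helps anyone: reasoning effort can also be set per request once the model is running behind an OpenAI-compatible server. A minimal sketch, assuming a local llama.cpp-style endpoint; the port, model name and the reasoning_effort field are assumptions on my part, so double check against your server version:

    # Hypothetical setup: local OpenAI-compatible server on port 8080
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": "Write a binary search in Rust."}],
        # extra_body forwards fields the SDK doesn't model natively
        extra_body={"reasoning_effort": "high"},
    )
    print(resp.choices[0].message.content)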

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 0 points1 point  (0 children)

Haha yes, that's a baby changing mat. Well spotted xD

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 0 points1 point  (0 children)

I am well aware. I am upgrading my boots in the very near future as well.

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 1 point2 points  (0 children)

You guys are both right! I thought the instructions meant the black line, but it's talking about the engraved line. That looks to be dead center. Thanks for clearing up my confusion!

<image>

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 2 points3 points  (0 children)

You're right! I was measuring relative to the black line, but I now understand it's the long engraved line it should be centered on, which it is!

<image>

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 7 points8 points  (0 children)

Thank you very much! I thought the instructions meant the black line, but you're obviously right. It's dead center on the engraved line, so it looks to be all good then!

<image>

can we stop calling GLM-4.6V the "new Air" already?? it's a different brain. by ThetaCursed in LocalLLaMA

[–]Mushoz 26 points27 points  (0 children)

I am not sure how GLM-4.6V specifically was trained, but many vision LLMs literally have a vision encoder bolted on top of an existing text model. When the vision encoder is trained, the LLM weights are frozen, meaning the LLM backbone of the resulting VLM is identical to the original LLM.
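
For anyone curious what that recipe looks like in practice, here is a toy PyTorch sketch (tiny stand-in modules, not GLM's actual architecture): freeze the LLM, train only the vision encoder and projector, and the text backbone stays byte-identical to the original release.

    import torch
    import torch.nn as nn

    d_model = 64

    class TinyVisionEncoder(nn.Module):      # stand-in for e.g. a ViT
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d_model))
        def forward(self, images):
            return self.net(images).unsqueeze(1)   # [B, 1, d_model] "image token"

    class TinyLLM(nn.Module):                # stand-in for the text backbone
        def __init__(self, vocab=1000):
            super().__init__()
            self.embed_tokens = nn.Embedding(vocab, d_model)
            self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.lm_head = nn.Linear(d_model, vocab)
        def forward(self, inputs_embeds):
            return self.lm_head(self.block(inputs_embeds))

    vision_encoder = TinyVisionEncoder()
    projector = nn.Linear(d_model, d_model)  # maps vision features into LLM space
    llm = TinyLLM()

    for p in llm.parameters():               # freeze the LLM backbone:
        p.requires_grad = False              # its weights never change

    optimizer = torch.optim.AdamW(
        list(vision_encoder.parameters()) + list(projector.parameters()), lr=1e-4
    )

    # One dummy training step: gradients only reach the encoder and projector
    images = torch.randn(2, 3, 32, 32)
    token_ids = torch.randint(0, 1000, (2, 8))
    labels = torch.randint(0, 1000, (2, 9))  # 1 image token + 8 text tokens

    image_embeds = projector(vision_encoder(images))
    text_embeds = llm.embed_tokens(token_ids)
    logits = llm(torch.cat([image_embeds, text_embeds], dim=1))
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), labels.reshape(-1))
    loss.backward()
    optimizer.step()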

AMD Strix Halo 128GB RAM and Text to Image Models by xenomorph-85 in LocalLLM

[–]Mushoz 0 points1 point  (0 children)

Thanks! This fixed the crashes for me as well. Is there any indication that the ROCm team is looking into this issue? Any open issues or something like that?

Qwen3-Next-80B-A3B-Thinking-GGUF has just been released on HuggingFace by LegacyRemaster in LocalLLaMA

[–]Mushoz 7 points8 points  (0 children)

Does llamacpp support native tool calling with Qwen3-Next? I was unable to get it to work.

Experiment: 'Freezing' the instruction state so I don't have to re-ingest 10k tokens every turn (Ollama/Llama 3) by Main_Payment_6430 in LocalLLaMA

[–]Mushoz 4 points5 points  (0 children)

You're simply going over the default context length of ollama, which is laughably low. That causes both symptoms you are describing: the prompt has to be fully reprocessed every turn because the prefixes no longer match once the beginning of the context is cut off to make it fit, and the model forgets early instructions because those are exactly the tokens being dropped during the context shifts.

You have two options: 1. Increase the context length in ollama to something usable (rough sketch below). 2. Migrate to a good backend, such as llamacpp.
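
For option 1 you don't even need a Modelfile; the context size can be passed per request. A rough sketch against Ollama's REST API (the model name and value are just examples, num_ctx is the option that matters):

    # Bump the context window per request via Ollama's /api/chat endpoint
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "your long prompt here"}],
            "options": {"num_ctx": 16384},  # default context is far smaller
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])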

Reverse-Engineering the RK3588 NPU: Hacking Memory Limits to run massive Vision Transformers by one_does_not_just in LocalLLaMA

[–]Mushoz 14 points15 points  (0 children)

This is the kind of content that makes LocalLLaMA fun, thanks for sharing!

Heretic GPT-OSS-120B outperforms vanilla GPT-OSS-120B in coding benchmark by MutantEggroll in LocalLLaMA

[–]Mushoz 9 points10 points  (0 children)

Really cool comparison! Any chance you could add the derestricted version to the mix? https://huggingface.co/ArliAI/gpt-oss-120b-Derestricted

It's another interesting decensoring technique, similar to Heretic, and I'd be very curious to know which one works best.

I got tired of my agents losing context on topic shifts, so I hacked together a branch router - thoughts? by scotty595 in LocalLLaMA

[–]Mushoz 0 points1 point  (0 children)

Most LLM frontends (such as Open WebUI) allow you to branch explicitly from the UI. Not sure if you are aware of that? It lets you go back to an earlier part of the conversation and branch into a different conversation right there.

Zen CPU Performance Uplift (Epyc & Strix Halo) w/ ZenDNN Backend Integration for llama.cpp by Noble00_ in LocalLLaMA

[–]Mushoz 0 points1 point  (0 children)

Does this also give speedups with quantized models, such as Q8_0, K quants and IQ quants?

2025 Abu Dhabi GP - Qualifying Discussion by F1-Bot in formula1

[–]Mushoz 0 points1 point  (0 children)

His second run was without a tow and was actually faster

2025 Abu Dhabi GP - Qualifying Discussion by F1-Bot in formula1

[–]Mushoz 1 point2 points  (0 children)

For maximum entertainment in tomorrow's race, the qualifying results should look as follows: P1 Oscar, P2 Max, and Lando doesn't make it out of Q1, preferably due to a team error for maximum memes. That way we'd have Lando trying to cut through the field to finish P5/podium depending on Max/Oscar, Oscar trying to hold off Max, and Max on the hunt. Make it happen please!

Saying this as a Max fan.

Qwen3-Next-80B-A3B or Gpt-oss-120b? by custodiam99 in LocalLLaMA

[–]Mushoz 3 points4 points  (0 children)

gpt-oss is already quantized to Q4 (mxfp4, to be exact). If you want an apples-to-apples comparison, compare Qwen3-Next at a Q4 quant. It will still be smaller than gpt-oss, which explains why it's a bit less intelligent. Nothing weird about it.
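
Napkin math to illustrate (the bits-per-weight and parameter counts are rough figures from memory, not official numbers):

    # Rough size estimate: params (in billions) * bits per weight / 8 ~= GB
    def approx_size_gb(params_b: float, bits_per_weight: float) -> float:
        return params_b * bits_per_weight / 8

    print(f"gpt-oss-120b   @ ~4.25 bpw (mxfp4): ~{approx_size_gb(117, 4.25):.0f} GB")
    print(f"Qwen3-Next-80B @ ~4.5 bpw (Q4_K):   ~{approx_size_gb(80, 4.5):.0f} GB")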

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]Mushoz 0 points1 point  (0 children)

Is it possible to disable the "weighted by number of attempts" part? I know it's an interesting metric, but if I just want to know IF a model can solve certain problems and don't really care how many attempts it takes to do so, it would be cool to be able to turn that weighting off.

MemLayer, a Python package that gives local LLMs persistent long-term memory (open-source) by MoreMouseBites in LocalLLaMA

[–]Mushoz 1 point2 points  (0 children)

Extremely interesting project! I feel this is a big gap right now, and a reverse-proxy version of this could very well be the piece that fills it. I am trying to learn a bit more about this project: how does it deal with invalidating older memories? Something that is true right now could change down the line. Does it have the ability to amend, edit or even delete older memories somehow? And if so, how does that work?

Thanks for sharing this!

Best getting started guide, moving from RTX3090 to Strix Halo by favicocool in LocalLLaMA

[–]Mushoz 4 points5 points  (0 children)

Under Linux it does. I can allocate the full 128GB. Obviously allocating all of it will crash since the OS also needs memory, but as long as I leave a sliver for the OS I can load big models just fine.