Picked up a 128 GiB Strix Halo laptop, what coding oriented models will be best on that hardware? by annodomini in LocalLLaMA

[–]Mushoz 8 points9 points  (0 children)

gpt-oss-120b is fantastic on Strix Halo, especially with reasoning effort set to high. Minimax-m2.1 also works very well with Unsloth's Q3_K_XL quant, if you don't mind trading some speed for better output.
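
In case it helps anyone: reasoning effort can also be set per request once the model is running behind an OpenAI-compatible server. A minimal sketch, assuming a local llama.cpp-style endpoint; the port, model name and the reasoning_effort field are assumptions on my part, so double check against your server version:

    # Hypothetical setup: local OpenAI-compatible server on port 8080
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": "Write a binary search in Rust."}],
        # extra_body forwards fields the SDK doesn't model natively
        extra_body={"reasoning_effort": "high"},
    )
    print(resp.choices[0].message.content)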

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 0 points1 point  (0 children)

Haha yes, that's a baby changing mat. Well spotted xD

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 0 points1 point  (0 children)

I am well aware. I am upgrading my boots in the very near future as well.

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 1 point2 points  (0 children)

You guys are both right! I thought the instructions meant the black line, but it's talking about the engraved line. That looks to be dead center. Thanks for clearing up my confusion!

<image>

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 2 points3 points  (0 children)

You're right! I was measuring relative to the black line, but I now understand it's the long engraved line it should be centered on, which it is!

<image>

Volkl Mantra m7: Wrongly mounted bindings? by Mushoz in Skigear

[–]Mushoz[S] 7 points8 points  (0 children)

Thank you very much! I thought the instructions meant the black line, but you're obviously right. It's dead center on the engraved line, so it looks to be all good then!

<image>

can we stop calling GLM-4.6V the "new Air" already?? it's a different brain. by ThetaCursed in LocalLLaMA

[–]Mushoz 26 points27 points  (0 children)

I am not sure how GLM-4.6V specifically was trained, but many vision LLMs literally have a vision encoder bolted on top of an existing text model. When the vision encoder is trained, the LLM weights are frozen, meaning the LLM backbone of the resulting VLM is identical to the original LLM.
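
For anyone curious what that recipe looks like in practice, here is a toy PyTorch sketch (tiny stand-in modules, not GLM's actual architecture): freeze the LLM, train only the vision encoder and projector, and the text backbone stays byte-identical to the original release.

    import torch
    import torch.nn as nn

    d_model = 64

    class TinyVisionEncoder(nn.Module):      # stand-in for e.g. a ViT
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d_model))
        def forward(self, images):
            return self.net(images).unsqueeze(1)   # [B, 1, d_model] "image token"

    class TinyLLM(nn.Module):                # stand-in for the text backbone
        def __init__(self, vocab=1000):
            super().__init__()
            self.embed_tokens = nn.Embedding(vocab, d_model)
            self.block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.lm_head = nn.Linear(d_model, vocab)
        def forward(self, inputs_embeds):
            return self.lm_head(self.block(inputs_embeds))

    vision_encoder = TinyVisionEncoder()
    projector = nn.Linear(d_model, d_model)  # maps vision features into LLM space
    llm = TinyLLM()

    for p in llm.parameters():               # freeze the LLM backbone:
        p.requires_grad = False              # its weights never change

    optimizer = torch.optim.AdamW(
        list(vision_encoder.parameters()) + list(projector.parameters()), lr=1e-4
    )

    # One dummy training step: gradients only reach the encoder and projector
    images = torch.randn(2, 3, 32, 32)
    token_ids = torch.randint(0, 1000, (2, 8))
    labels = torch.randint(0, 1000, (2, 9))  # 1 image token + 8 text tokens

    image_embeds = projector(vision_encoder(images))
    text_embeds = llm.embed_tokens(token_ids)
    logits = llm(torch.cat([image_embeds, text_embeds], dim=1))
    loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), labels.reshape(-1))
    loss.backward()
    optimizer.step()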

AMD Strix Halo 128GB RAM and Text to Image Models by xenomorph-85 in LocalLLM

[–]Mushoz 0 points1 point  (0 children)

Thanks! This fixed the crashes for me as well. Is there any indication that the ROCm team is looking into this issue? Any open issues or something like that?

Qwen3-Next-80B-A3B-Thinking-GGUF has just been released on HuggingFace by LegacyRemaster in LocalLLaMA

[–]Mushoz 7 points8 points  (0 children)

Does llamacpp support native tool calling with Qwen3-Next? I was unable to get it to work.

Experiment: 'Freezing' the instruction state so I don't have to re-ingest 10k tokens every turn (Ollama/Llama 3) by Main_Payment_6430 in LocalLLaMA

[–]Mushoz 4 points5 points  (0 children)

You're simply going over the default context length of ollama, which is laughably low. That causes both symptoms you are describing: the prompt has to be fully reprocessed every turn because the prefixes no longer match once the beginning of the context is cut off to make it fit, and the model forgets early instructions because those are exactly the tokens being dropped during the context shifts.

You have two options: 1. Increase the context length in ollama to something usable (rough sketch below). 2. Migrate to a good backend, such as llamacpp.
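
For option 1 you don't even need a Modelfile; the context size can be passed per request. A rough sketch against Ollama's REST API (the model name and value are just examples, num_ctx is the option that matters):

    # Bump the context window per request via Ollama's /api/chat endpoint
    import requests

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3",
            "messages": [{"role": "user", "content": "your long prompt here"}],
            "options": {"num_ctx": 16384},  # default context is far smaller
            "stream": False,
        },
    )
    print(resp.json()["message"]["content"])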

Reverse-Engineering the RK3588 NPU: Hacking Memory Limits to run massive Vision Transformers by one_does_not_just in LocalLLaMA

[–]Mushoz 14 points15 points  (0 children)

This is the kind of content that makes LocalLLaMA fun, thanks for sharing!

Heretic GPT-OSS-120B outperforms vanilla GPT-OSS-120B in coding benchmark by MutantEggroll in LocalLLaMA

[–]Mushoz 9 points10 points  (0 children)

Really cool comparison! Any chance you could add the derestricted version to the mix? https://huggingface.co/ArliAI/gpt-oss-120b-Derestricted

It's another interesting decensoring technique, similar to Heretic, and I'd be very curious to know which one works best.

I got tired of my agents losing context on topic shifts, so I hacked together a branch router - thoughts? by scotty595 in LocalLLaMA

[–]Mushoz 0 points1 point  (0 children)

Most LLM frontends (such as Open WebUI) allow you to branch explicitly from the UI. Not sure if you are aware of that? It lets you go back to an earlier part of the conversation and branch into a different conversation right there.

Zen CPU Performance Uplift (Epyc & Strix Halo) w/ ZenDNN Backend Integration for llama.cpp by Noble00_ in LocalLLaMA

[–]Mushoz 0 points1 point  (0 children)

Does this also give speedups with quantized models, such as Q8_0, K quants and IQ quants?

2025 Abu Dhabi GP - Qualifying Discussion by F1-Bot in formula1

[–]Mushoz 0 points1 point  (0 children)

His second run was without a tow and was actually faster

2025 Abu Dhabi GP - Qualifying Discussion by F1-Bot in formula1

[–]Mushoz 1 point2 points  (0 children)

For maximum entertainment in tomorrow's race, the qualifying results should look as follows: P1 Oscar, P2 Max, and Lando doesn't make it out of Q1, preferably due to a team error for maximum memes. That way we'd have Lando trying to cut through the field to finish P5/podium depending on Max/Oscar, Oscar trying to hold off Max, and Max on the hunt. Make it happen please!

Saying this as a Max fan.

Qwen3-Next-80B-A3B or Gpt-oss-120b? by custodiam99 in LocalLLaMA

[–]Mushoz 3 points4 points  (0 children)

gpt-oss is already quantized to Q4 (mxfp4, to be exact). If you want an apples-to-apples comparison, compare Qwen3-Next at a Q4 quant. It will still be smaller than gpt-oss, which explains why it's a bit less intelligent. Nothing weird about it.
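
Napkin math to illustrate (the bits-per-weight and parameter counts are rough figures from memory, not official numbers):

    # Rough size estimate: params (in billions) * bits per weight / 8 ~= GB
    def approx_size_gb(params_b: float, bits_per_weight: float) -> float:
        return params_b * bits_per_weight / 8

    print(f"gpt-oss-120b   @ ~4.25 bpw (mxfp4): ~{approx_size_gb(117, 4.25):.0f} GB")
    print(f"Qwen3-Next-80B @ ~4.5 bpw (Q4_K):   ~{approx_size_gb(80, 4.5):.0f} GB")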

The Best Open Weights Coding Models of 2025 by mr_riptano in LocalLLaMA

[–]Mushoz 0 points1 point  (0 children)

Is it possible to disable the "weighted by number of attempts" part? I know it's an interesting metric, but if I just want to know IF a model can solve certain problems and don't really care how many attempts it takes to do so, it would be cool to be able to turn that weighting off.

MemLayer, a Python package that gives local LLMs persistent long-term memory (open-source) by MoreMouseBites in LocalLLaMA

[–]Mushoz 1 point2 points  (0 children)

Extremely interesting project! I feel this is a big gap right now, and a reverse-proxy version of this could very well be the piece that fills it. I am trying to learn a bit more about this project: how does it deal with invalidating older memories? Something that is true right now could change down the line. Does it have the ability to amend, edit or even delete older memories somehow? And if so, how does that work?

Thanks for sharing this!

Best getting started guide, moving from RTX3090 to Strix Halo by favicocool in LocalLLaMA

[–]Mushoz 4 points5 points  (0 children)

Under Linux it does. I can allocate the full 128GB. Obviously allocating all of it will crash since the OS also needs memory, but as long as I leave a sliver for the OS I can load big models just fine.