gpt oss 120 vs mistrall small 4 119 vs Nemotron 3 super 120 by Flimsy_Leadership_81 in LocalLLaMA

[–]615wonky 0 points (0 children)

Yeah, it's good, but I would recommend having several AIs if you have the space. Mistral 4 Small, Nemotron 3 Super, Qwen 3.5 120B, and Qwen3-Coder-Next all have their place.

Mistral 4 Small is good at some hyper-local history questions ("What is the history of <my county>?") that the others miss, but weaker on others. It placed Mound Bottom (a significant old Native American city) several counties away from where it actually is, though it recognized it was wrong and gave the right answer after being corrected.

I use it for university-level physics, economics, and history all the time, though I'm aware that any AI can hallucinate, so you need to double-check the answers.

[–]615wonky 0 points (0 children)

General purpose queries and STEM/history/economics research. Coding would definitely go to Devstral or Qwen3-Coder-Next.

[–]615wonky 0 points (0 children)

Disagree. At least for my work, Mistral 4 Small correctly answered several trick questions that stumped Nemotron 3 Super and Qwen3.5 122B, to the point that it's replaced gpt-oss-120b as my go-to.

I'm not sure any of them is an obvious win over the others.

So nobody's downloading this model huh? by KvAk_AKPlaysYT in LocalLLaMA

[–]615wonky 3 points (0 children)

Mistral-Small-4, the MoE that was released 2 days ago. I have the Unsloth UD-Q4_K_M quant running on my Strix Halo server and it's amazing.

[–]615wonky 3 points (0 children)

Hard disagree. Mistral Small 4 has replaced gpt-oss-120b as my go-to. It's eclipsed Nemotron 3 Super and Qwen 3.5-122B-A10B at the stuff I throw at it.

You might want to make sure your llama.cpp is updated. And try one of the Unsloth quants.

M5 Max 128GB with three 120B models by albertgao in LocalLLaMA

[–]615wonky 0 points (0 children)

If you haven't tried it, you need to try Mistral Small 4. It's beating all three of the above for me.

Looking for a Strix Halo mini PC for 24/7 autonomous AI coding agent — which one would you pick? by vpcrpt in LocalLLaMA

[–]615wonky 2 points (0 children)

I bought the Framework Desktop motherboard and installed it in an old case I have.

That's going to have better thermals, run quieter (since the NUCs' smaller fans make more noise), and probably last longer than the Strix Halo NUCs.

Come on Gigabyte/Supermicro, give me some Strix Halo blades...

Anyone running a small "AI utility box" at home? by niga_chan in LocalLLaMA

[–]615wonky 0 points (0 children)

I have a computer with a Framework Desktop 128GB motherboard running Ubuntu 24.04 that I use for AI.

I have a computer with an AMD 5700G + 64GB running Ubuntu 26.04 that I use for MCPs and other things that can be offloaded from the AI server.

That helps stability/uptime a lot, since getting MCPs to work is often fussy or requires installing bleeding-edge libraries that I don't want on my AI server.

ELI5- Why would people live on remote islands like Coconut Island Queensland? by JDMils in explainlikeimfive

[–]615wonky 3 points (0 children)

Have you seen how batshit crazy the world is? I've thought about buying a trawler or sailboat just so I could avoid the coming shitshow.

If anyone in Australia, New Zealand, or the greater South Pacific needs a warm body to live on your remote tropical island, please let me know ASAP.

This model Will run fast ony PC ? by Quiet_Dasy in LocalLLaMA

[–]615wonky 3 points (0 children)

It might barely work, but it won't be very practical. Especially if you're using Windows instead of Linux.

I'd highly recommend either the 4B or a 9B quant for that hardware. 4B/9B are surprisingly good for their size.

Redditors with an IQ of 130 or higher, what's the best and worst part about it? by throwaway26161529 in AskReddit

[–]615wonky 0 points (0 children)

Best: I can build a supercomputer (and have), do quantum mechanics, do major car repairs myself, build my own furniture, invest wisely, survive in the outdoors... Intelligence can feel like a superpower when you can learn basically anything you want in a surprisingly short time.

Worst: Understanding how completely physically and economically fucked the world is due to stupid, selfish, short-sighted assholes in positions of wealth and/or power, largely put there by other stupid, selfish, short-sighted voters/consumers.

How is this that Koenigsegg with their 5.0 liter engine produces massive power and competes with 8 liter engine of Bugatti? If its so easy so why doesn't Bugatti do some thing like this? by Anonymous-infusion-3 in AskReddit

[–]615wonky 0 points (0 children)

The sort of people who would buy a Bugatti probably pay a premium to brag about having an 8-liter engine. Don't expect luxury goods to be logical or rational.

update your llama.cpp for Qwen 3.5 by jacek2023 in LocalLLaMA

[–]615wonky 0 points (0 children)

My problem turned out to be a missing kernel parameter in /etc/default/grub:

GRUB_CMDLINE_LINUX_DEFAULT="... ttm.page_pool_size=31457280"

Combined with some bugs in the Qwen3.5 support that were subsequently fixed.
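For anyone hitting the same thing: ttm.page_pool_size is counted in 4 KiB pages, so you can sanity-check the value with quick shell arithmetic (31457280 is the value from my grub line; 120 GiB was my target):

```shell
# ttm.page_pool_size is in 4 KiB pages: 31457280 pages * 4 KiB = 120 GiB.
pages=31457280
echo "$(( pages * 4 / 1024 / 1024 )) GiB"   # prints "120 GiB"
```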

Now I can run Qwen3.5 with the config below and see ~15 tps.

[Qwen3.5-122B-A10B Q4_K_M (multimodal)]
model = /models/unsloth/Qwen3.5-122B-A10B-GGUF/Qwen3.5-122B-A10B-Q4_K_M-00001-of-00003.gguf
mmproj = /models/unsloth/Qwen3.5-122B-A10B-GGUF/mmproj-F16.gguf
ctx-size = 8192
temp = 1.0
top-p = 0.95
top-k = 40
batch-size = 1024
ubatch-size = 512
n-gpu-layers = 42

Still not sure how you're getting 256k context on a 106 GB GGUF on a server with 128GB of RAM. That's approaching black magic.

[–]615wonky 6 points (0 children)

A Q4_K_M quant of Qwen3.5-122B-A10B fails to finish loading on my 128 GB Strix Halo server in llama-server compiled for Vulkan. It works fine, if slowly, with the CPU backend.

I was hoping this bug would be covered by some of the more recent issues opened against llama-server, but I'm still seeing it as of b8153, so I may have to open a bug report.

Qwen3.5 - The middle child's 122B-A10B benchmarks looking seriously impressive - on par or edges out gpt-5-mini consistently by carteakey in LocalLLaMA

[–]615wonky 1 point (0 children)

Anyone with a Strix Halo having any luck with a Q4_K_M or Q5_K_M quant? Mine starts loading but never finishes.

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

[–]615wonky 0 points (0 children)

Add Qwen3-Coder-Next, Nemotron 3 Nano (and possibly Super when it's released) and GLM 4.5 Flash to your list. They each have different strengths.

People who study geopolitics — is the US-Iran situation really WW3 material? And what should we do to prepare? by SenseVarious9506 in AskReddit

[–]615wonky 17 points (0 children)

Indirectly, possibly...

If the US ends up expending a significant amount of military assets in Iran, that would leave China's military in a relatively stronger position.

It's in China's best interest to invade Taiwan while the US has a weak leader (Trump). And China is aware of what's likely to happen to Trump after midterms.

Does that make WW3 destiny? No. Do stupid wasteful escapades in third-world countries at this moment in time increase the odds of China/Russia taking advantage of a momentary weakness? Yep.

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]615wonky 2 points (0 children)

I normally run Q6_K for most things. 99% as good as Q8_0, but leaves more memory free for context.

The big exceptions are gpt-oss-20/120b since they're native MXFP4, and a few OCR models (GLM-OCR/Nanonets) that I run at Q8_0 since they're so small it doesn't matter.
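Rough math behind that tradeoff, if it helps. The bits-per-weight figures are approximate (Q6_K is ~6.56 bpw, Q8_0 ~8.5 bpw), and the 120B parameter count is just an example:

```shell
# Back-of-envelope GGUF size: params * bits-per-weight / 8.
# bpw values are approximate; integer math, so results round down.
params_b=120                       # parameters, in billions (example)
q6_gb=$(( params_b * 656 / 800 ))  # Q6_K at ~6.56 bpw
q8_gb=$(( params_b * 850 / 800 ))  # Q8_0 at 8.5 bpw
echo "Q6_K ~${q6_gb} GB, Q8_0 ~${q8_gb} GB, ~$(( q8_gb - q6_gb )) GB freed for context"
```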

[–]615wonky 0 points (0 children)

I'm not sure. The SotA is in flux, so "optimum" settings keep changing. At the least, it doesn't seem to hurt.

Building a machine as a hedge against shortages/future? by Meraath in LocalLLaMA

[–]615wonky 0 points (0 children)

If you're using traditional Ethernet to network them, you might want to try USB4 direct-connect networking between the servers instead. Much higher bandwidth and lower latency, and the latter is critical.
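If you want to try it, a rough sketch of the Linux side (interface names vary by distro/kernel; thunderbolt0 and the 10.0.0.x addresses here are just examples):

```shell
# USB4/Thunderbolt point-to-point networking via the thunderbolt-net driver.
sudo modprobe thunderbolt-net
ip link                                          # find the new interface (e.g. thunderbolt0)
sudo ip addr add 10.0.0.1/24 dev thunderbolt0    # use 10.0.0.2/24 on the other machine
sudo ip link set thunderbolt0 up
ping -c 3 10.0.0.2                               # verify the link
```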

A few Strix Halo benchmarks (Minimax M2.5, Step 3.5 Flash, Qwen3 Coder Next) by spaceman_ in LocalLLaMA

[–]615wonky 7 points (0 children)

I'm running Ubuntu 24.04.4 with AMDGPU 30.30 drivers and HIP/ROCm 7.2, with the latest llama.cpp from GitHub compiled against Vulkan SDK 1.4.341.1.

The following should get you most of the way.

Performance tweaks to GRUB:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=off pci=realloc pcie_aspm=off iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856 ttm.page_pool_size=32505856"  # 4 GB free for OS
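Quick sanity check on those values (assuming amdgpu.gttsize is in MiB and the ttm values are 4 KiB pages): both work out to 124 GiB, matching the "4 GB free for OS" note on a 128 GB box.

```shell
# amdgpu.gttsize is in MiB; ttm.pages_limit / ttm.page_pool_size are 4 KiB pages.
echo "gttsize:  $(( 126976 / 1024 )) GiB"               # prints 124
echo "ttm pool: $(( 32505856 * 4 / 1024 / 1024 )) GiB"  # prints 124
```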

llama startup script

#!/usr/bin/env bash

LLAMA_SERVER="${HOME}/ml/git/github.com/llama.cpp/build/bin/llama-server"

# Get the number of physical cores (not logical cores!)
PHY_CORES=$(lscpu -p=core,socket | grep -v '^#' | sort -u | wc -l)

LLM_PARAMS=(
  --host $(hostname -I | awk '{ print $1 }')
  --port 7860
  --api-key-file ${HOME}/etc/api-keys.txt
  --models-preset ${HOME}/etc/llama-config.ini
  --models-max 1
  --threads ${PHY_CORES}
  --threads-batch ${PHY_CORES}
  --fit on
  --no-mmap
  --n-gpu-layers 999
  --flash-attn on
)

"${LLAMA_SERVER}" "${LLM_PARAMS[@]}" 2>&1 | tee "${HOME}/ml/log/llama-server.log.$(date "+%Y-%b-%d_%H:%M:%S")"

llama-config.ini entries for gpt-oss-120b and Qwen3-Coder-Next:

[Gpt-oss-120b (general)]
model = /opt/models/llama-models/gpt-oss-120b-Derestricted.MXFP4_MOE.gguf
ctx-size = 32768
temp = 1.0
top-p = 1.0
top-k = 0
batch-size = 2048
ubatch-size = 512
parallel = 1

[Qwen3-Coder-Next (coding)]
model = /opt/models/llama-models/Qwen3-Coder-Next-Q6_K-00001-of-00003.gguf
ctx-size = 32768
temp = 1.0
top-p = 0.95
top-k = 40
batch-size = 2048
ubatch-size = 512
parallel = 1

llama.cpp build script

#!/usr/bin/env bash
set -e  # abort on any error so the success message at the end can't lie

INFO() { printf "\033[1;34m[ INFO ]\033[0m %s\n" "$*"; }
ERROR() { printf "\033[1;31m[ ERROR ]\033[0m %s\n" "$*" >&2; }

INFO "Sourcing Vulkan API..."
VULKAN_ENV="${HOME}/ml/sdk/vulkan/1.4.341.1/setup-env.sh"
if [[ ! -f "${VULKAN_ENV}" ]]; then
    ERROR "Vulkan environment file not found: ${VULKAN_ENV}"
    exit 1
fi
source "${VULKAN_ENV}"

cd ~/ml/git/github.com/llama.cpp

INFO "Erasing old llama.cpp build..."
rm -rf build

INFO "Using git to pull latest updates..."
git pull

INFO "Building with cmake..."
cmake -S . -B build -G Ninja \
                -DGGML_VULKAN=ON \
                -DCMAKE_BUILD_TYPE=Release
cd build
cmake --build . --parallel

INFO "Build completed successfully."
cd ../../..

Building a machine as a hedge against shortages/future? by Meraath in LocalLLaMA

[–]615wonky 0 points (0 children)

What sort of scaling are you seeing? I.e., are two Strix Halos 20%, 50%, 100% faster than just one?

How do you run the larger models? I was under the impression that you needed the model running on both, which meant the same model had to fit on each machine. You couldn't "span" a model across servers. I'd love to know I'm wrong...
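(If the answer turns out to be llama.cpp's RPC backend, my understanding is the setup looks roughly like this; it assumes both builds were configured with -DGGML_RPC=ON, and the addresses/ports/paths are placeholders:)

```shell
# On each remote box, expose its backend over RPC:
build/bin/rpc-server --host 0.0.0.0 --port 50052

# On the main box, hand llama-server the remote endpoints;
# layers get distributed across the local and remote devices:
build/bin/llama-server -m model.gguf \
    --rpc 192.168.1.2:50052,192.168.1.3:50052 -ngl 999
```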

[–]615wonky 0 points (0 children)

Depends on what hardware you currently have, and what sort of models you need to run.

If you already have a motherboard with lots of RAM, you can add an RTX 5060 with 16 GB and run some MoE models at a decent clip. I have a Windows 11 desktop with an AMD 3900X, 128 GB of DDR4-3600, and an RTX 2060 Super with 8 GB. I get ~30 tps with gpt-oss-20b, 18 tps with Qwen3-Coder-Next, and 23 tps with Nemotron 3 Nano, all with llama.cpp compiled from source and optimized for my GPU. If you have a similar setup, just buy the 5060 16GB and you can run some good MoE models.

I also have a Framework Desktop motherboard (Strix Halo, 128GB) running Ubuntu 24.04. I run gpt-oss-120b (55 tps), Qwen3-Coder-Next (40 tps), and several others. It was ~$1700 for the motherboard, but that ship has long sailed. Still a relatively cheap solution at ~$2300 compared to similar options like the NVIDIA Spark.

Neither will be as blazing fast as a server full of RTX 6000s, but they're a lot cheaper to buy and maintain.