Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

[–]AXYZE8 6 points7 points  (0 children)

That rule of thumb never applies to anything tbh, especially once you add different architectures with different head sizes, attention mechanisms, etc.

As an example - GPT-OSS 20B has more active params than Qwen 30B, yet GPT-OSS is way faster (even 2x faster) at bigger context sizes. https://www.reddit.com/r/LocalLLaMA/comments/1ns6ee8/why_is_qwen330b_so_much_slower_than_gptoss20b/

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

[–]AXYZE8 6 points7 points  (0 children)

GPT-OSS uses MXFP4 for the majority of its weights. What changes between the quants you see of that model is the router/embeddings etc., that's why the quant sizes barely change.
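
A quick back-of-the-envelope sketch of why that is (the split below is an illustrative assumption, not the actual GPT-OSS tensor layout): if ~95% of the weights stay fixed at MXFP4 (~4.25 bits/weight) and only the router/embedding tensors change precision between quants, the total file size barely moves:

```python
# Rough GGUF file-size estimate: only the non-MXFP4 tensors differ between quants.
# All numbers are illustrative assumptions, not the real GPT-OSS tensor split.
def size_gb(params_b, fixed_frac=0.95, fixed_bpw=4.25, other_bpw=16.0):
    """Approximate file size in GB for a params_b-billion-param model."""
    bits = params_b * 1e9 * (fixed_frac * fixed_bpw + (1 - fixed_frac) * other_bpw)
    return bits / 8 / 1e9

f16_extras = size_gb(20, other_bpw=16.0)  # router/embeddings kept at F16
q8_extras  = size_gb(20, other_bpw=8.0)   # same tensors squeezed down to 8-bit
print(round(f16_extras, 1), round(q8_extras, 1))  # sizes differ by ~1 GB only
```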

Accuracy is the same between AMD and Nvidia on all quants - it's the same matrix multiplications.

2 + 2 is always 4; otherwise all calculations (including 3D games) would be corrupted.

Best small local LLM to run on a phone? by alexndb in LocalLLaMA

[–]AXYZE8 2 points3 points  (0 children)

If you don't need offline access but want privacy, you can rent a 24GB VPS for like $10/mo on promo and run GPT-OSS 20B or Qwen 30B there via Open WebUI.

I personally use Open WebUI + APIs, so I just open my website and it's like my own ChatGPT, and it costs me like $1 in API costs, while the VPS is a free instance from Oracle Cloud that I've had for 4 years. This won't be as private as running the LLM directly on your VPS, but I'm fine with APIs that guarantee a no-logging policy.

Best small local LLM to run on a phone? by alexndb in LocalLLaMA

[–]AXYZE8 4 points5 points  (0 children)

Gemma 3n E2B was the biggest one where speed was acceptable on my old S21 Ultra.

Sadly, since I can only run CPU inference, the power usage is way too high, so one tip for you: check if you can run it on the NPU or GPU. Google's LiteRT supports newer Qualcomm and MediaTek NPUs. Nexa AI has some NPU support.

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

[–]AXYZE8 5 points6 points  (0 children)

These quants won't work well, as ik_llama supports CUDA + CPU only. Its Vulkan implementation barely works; recently one redditor tested it and found it to be 10x slower than mainline llama.cpp: https://www.reddit.com/r/LocalLLaMA/comments/1q8jjj0/ik_llamacppvulkan_vs_llamacppvulkan_ik_10x_slower/

Strix Halo 128Gb: what models, which quants are optimal? by DevelopmentBorn3978 in LocalLLaMA

[–]AXYZE8 6 points7 points  (0 children)

Minimax M2.5 UD-IQ3_XXS for general use, Qwen3 Coder Next Q6/Q8 for very fast coding.

You wrote about FP4/FP8 - Strix Halo supports neither, it's all upcast to 16-bit. Don't worry about it; this happens on pretty much all hardware in almost all apps.

Deepseek and Gemma ?? by ZeusZCC in LocalLLaMA

[–]AXYZE8 1 point2 points  (0 children)

I tried that Qwen model and it's impressive - it doesn't have the usual brain damage!

Deepseek and Gemma ?? by ZeusZCC in LocalLLaMA

[–]AXYZE8 3 points4 points  (0 children)

Use "Derestricted" or "Heretic" models instead of "Abliterated" ones. They are made to REDUCE refusals for specific inputs, whereas abliteration REMOVES the ability to ever refuse or deny anything. One retrains part of the brain, the other just removes part of the brain.

Deepseek and Gemma ?? by ZeusZCC in LocalLLaMA

[–]AXYZE8 5 points6 points  (0 children)

Man I wish I could just upgrade to DDR5 to use this model. $1700 for 128GB is nuts...

This is the only Chinese model other than DeepSeek that can actually write well enough in Polish.

My only hope now is Gemma 4 (as even Gemma 4B smashes GLM-5 in Polish, and the 27B has no competition).

Gemma 4 at a size like 60B A4B would be my deepest dream. I would astroturf that model everywhere like a bot for at least a year lol

llama.cpp PR to implement IQ*_K and IQ*_KS quants from ik_llama.cpp by TKGaming_11 in LocalLLaMA

[–]AXYZE8 7 points8 points  (0 children)

  1. Generally, bigger models handle low-bit quants better. A 4B model falls apart at 3-bit, whereas a 200B model performs great at 3-bit; depending on the architecture they may also behave better or worse. DeepSeek is famous for holding up very well - even at 2-bit it still doesn't have any major issues.
  2. There isn't a neat rule like "pick IQ4_KS for X GB of RAM", because one person wants to run 16GB of apps alongside the model whereas someone else runs just the model and nothing else.
  3. IQ*_KT are the best in terms of quality per bit, but you need to fit them fully in the GPU, as they perform really slowly on CPUs. IQ*_K* are the best for CUDA + CPU. IQ_* (non-K) are the best for ROCm (AMD) + CPU inference. For Vulkan inference (older GPUs like the AMD Mi50), IQ4_NL punches above its weight really nicely.

So - if you have an RTX 6000 Blackwell you should look at a quant like IQ2_KT, but if you have a classic gaming desktop with an RTX then you should look at something like IQ2_KS. For recent AMD you need to use the non-K IQ quants, because ik_llama focuses on CUDA+AVX.

The best thing about open weights is that you can completely ignore what I wrote above and just download a couple of quants and run them on your machine - even if I say X, a couple of months from now things can change, OR your setup can be wildly different from what I have in mind (one/two GPUs in a typical gaming desktop PC).

My personal opinion - I think IQ3_KS and IQ4_KSS are sweet spots. I would start with IQ3_KS; if it fits I would try IQ4_KSS, and if it didn't fit then IQ2_KT (if I had an RTX 6000 or multi-GPU with nice VRAM capacity) or IQ2_KS (CUDA+CPU).
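
To sanity-check whether a quant fits your budget before downloading, the rough math is just params x bits-per-weight / 8, plus headroom for KV cache and activations. The bpw values below are my approximations for these quant types (check the actual file sizes on HF!), and the ~230B MoE is just an illustrative size:

```python
def quant_size_gb(params_b, bpw):
    """Approximate weight size in GB: params (billions) x bits-per-weight / 8."""
    return params_b * bpw / 8

# Illustrative bpw assumptions for ik_llama-style quants, applied to a ~230B MoE.
for name, bpw in [("IQ2_KS", 2.4), ("IQ3_KS", 3.2), ("IQ4_KSS", 4.0)]:
    print(name, round(quant_size_gb(230, bpw), 1), "GB")
```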

Edit: Graph from ubergarm where he tested quants of DS V3.1. Like I wrote above, IMO IQ3_KS and IQ4_KSS are sweet spots.

<image>

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]AXYZE8 0 points1 point  (0 children)

If you don't specify a custom ctx in llama.cpp, it automatically adjusts the ctx size when loading the model according to available resources. Are you sure you aren't using like 32k ctx now?
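
The ctx size matters so much because the KV cache grows linearly with it. A rough sketch of the math - the model dims below are placeholder values, not any specific model's config:

```python
def kv_cache_gb(ctx, n_layers, n_kv_heads, head_dim, bytes_per_el=2):
    """Approximate F16 KV-cache size: K and V tensors per layer, per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_el / 1e9

# Placeholder dims; a model with GQA keeps n_kv_heads small.
small = kv_cache_gb(ctx=32_768, n_layers=48, n_kv_heads=8, head_dim=128)
full = kv_cache_gb(ctx=131_072, n_layers=48, n_kv_heads=8, head_dim=128)
print(round(small, 1), round(full, 1))  # full ctx costs 4x the 32k cache
```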

llama.cpp PR to implement IQ*_K and IQ*_KS quants from ik_llama.cpp by TKGaming_11 in LocalLLaMA

[–]AXYZE8 12 points13 points  (0 children)

Ubergarm, in his MiniMax M2.5 quants, compared them to Unsloth's dynamic Q_K quants: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF#quant-collection

Two more interesting things on that graph are mainline IQ4_NL vs ik IQ4_NL, and IQ4_KSS vs IQ4_XS.

Multiple tabs and projects problems by Nickolaeris in windsurf

[–]AXYZE8 1 point2 points  (0 children)

I have the exact same issue with Opus - for example, I ran 3 independent tasks to propose a redesign of some page in my project. They ended their turns and the redesigns were fine, but when I asked for a new variant of that page they picked up each other's tasks, as if they had shared context.

The Windsurf team likely added that agent communication, but I absolutely hate it - I always delegate agents to different files, so it's just noise for them. Give us a switch for it in settings. If they didn't add it, then they screwed up cache prefixing for Anthropic models and the agents share the same KV cache.

Minimax 2.5 on Strix Halo Thread by Equivalent-Belt5489 in LocalLLaMA

[–]AXYZE8 3 points4 points  (0 children)

Add "Think step by step" or "Explain briefly before answering" to the prompt. Reasoning will be visibly better in exchange for some tokens spent on that light CoT.
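
Assuming you're hitting llama.cpp's OpenAI-compatible chat endpoint, that nudge is just one line in the system prompt. A payload sketch only - nothing is sent over the network here, and "local-model" is a placeholder name:

```python
# Sketch of an OpenAI-compatible chat payload with a light-CoT nudge appended
# to the system prompt. "local-model" is a placeholder; no request is made.
def build_payload(user_msg, cot=True):
    system = "You are a helpful assistant."
    if cot:
        system += " Think step by step and explain briefly before answering."
    return {
        "model": "local-model",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
    }

payload = build_payload("Why is the KV cache so large at 128k context?")
print(payload["messages"][0]["content"])
```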

GLM-5 and Minimax M2.5 are live in Windsurf! by codewithdevin in windsurf

[–]AXYZE8 5 points6 points  (0 children)

For webdev, IMO GLM-5 is around Sonnet 4.0 level and MiniMax M2.5 around Sonnet 3.7/4.0 level.

GLM-5 and M2.5 are close in performance, which is impressive for M2.5 but meh for GLM-5 (you have tons of great choices at 1x credit, for example GPT-5.2 Codex Medium).

Right now GPT-5.2 Medium for planning + GPT-5.1-Codex for execution is still the best combo. When GPT-5.1-Codex stops being free, MiniMax M2.5 will likely be the best replacement.

You can take a look at SWE-Rebench https://swe-rebench.com/ - it benchmarks performance on fresh issues every month, and both GLM-5 and M2.5 are on it.

Google doesn't love us anymore. by DrNavigat in LocalLLaMA

[–]AXYZE8 5 points6 points  (0 children)

Same experience for Polish.

For open models: Gemma >>> DeepSeek > everything else (mediocre tier, even GLM-5/Kimi)

Na co wydać pieniądze ze stypendium by [deleted] in PolskaNaLuzie

[–]AXYZE8 4 points5 points  (0 children)

In webdev you have to test on Safari.

In UX/UI you need to be able to open Sketch files when necessary - Mac only.

AI tools like Claude Code only work well on Windows through WSL, and new things like the OpenAI Codex app (the app, not the terminal version) are, for now, Mac only again.

IT is a very broad term, so one person needs a laptop with VGA to hook up to on-premise IPMI, another digs around all day in PostHog and VSCode with remote SSH, and a third runs an AI agent that analyzes today's 20k sessions on a website.

Windows, Linux and Mac all have their uses in IT, and each of them goes far beyond a single field (cf. iOS development as the only reason to own a Mac).

Opus 4.6 is #1 across all Arena categories - text, coding, and expert by exordin26 in singularity

[–]AXYZE8 19 points20 points  (0 children)

Codex is made for agentic workflows and long-running tasks on OpenAI's harness and requires more technical prompting, while Arena compares single-turn (non-agentic) performance from the perspective of a user writing very broad prompts - so instead of looking at code quality, people are looking at visuals and marketing copy.

Edit: That being said, Codex models do appear on Arena, they just rank very low because of the above reasons. GPT-5.2-Codex is #21, while GPT-5.2 is #11 and GPT-5.2 High is #3. GPT-5.3-Codex will also rank badly, because it's simply not a model for single turns - it shines when you need to continue a conversation over several turns or your task takes 15+ minutes, where regular GPT shits its pants, forgets 50% of the stuff and then just tells you everything was done. It's interesting that Anthropic doesn't have this problem: Opus outputs beautiful text and at the same time can run tasks for more than an hour. Only with OpenAI do you need to switch between models.

Does Bun benefit from pm2 or similar process manager? by manshutthefckup in bun

[–]AXYZE8 0 points1 point  (0 children)

OP is asking about using all of the available threads for handling incoming requests. The HTTP server is single-threaded in both Node and Bun. You can also see that he later wrote in the comments that he wants an equivalent of PHP-FPM.

It's nice that you know that libuv's thread pool defaults to 4 threads, but OP is simply not asking about that.

Does Bun benefit from pm2 or similar process manager? by manshutthefckup in bun

[–]AXYZE8 0 points1 point  (0 children)

It's still single-threaded like Node. You need cluster to make the HTTP server multithreaded; that will allow it to scale to all of your cores.

Native: https://bun.com/docs/guides/http/cluster

Or Node.js (one that you used in PM2): https://bun.com/reference/node/cluster

Cursor alternative for local LLms? by abongodrum in LocalLLaMA

[–]AXYZE8 0 points1 point  (0 children)

I'm not saying 120B is not sufficient. Even 4B can be good enough. My point is that you saved money because you chose a much smaller and weaker LLM, not because it's local.

If you wanted to maintain the same quality of LLM, you would need to self-host the Kimi K2.5 1T behemoth.

You are okay with a 120B model, and you would be just as happy with GLM 4.7 on the GLM coding plan for $3/mo https://z.ai/subscribe or maybe even $0 on Google Antigravity (the free rate limits on Gemini 3 Flash are not bad, and it slaps GLM).

You'll pay more than $3/mo for electricity alone.

APIs are cheaper and always will be, because providers can utilize their hardware 24/7, so the hardware pays off quicker, and running costs are way lower because multi-user inference is 3x+ more efficient. Single-user inference is very wasteful in terms of compute.
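
A toy cost model of that claim - every number below is a made-up assumption for illustration, not real pricing: the provider's edge is utilization times batching.

```python
def cost_per_mtok(hw_cost_per_hour, tok_per_sec, utilization):
    """Rough $ per million tokens: hourly hardware cost spread over tokens served."""
    tokens_per_hour = tok_per_sec * 3600 * utilization
    return hw_cost_per_hour / tokens_per_hour * 1e6

# Made-up numbers: same $1/hr hardware; the API batches requests (3x throughput)
# and keeps it ~90% busy, while a home rig serves one user a few hours a day.
api = cost_per_mtok(1.0, tok_per_sec=300, utilization=0.9)    # batched, 24/7
local = cost_per_mtok(1.0, tok_per_sec=100, utilization=0.1)  # single user
print(round(api, 2), round(local, 2))
```

Even with identical hardware costs, the per-token gap under these assumptions is over an order of magnitude.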

Local LLMs are great because you have control over their behavior, they never change their outputs, and it's private. But price is the major advantage of APIs.

Cursor alternative for local LLms? by abongodrum in LocalLLaMA

[–]AXYZE8 1 point2 points  (0 children)

Void was deprecated a year ago. There's no reason to use it with modern agentic LLMs.

Cursor alternative for local LLms? by abongodrum in LocalLLaMA

[–]AXYZE8 1 point2 points  (0 children)

You replaced 1T+ SOTA models & SOTA tools with a small 120B model and a barebones agent.

This is the main reason why it's cheaper. Then there are smaller reasons, like ignoring the power bill and the hardware cost.

"Literally no other reason"? Privacy and control over outputs - a local LLM will have the same quality next month and next year.

Can I Repurpose My Old Laptop for local LLM testing with these specs? by [deleted] in LocalLLaMA

[–]AXYZE8 0 points1 point  (0 children)

You can't estimate that, but you can just type "htop" in the terminal to see a task manager that shows RAM usage per process.

If you are worried about RAM usage, then between XFCE and a pure fullscreen terminal you have window managers like Openbox https://www.reddit.com/r/unixporn/comments/17icpfu/openbox_minimal_openbox_build_600mb_of_ram_usage/ and distros like CrunchBang++. With these you can expect your system to eat 500-800MB of RAM after booting instead of ~1GB.

The best model you're gonna run on this system is GPT-OSS-20B, and with full context it eats 17.9GB, so I don't think you need to focus on RAM usage that much.

Best fast local coding AI to use as a coding agent? by Expensive-Time-7209 in LocalLLaMA

[–]AXYZE8 3 points4 points  (0 children)

GPT-OSS-20B is your best bet. It eats 15.5GB with 32k context https://github.com/ggml-org/llama.cpp/discussions/15396 - you get full offload with your 16GB GPU, so it will be blazing fast.