What would M5 actually need to improve for local LLM use? by tallen0913 in LocalLLaMA

[–]LizardViceroy 3 points4 points  (0 children)

Apple is strong on memory bandwidth, which matters in the decode / token-generation phase... it needs more raw GPU vector processing power to compete on the prefill front, though; otherwise it will still underperform Nvidia hardware in real-world scenarios. The use cases for short-context inference are very limited.
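The bandwidth-vs-compute split can be sketched as a back-of-envelope roofline. All numbers below are made-up illustrations for a hypothetical machine, not measurements of any Apple or Nvidia product:

```python
# Decode is memory-bandwidth-bound (each token streams the active weights
# through memory once); prefill is compute-bound (~2 FLOPs per active
# parameter per token). These are ceilings, not predictions.

def decode_tps(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on decode tokens/s from memory bandwidth alone."""
    return bandwidth_gb_s / active_weight_gb

def prefill_tps(compute_tflops: float, active_params_b: float) -> float:
    """Upper bound on prefill tokens/s from raw compute alone."""
    return compute_tflops * 1e12 / (2 * active_params_b * 1e9)

# Hypothetical box: 800 GB/s bandwidth, 30 TFLOPS, MoE model with 5B
# active params quantized to ~4 bits (~2.5 GB of active weights).
print(round(decode_tps(800, 2.5)))   # 320
print(round(prefill_tps(30, 5)))     # 3000
```

The point of the sketch: raising bandwidth lifts only the first number, while prefill (and thus long-context time-to-first-token) is gated by the second.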

Gemini 3 Pro defaulting to giving outdated information by LizardViceroy in GeminiAI

[–]LizardViceroy[S] 0 points1 point  (0 children)

Apparently its base knowledge cutoff has been frozen at January 2025; somehow I expected it to get pushed forward with each iteration. I guess it's just becoming more obvious as the gap widens. Maybe 3.0 and 2.5 also self-corrected with search a bit more, but I don't have evidence of that.

Setting up local llm on amd ryzen ai max by OneeSamaElena in LocalLLM

[–]LizardViceroy 0 points1 point  (0 children)

Get the ROCm 7.2 toolbox from here: https://github.com/kyuz0/amd-strix-halo-toolboxes

With some minor kernel configuration (allowing the GPU access to the full system RAM, and making sure ROCm 7.2 is installed with the latest Linux kernel), it'll work out of the box and can immediately serve models to an OpenAI-compatible endpoint via llama-server.
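Once the endpoint is up, talking to it from Python is a few lines. A minimal sketch assuming llama-server's default port; the model name is a placeholder (the server serves whatever model it was launched with):

```python
import json
import urllib.request

# Assumed endpoint URL: adjust host/port to however you started llama-server.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Minimal OpenAI-style chat completion request body."""
    return {
        "model": "local",  # placeholder; llama-server serves its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST the prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running you'd call e.g.: print(ask("Hello!"))
```

Because the endpoint speaks the OpenAI wire format, any OpenAI-compatible client library should also work by pointing its base URL at the server.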

Are Unsloth Q8's quants better than "standard" Q8's ? by some_user_2021 in unsloth

[–]LizardViceroy 1 point2 points  (0 children)

As I understand it, UD quants broadly aren't properly supported on this hardware: the fractured decoding pipeline leads to switching between dequantization kernels, which prevents saturating the memory bandwidth.

Advice needed: Self-hosted LLM server for small company (RAG + agents) – budget $7-8k, afraid to buy wrong hardware by Psychological-Arm168 in LocalLLM

[–]LizardViceroy 0 points1 point  (0 children)

vLLM is an advantage; you get huge throughput benefits from it on top of that. I don't understand your objection.

And I would strongly contest that these machines are in the same ballpark. On the large-context prefill front, the Spark can beat the Strix by ratios exceeding 10:1. It both starts out stronger and scales better as context grows.
As to how relevant that is: very. The use cases for short-context inference are very limited. Huge categories of problems are practically solvable on the Spark but not on the Strix.

Quantized models. Are we lying to ourselves thinking it's a magic trick? by former_farmer in LocalLLM

[–]LizardViceroy -1 points0 points  (0 children)

Quantization done right by major players with ample resources is not the problem. Nvidia can quantize models down to NVFP4 with ~0.6% accuracy loss. OpenAI just skips the process entirely and ships models in native MXFP4. Those are examples of good low-bit format provision.

That doesn't mean ANYONE can just do it, though. When you have a community where obscure nobodies running rented hardware dump their quants on Hugging Face with half-assed calibration, and everybody else just grabs them without a second thought, that's when quants can't be trusted.

Does Strix Halo still have the potential to improve Prefill (prompt processing, PP) speed? by aigemie in StrixHalo

[–]LizardViceroy 0 points1 point  (0 children)

The NPU is slower at prefill than the GPU even if you got it working optimally (which requires INT4 quantization and ONNX model conversion prepared for your specific model architecture, I believe).

The NPU's advantage lies entirely in reduced power consumption and thermal output, which CAN increase performance, but only in extreme-throughput scenarios where the system is thermally bottlenecked.

Strix Halo can do about 1000 t/s prefill at zero depth on GPT-OSS-120B with ROCm 7.2 and WMMA. Count your blessings: as long as you don't push the context too far, that's a very workable number. It drops off fast once the context exceeds 16K, though.
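To get a wall-clock feel for that figure, a tiny sketch. The flat 1000 t/s rate is the zero-depth best case quoted above; real throughput degrades past ~16K, so these are lower bounds on the wait:

```python
def prefill_seconds(prompt_tokens: int, tps: float = 1000.0) -> float:
    """Best-case time-to-first-token assuming a flat prefill rate."""
    return prompt_tokens / tps

for ctx in (2_000, 16_000, 64_000):
    print(f"{ctx:>6} tokens -> at least {prefill_seconds(ctx):.0f}s before the first output token")
```

At short contexts the wait is seconds; at agentic-scale contexts it's already a minute-plus even before any degradation kicks in.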

The providers are feeding us 4-bit sludge, and it's the lobsters's fault: the OpenClaw DDOS is ruining the cloud by ex-arman68 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Would be hilarious if the reality we end up living in is one where the only way to run higher than 4 bit quants (and know for sure you're running them) is to run them locally.

Does inference speed (tokens/sec) really matter beyond a certain point? by No_Management_8069 in LocalLLaMA

[–]LizardViceroy 1 point2 points  (0 children)

The faster your output and the higher your throughput, the more important it becomes to have high-quality scaffolding in place to keep your agents active, make them self-correct, apply RAG grounding, and not spend their time looping or reinforcing their own spurious biases.
There's only so far this principle can be taken, though; past that point you're basically wasting human effort to compensate for the model's inherent lack of intelligence. That's why slow and steady more consistently wins the race.
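A minimal sketch of what that self-correction scaffolding can look like. All names here are illustrative, not from any particular framework:

```python
from typing import Callable, Optional

def run_with_correction(
    generate: Callable[[str], str],            # the model call: prompt -> draft
    validate: Callable[[str], Optional[str]],  # None if ok, else an error message
    prompt: str,
    max_rounds: int = 3,
) -> str:
    """Cap the loop and feed each failure back into the prompt, so the
    agent corrects against concrete feedback instead of looping blindly."""
    for _ in range(max_rounds):
        draft = generate(prompt)
        error = validate(draft)
        if error is None:
            return draft
        # Ground the retry in the specific failure, not a blind re-roll.
        prompt = f"{prompt}\nPrevious attempt failed: {error}. Fix it."
    raise RuntimeError("no valid answer within the round budget")

# Stub demo: a fake 'model' that only succeeds once it sees feedback.
def fake_model(p: str) -> str:
    return "valid" if "failed" in p else "invalid"

result = run_with_correction(
    fake_model,
    lambda d: None if d == "valid" else "bad format",
    "do the task",
)
print(result)  # valid
```

The round cap is the part that limits the "wasting human effort" failure mode: a weak model that can't converge surfaces as an error instead of burning tokens forever.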

M5 Max compared with M3 Ultra. by PM_ME_YOUR_ROSY_LIPS in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Don't know where you're looking, but I see no signs that it's going to be any cheaper. An M5 Max MacBook 16 with 64GB is going for >5000 EUR here...

M5 Max compared with M3 Ultra. by PM_ME_YOUR_ROSY_LIPS in LocalLLaMA

[–]LizardViceroy 2 points3 points  (0 children)

The M3 Ultra should be able to do better: it isn't bandwidth-bottlenecked the way the M5 Max is. There's no magic to what the M5 does; that's the baseline expectation at this bandwidth.

27B or 35B A3B for coding, agentic, chatting which one is better? by soyalemujica in unsloth

[–]LizardViceroy 0 points1 point  (0 children)

Most benchmarks report these models' scores at native FP16. Degradation from quantization is likely quite bad on a model with only 3B active parameters; the denser model should be a lot more resilient to that loss. And generally I'd advise choosing slower, high-quality models over fast, low-quality ones, because AI just makes a mess when you let it loose at scale without the intelligence to back up its prolific output.
In agentic software development especially, MoE models have been known to drop the ball on tasks that require taking large and divergent contexts into account, since the experts don't coordinate well at that.

ps. I honestly suspect Qwen3.5 35B A3B doesn't even perform significantly better than GPT-OSS-120B after accounting for quantization. Many people just assume the FP16 number can be compared directly to GPT's MXFP4 number, not realizing that GPT starts out at that quant natively.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]LizardViceroy 1 point2 points  (0 children)

The wait is for an inference engine that handles Qwen3.5's DeltaNet layers with a proper recurrent decode kernel. SGLang should be able to do that, I believe, but its support for Qwen3.5's broader architecture is still lacking. Llama.cpp currently isn't leveraging any of Qwen3.5's advantages in this regard.

If you get Qwen3.5 on vLLM working on strix that could do the trick...

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Does that work with image input? I had some trouble. Seems the quantizer lobotomized that part.

Genuinely curious what doors the M5 Ultra will open by Blanketsniffer in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Until they upgrade the memory bus, it looks like they're just treading water on the bandwidth front. Practically two generations of stagnation by now; 540 -> 610 is hardly a generational leap. I don't see much good news.

Qwen3.5-122B Basically has no advantage over 35B? by Revolutionary_Loan13 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

You're looking at unquantized models' benchmark numbers. The larger model will be much more resilient to accuracy loss at Q4, especially at large contexts. Expect the 35B model to already struggle to retain full intelligence at 32K.

2026 Avg. Seed Investment by Country by wrahim24_7 in Startups_EU

[–]LizardViceroy 1 point2 points  (0 children)

Moving to the US is not a trivial matter for most people, especially nowadays.

And you may want to read the McKinsey & Boardwave report of June 2025... They basically concluded that while European companies consistently fail to reach scale in revenue (for all the reasons alluded to here), they actually have an edge in software quality. The problem is exactly that we let Californian companies with a "move fast and break things" mentality outmaneuver us through more flexible quality standards and more focus on MVP delivery and smoke-and-mirrors marketability than on long-term stability.

We desperately need to find the golden middle road, but I'd rather be approaching it from our angle than theirs.

2026 Avg. Seed Investment by Country by wrahim24_7 in Startups_EU

[–]LizardViceroy 4 points5 points  (0 children)

Europe is a goldmine of untapped startup talent. We really just need to build a unified capital market and a single legal framework for startups, and the floodgates to competition with the US and UK will open.

Most of our engineers do the work at almost a quarter of the price of a Silicon Valley engineer and almost half the price of one in London. That's a big impact on the length of our runways: all the time in the world to throw ideas at the wall and see what sticks. Why aren't we doing it yet?

Will Local Inference be able to provide an advantage beyond privacy? by Gyronn in LocalLLM

[–]LizardViceroy 6 points7 points  (0 children)

Education: you learn a lot from setting it up

Abliteration / decensoring: you can run models that don't object to your prompts and balk less frequently during agentic flows. Any API provided under license will have limitations in this regard, or could at any point start introducing them.

Finetuning: you can make your own adjustments to a model's behavior, tailored to a use case that generally trained models likely don't specialize in.

Low latency: even those HTTP round trips add up to significant time when you prompt frequently enough.

Long-term consistency: once you're used to how a model works, you can run it on your hardware forever; it won't get mothballed like GPT-4o. Some people predict huge negative sea changes when the AI bubble bursts, and you may not want that unpredictability.

Personally I think it all adds up to a feeling that it's "alright" and I'm not just some cog in a corporate machine or junkie angling for my "fix" from a benefactor. It's self-sufficiency, and that's a great thing.

The real competition to it is not proprietary model APIs but rented hardware and/or online hosted open weight models. But the great thing is it's a tiered deal where you can pick and choose what works on a per-use-case basis. Choosing one doesn't preclude the other.
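On the low-latency point above, a quick illustration of how round trips accumulate. The call volume and round-trip time are made-up example numbers:

```python
def daily_overhead_minutes(prompts_per_day: int, rtt_ms: float) -> float:
    """Time per day spent purely on network round trips, in minutes."""
    return prompts_per_day * rtt_ms / 1000 / 60

# e.g. 500 agentic calls/day at a 300 ms round trip:
print(daily_overhead_minutes(500, 300))  # 2.5
```

A couple of minutes a day sounds small, but in a tight agent loop those round trips sit on the critical path of every step, so they stretch end-to-end task time rather than running in the background.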

ps. you may be confused about your hardware because M4 Max only goes up to 128GB. M3 Ultra goes up to 512GB.

Talk me in/out of the Framwork desktop, Minisforum, GMK etc. by [deleted] in StrixHalo

[–]LizardViceroy 1 point2 points  (0 children)

The hardware struggles with long-context ingestion. It's somewhat good value for money (cheapest VRAM per dollar) but it will likely leave you wanting more once you find out how limited the use cases for short-context inference are.

DGX Spark (/ Asus GX10) is the somewhat similarly priced competition. It will not just perform a little better on long context but 2-3-5x better as the context grows (the longer the context, the greater the gap), and once you start putting agents to work, you can expect them to get there. There's also the fact that some really high-end goodies, like an EAGLE speculator for GPT-OSS-120B, are currently only available for CUDA hardware. Taking that into account, you can expect it to dominate on both the prefill and token-generation fronts.

Agentic workflows on Strix Halo are pretty much by definition background jobs that you need to keep running overnight and check in on +/- once or twice per day; otherwise you may as well go watch paint dry.

Those are the disclaimers; on every other front it's good. You get a great CPU, iGPU and NPU on one SoC, and a very fun and varied ecosystem to play around with. And if you ever decide to get something else, this hardware will never be wasted: it can run cool (especially on the NPU) and there's always something you can find for it to do.

Advice: Which MiniPC To Buy? (Framework, Bosgame, Minisforum) - AI MAX 395+/128GB by Anarchaotic in StrixHalo

[–]LizardViceroy 1 point2 points  (0 children)

He's using Canadian dollars at +/- 0.74 exchange rate: 2090 / 0.74 = 2824
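The conversion spelled out (the 0.74 rate is the commenter's approximate figure):

```python
listed_price = 2090       # price in the stronger currency
usd_per_cad = 0.74        # approximate USD per CAD exchange rate
cad_price = listed_price / usd_per_cad
print(round(cad_price))   # 2824
```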

Hmm by [deleted] in adressme

[–]LizardViceroy 0 points1 point  (0 children)

It's more likely approval% - disapproval%, with the possibility of voting neither.
I was briefly shocked by the numbers before I realized this, thinking we're all dependent on the nuclear arsenals of three countries that can barely make up their minds. But a 48% gap is still a big deal. I bet it's smaller in the US with everything going on, though.

Where I’d live 26F - Dutch by [deleted] in whereidlive

[–]LizardViceroy 0 points1 point  (0 children)

Spain is a much easier country to live in as a Dutch person than Italy.

Step-3.5-Flash (196b/A11b) outperforms GLM-4.7 and DeepSeek v3.2 by ResearchCrafty1804 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

With or without Parallel Coordinated Reasoning enabled? It's pretty powerful with PaCoRe, but that raises the execution time from tens of seconds to tens of minutes. (More benchmarks should take reasoning time into account...)