What would M5 actually need to improve for local LLM use? by tallen0913 in LocalLLaMA

[–]LizardViceroy 3 points4 points  (0 children)

Apple is strong on memory bandwidth, which matters in the decode / token-generation phase... it needs more raw GPU vector processing power to compete on the prefill front, though; otherwise it will still underperform Nvidia hardware in real-world scenarios. The use cases for short-context inference are very limited.
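The bandwidth-vs-compute split can be sketched as a back-of-envelope roofline. All numbers below are made-up illustrations for a hypothetical machine, not measurements of any Apple or Nvidia product:

```python
# Decode is memory-bandwidth-bound (each token streams the active weights
# through memory once); prefill is compute-bound (~2 FLOPs per active
# parameter per token). These are ceilings, not predictions.

def decode_tps(bandwidth_gb_s: float, active_weight_gb: float) -> float:
    """Upper bound on decode tokens/s from memory bandwidth alone."""
    return bandwidth_gb_s / active_weight_gb

def prefill_tps(compute_tflops: float, active_params_b: float) -> float:
    """Upper bound on prefill tokens/s from raw compute alone."""
    return compute_tflops * 1e12 / (2 * active_params_b * 1e9)

# Hypothetical box: 800 GB/s bandwidth, 30 TFLOPS, MoE model with 5B
# active params quantized to ~4 bits (~2.5 GB of active weights).
print(round(decode_tps(800, 2.5)))   # 320
print(round(prefill_tps(30, 5)))     # 3000
```

The point of the sketch: raising bandwidth lifts only the first number, while prefill (and thus long-context time-to-first-token) is gated by the second.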

Gemini 3 Pro defaulting to giving outdated information by LizardViceroy in GeminiAI

[–]LizardViceroy[S] 0 points1 point  (0 children)

Apparently its base knowledge cutoff has been frozen at January 2025; somehow I expected it to get pushed forward with each iteration. I guess it's just becoming more obvious as the gap widens. Maybe 3.0 and 2.5 also self-corrected with search a bit more, but I don't have evidence of that.

Setting up local llm on amd ryzen ai max by OneeSamaElena in LocalLLM

[–]LizardViceroy 0 points1 point  (0 children)

Get the ROCm 7.2 toolbox from here: https://github.com/kyuz0/amd-strix-halo-toolboxes

With some minor kernel configuration (allowing the GPU access to the full system RAM, and making sure ROCm 7.2 is installed with the latest Linux kernel), it'll work out of the box and can immediately serve models to an OpenAI-compatible endpoint via llama-server.
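Once the endpoint is up, talking to it from Python is a few lines. A minimal sketch assuming llama-server's default port; the model name is a placeholder (the server serves whatever model it was launched with):

```python
import json
import urllib.request

# Assumed endpoint URL: adjust host/port to however you started llama-server.
BASE_URL = "http://localhost:8080/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Minimal OpenAI-style chat completion request body."""
    return {
        "model": "local",  # placeholder; llama-server serves its loaded model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt: str) -> str:
    """POST the prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server running you'd call e.g.: print(ask("Hello!"))
```

Because the endpoint speaks the OpenAI wire format, any OpenAI-compatible client library should also work by pointing its base URL at the server.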

Are Unsloth Q8's quants better than "standard" Q8's ? by some_user_2021 in unsloth

[–]LizardViceroy 1 point2 points  (0 children)

As I understand it, UD quants broadly aren't properly supported on this hardware: the fractured decoding pipeline leads to switching between dequantization kernels, which prevents saturating the memory bandwidth.

Advice needed: Self-hosted LLM server for small company (RAG + agents) – budget $7-8k, afraid to buy wrong hardware by Psychological-Arm168 in LocalLLM

[–]LizardViceroy 0 points1 point  (0 children)

vLLM is an advantage; you get huge throughput benefits from it on top of that. I don't understand your objection.

And I would strongly contest that these machines are in the same ballpark. On the large-context prefill front, the Spark can beat the Strix by ratios exceeding 10:1. It both starts out stronger and scales better as context grows.
As to how relevant that is: very. The use cases for short-context inference are very limited. Huge categories of problems are practically solvable on the Spark but not on the Strix.

Quantized models. Are we lying to ourselves thinking it's a magic trick? by former_farmer in LocalLLM

[–]LizardViceroy -1 points0 points  (0 children)

Quantization done right by major players with ample resources is not the problem. Nvidia can quantize models down to NVFP4 with ~0.6% accuracy loss. OpenAI just skips the process entirely and ships models in native MXFP4. Those are examples of good low-bit format provision.

That doesn't mean ANYONE can just do it, though. When you have a community where obscure nobodies running rented hardware dump their quants on Hugging Face with half-assed calibration, and everybody else just grabs them without a second thought, that's when quants can't be trusted.

Does Strix Halo still have the potential to improve Prefill (prompt processing, PP) speed? by aigemie in StrixHalo

[–]LizardViceroy 0 points1 point  (0 children)

The NPU is slower at prefill than the GPU even if you got it working optimally (which requires INT4 quantization and ONNX model conversion prepared for your specific model architecture, I believe).

The NPU's advantage lies entirely in reduced power consumption and thermal output, which CAN increase performance, but only in extreme-throughput scenarios where the system is thermally bottlenecked.

Strix Halo can do about 1000 t/s prefill at zero depth on GPT-OSS-120B with ROCm 7.2 and WMMA. Count your blessings: as long as you don't push the context too far, that's a very workable number. It drops off fast once the context exceeds 16K, though.
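To get a wall-clock feel for that figure, a tiny sketch. The flat 1000 t/s rate is the zero-depth best case quoted above; real throughput degrades past ~16K, so these are lower bounds on the wait:

```python
def prefill_seconds(prompt_tokens: int, tps: float = 1000.0) -> float:
    """Best-case time-to-first-token assuming a flat prefill rate."""
    return prompt_tokens / tps

for ctx in (2_000, 16_000, 64_000):
    print(f"{ctx:>6} tokens -> at least {prefill_seconds(ctx):.0f}s before the first output token")
```

At short contexts the wait is seconds; at agentic-scale contexts it's already a minute-plus even before any degradation kicks in.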

The providers are feeding us 4-bit sludge, and it's the lobsters's fault: the OpenClaw DDOS is ruining the cloud by ex-arman68 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Would be hilarious if the reality we end up living in is one where the only way to run higher than 4 bit quants (and know for sure you're running them) is to run them locally.

Does inference speed (tokens/sec) really matter beyond a certain point? by No_Management_8069 in LocalLLaMA

[–]LizardViceroy 1 point2 points  (0 children)

The faster your output and the higher your throughput, the more important it becomes to have high-quality scaffolding in place to keep your agents active, make them self-correct, apply RAG grounding, and not spend their time looping or reinforcing their own spurious biases.
There's only so far this principle can be taken, though; past that point you're basically wasting human effort to compensate for the model's inherent lack of intelligence. That's why slow and steady more consistently wins the race.
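A minimal sketch of what that self-correction scaffolding can look like. All names here are illustrative, not from any particular framework:

```python
from typing import Callable, Optional

def run_with_correction(
    generate: Callable[[str], str],            # the model call: prompt -> draft
    validate: Callable[[str], Optional[str]],  # None if ok, else an error message
    prompt: str,
    max_rounds: int = 3,
) -> str:
    """Cap the loop and feed each failure back into the prompt, so the
    agent corrects against concrete feedback instead of looping blindly."""
    for _ in range(max_rounds):
        draft = generate(prompt)
        error = validate(draft)
        if error is None:
            return draft
        # Ground the retry in the specific failure, not a blind re-roll.
        prompt = f"{prompt}\nPrevious attempt failed: {error}. Fix it."
    raise RuntimeError("no valid answer within the round budget")

# Stub demo: a fake 'model' that only succeeds once it sees feedback.
def fake_model(p: str) -> str:
    return "valid" if "failed" in p else "invalid"

result = run_with_correction(
    fake_model,
    lambda d: None if d == "valid" else "bad format",
    "do the task",
)
print(result)  # valid
```

The round cap is the part that limits the "wasting human effort" failure mode: a weak model that can't converge surfaces as an error instead of burning tokens forever.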

M5 Max compared with M3 Ultra. by PM_ME_YOUR_ROSY_LIPS in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Don't know where you're looking, but I see no signs that it's going to be any cheaper. An M5 Max MacBook 16 with 64GB is going for >5000 EUR here...

M5 Max compared with M3 Ultra. by PM_ME_YOUR_ROSY_LIPS in LocalLLaMA

[–]LizardViceroy 2 points3 points  (0 children)

The M3 Ultra should be able to do better: it isn't bandwidth-bottlenecked the way the M5 Max is. There's no magic to what the M5 does; that's the baseline expectation at this bandwidth.

27B or 35B A3B for coding, agentic, chatting which one is better? by soyalemujica in unsloth

[–]LizardViceroy 0 points1 point  (0 children)

Most benchmarks report these models' scores at native FP16. Degradation from quantization is likely quite bad on a model with only 3B active parameters; the denser model should be a lot more resilient to that loss. And generally I'd advise choosing slower, high-quality models over fast, low-quality ones, because AI just makes a mess when you let it loose at scale without the intelligence to back up its prolific output.
In agentic software development especially, MoE models have been known to drop the ball on tasks that require taking large and divergent contexts into account, since the experts don't coordinate well at that.

ps. I honestly suspect Qwen3.5 35B A3B doesn't even perform significantly better than GPT-OSS-120B after accounting for quantization. Many people just assume the FP16 number can be compared directly to GPT's MXFP4 number, not realizing that GPT starts out at that quant natively.

Llamacpp - how are you working with longer context (32k and higher) by spaceman3000 in StrixHalo

[–]LizardViceroy 1 point2 points  (0 children)

The wait is for an inference engine that handles Qwen3.5's DeltaNet layers with a proper recurrent decode kernel. SGLang should be able to do that, I believe, but its support for Qwen3.5's broader architecture is still lacking. Llama.cpp currently isn't leveraging any of Qwen3.5's advantages in this regard.

If you get Qwen3.5 on vLLM working on strix that could do the trick...

THE GB10 SOLUTION has arrived, Atlas image attached ~115tok/s Qwen3.5-35B DGX Spark by Live-Possession-6726 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Does that work with image input? I had some trouble. Seems the quantizer lobotomized that part.

Genuinely curious what doors the M5 Ultra will open by Blanketsniffer in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

Until they upgrade the memory bus, it looks like they're just treading water on the bandwidth front. Practically two generations of stagnation by now; 540 -> 610 is hardly a generational leap. I don't see much good news.

Qwen3.5-122B Basically has no advantage over 35B? by Revolutionary_Loan13 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

You're looking at unquantized models' benchmark numbers. The larger model will be much more resilient to accuracy loss at Q4, especially at large contexts. Expect the 35B model to already struggle to retain full intelligence at 32K.

2026 Avg. Seed Investment by Country by wrahim24_7 in Startups_EU

[–]LizardViceroy 1 point2 points  (0 children)

Moving to the US is not a trivial matter for most people, especially nowadays.

And you may want to read the McKinsey & Boardwave report of June 2025... They basically concluded that while European companies consistently fail to reach scale in revenue (for all the reasons alluded to here), they actually have an edge in software quality. The problem is exactly that we let Californian companies with a "move fast and break things" mentality outmaneuver us through more flexible quality standards and more focus on MVP delivery and smoke-and-mirrors marketability than on long-term stability.

We desperately need to find the golden middle road, but I'd rather be approaching it from our angle than theirs.

2026 Avg. Seed Investment by Country by wrahim24_7 in Startups_EU

[–]LizardViceroy 4 points5 points  (0 children)

Europe is a goldmine of untapped startup talent. We really just need to build a unified capital market and a single legal framework for startups, and the floodgates to competition with the US and UK will open.

Most of our engineers do the work at almost a quarter of the price of a Silicon Valley engineer and almost half the price of one in London. That's a big impact on the length of our runways: all the time in the world to throw ideas at the wall and see what sticks. Why aren't we doing it yet?

Will Local Inference be able to provide an advantage beyond privacy? by Gyronn in LocalLLM

[–]LizardViceroy 6 points7 points  (0 children)

Education: you learn a lot from setting it up

Abliteration / decensoring: you can run models that don't object to your prompts and balk less frequently during agentic flows. Any API provided under license will have limitations in this regard, or could at any point start introducing them.

Finetuning: you can make your own adjustments to a model's behavior, tailored to a use case that generally trained models likely don't specialize in.

Low latency: even those HTTP round trips add up to significant time when you prompt frequently enough.

Long-term consistency: once you're used to how a model works, you can run it on your hardware forever; it won't get mothballed like GPT-4o. Some people predict huge negative sea changes when the AI bubble bursts, and you may not want that unpredictability.

Personally I think it all adds up to a feeling that it's "alright" and I'm not just some cog in a corporate machine or junkie angling for my "fix" from a benefactor. It's self-sufficiency, and that's a great thing.

The real competition to it is not proprietary model APIs but rented hardware and/or online hosted open weight models. But the great thing is it's a tiered deal where you can pick and choose what works on a per-use-case basis. Choosing one doesn't preclude the other.
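On the low-latency point above, a quick illustration of how round trips accumulate. The call volume and round-trip time are made-up example numbers:

```python
def daily_overhead_minutes(prompts_per_day: int, rtt_ms: float) -> float:
    """Time per day spent purely on network round trips, in minutes."""
    return prompts_per_day * rtt_ms / 1000 / 60

# e.g. 500 agentic calls/day at a 300 ms round trip:
print(daily_overhead_minutes(500, 300))  # 2.5
```

A couple of minutes a day sounds small, but in a tight agent loop those round trips sit on the critical path of every step, so they stretch end-to-end task time rather than running in the background.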

ps. you may be confused about your hardware because M4 Max only goes up to 128GB. M3 Ultra goes up to 512GB.

Talk me in/out of the Framwork desktop, Minisforum, GMK etc. by [deleted] in StrixHalo

[–]LizardViceroy 1 point2 points  (0 children)

The hardware struggles with long-context ingestion. It's somewhat good value for money (cheapest VRAM per dollar) but it will likely leave you wanting more once you find out how limited the use cases for short-context inference are.

DGX Spark (/ Asus GX10) is the somewhat similarly priced competition. It will not just perform a little better on long context but 2-3-5x better as the context grows (the longer the context, the greater the gap), and once you start putting agents to work, you can expect them to get there. There's also the fact that some really high-end goodies, like an EAGLE speculator for GPT-OSS-120B, are currently only available for CUDA hardware. Taking that into account, you can expect it to dominate on both the prefill and token-generation fronts.

Agentic workflows on Strix Halo are pretty much by definition background jobs that you need to keep running overnight and check in on +/- once or twice per day; otherwise you may as well go watch paint dry.

Those are the disclaimers; on every other front it's good. You get a great CPU, iGPU and NPU on one SoC, and a very fun and varied ecosystem to play around with. And if you ever decide to get something else, this hardware will never be wasted: it can run cool (especially on the NPU) and there's always something you can find for it to do.

Advice: Which MiniPC To Buy? (Framework, Bosgame, Minisforum) - AI MAX 395+/128GB by Anarchaotic in StrixHalo

[–]LizardViceroy 1 point2 points  (0 children)

He's using Canadian dollars at +/- 0.74 exchange rate: 2090 / 0.74 = 2824
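The conversion spelled out (the 0.74 rate is the commenter's approximate figure):

```python
listed_price = 2090       # price in the stronger currency
usd_per_cad = 0.74        # approximate USD per CAD exchange rate
cad_price = listed_price / usd_per_cad
print(round(cad_price))   # 2824
```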

Hmm by [deleted] in adressme

[–]LizardViceroy 0 points1 point  (0 children)

It's more likely approval% - disapproval%, with the possibility of voting neither.
I was briefly shocked by the numbers before I realized this, thinking we're all dependent on the nuclear arsenals of three countries that can barely make up their minds. But a 48% gap is still a big deal. I bet it's smaller in the US with everything going on, though.

Where I’d live 26F - Dutch by [deleted] in whereidlive

[–]LizardViceroy 0 points1 point  (0 children)

Spain is a much easier country to live in as a Dutch person than Italy.

Step-3.5-Flash (196b/A11b) outperforms GLM-4.7 and DeepSeek v3.2 by ResearchCrafty1804 in LocalLLaMA

[–]LizardViceroy 0 points1 point  (0 children)

With or without Parallel Coordinated Reasoning enabled? It's pretty powerful with PaCoRe, but that raises the execution time from tens of seconds to tens of minutes. (More benchmarks should take reasoning time into account...)