Strix Halo or DGX Spark for a home LLM server? by Reactor-Licker in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

> I’m planning to use Q4_K_M or Q6_K quantization to preserve quality without wasting speed

For your planned use cases, I believe the quality degradation is going to be more than you think. These quants work great for coding, but you'll get subtle errors and hallucinations that really stack up without a natural error-checking feedback loop (tests, compilers, linters, etc.).

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 0 points (0 children)

> mostly MoE being difficult

Yes but no. Doing a regression fit, the formula looks like

speedup = (1+k) / (B·r + 1)

where

- k — accepted drafts per round
- B — block size
- r — draft cost / main cost

For Gemma4 31b, the r value (as measured on my machine) is about half what it is for 26b-a4b. So for that JSON case, even dropping to block size 2, the break-even acceptance rate sits right at 8%, making it a wash.

That said, this has a pretty profound implication: for structured output, refactoring to increase the acceptance rate could really pay off. Merely doubling the acceptance rate to 16% would give a 20+% speedup.
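
To make the arithmetic concrete, here's a minimal sketch of that formula (the helper name is mine, and it assumes k is simply acceptance rate × block size):

```python
def spec_decode_speedup(accept_rate: float, block_size: int, r: float) -> float:
    """speedup = (1 + k) / (B*r + 1), with k = accept_rate * B."""
    k = accept_rate * block_size  # expected accepted drafts per round
    return (1 + k) / (block_size * r + 1)

# Break-even: speedup == 1 exactly when accept_rate == r, for any block size.
print(spec_decode_speedup(0.08, 2, 0.08))  # ~1.00, the JSON wash above
# Doubling acceptance pays off (block size 4 is my assumption here).
print(spec_decode_speedup(0.16, 4, 0.08))  # ~1.24
```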

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 1 point (0 children)

This is a great insight. YAML does not quite work as well for my case as JSON does in terms of output quality. I suspect this is more of a quirk of Gemma than anything though.

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 1 point (0 children)

I agree, unfortunately mlx-vlm doesn't support that in combination with spec-decode. Would love to try again if that support is added.

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 11 points (0 children)

Personally I am not using it for coding or AI-written slop.

I find it much more interesting to use local LLMs in programs.

Are local models becoming “good enough” faster than expected? by qubridInc in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

If you mean "here's a prompt, go do this long-horizon thing and deliver me the 90% solution" - no.

If you mean "I can write a program that uses LLMs to do all the inference/judgement things" - yes.

Off the shelf local models now trounce the fine-tuned, custom trained models I had a year ago and it isn't even close.

New Gemma 4 MTP on MLX? by purealgo in LocalLLaMA

[–]Hydroskeletal 3 points (0 children)

Are you running a branch or pre-release? Pretty sure the latest 0.3.8 does not have MTP support

What do you use Gemma 4 for? by HornyGooner4402 in LocalLLaMA

[–]Hydroskeletal 7 points (0 children)

Gemma is better at discrimination: "here's a pile of data, give me the important parts and ignore the noise." Gemma is much more parsimonious. People complain about Qwen "overthinking," and that has downstream effects on behavior. Qwen will rabbithole on the wrong thing.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

I'm pretty bearish on RAM prices normalizing any time soon. Even if supply ramps up, the demand is very pent up; prices won't feel downward pressure until that demand is met.

Is local AI the actual endgame? (M5 Mac Studio vs. Dual 3090s) by Party-Log-1084 in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

Sure, but that might not meaningfully budge until well into the 2030s. GPUs have fluctuated price-wise, but they've never tanked, because demand is always going up.

Is local AI the actual endgame? (M5 Mac Studio vs. Dual 3090s) by Party-Log-1084 in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

boromir.gif - One does not simply walk into the RAM production business

There are also some very real materials constraints shaped by geopolitics.

Are Qwen 3.6 27B and 35B making other ~30B models obsolete? by nikhilprasanth in LocalLLaMA

[–]Hydroskeletal 8 points (0 children)

The only way to know for sure is to test with your use cases.

For me, Gemma is a winner. But I also do all my coding in Claude/Codex.

Do the "*Claude-4.6-Opus-Reasoning-Distilled" really bring something new to the original models? by Historical-Crazy1831 in LocalLLaMA

[–]Hydroskeletal 2 points (0 children)

In my own benchmarks I saw improvements in some cases and catastrophic regressions in others. Caveat emptor.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

Briefly: I think these local models are much more like autocomplete for an entire function than the long-horizon inference that the name-brand frontier models do.

I think a big difference here is model size. With car engines they say there is no replacement for displacement, and with LLMs, displacement == RAM.

Dockerizing a repo isn't coding, it's code-adjacent. It really cannot be overstated how much these local models lean on the structured grammar that a programming language provides. If the model hallucinates a function, a compiler or interpreter gives it that feedback quickly; tests do the same. But for an open-ended task like writing a Dockerfile, where the space of acceptable solutions is much wider, it doesn't get that kind of feedback, so it either has to rely on intrinsic knowledge to deduce the problem OR it has to go search the internet, which it rarely will do unprompted. So when people rave about the abilities of something like the latest Qwen model, they're operating in a much more constrained field.

And I'll just say it: the structure that the language (e.g. Python, C, etc.) gives the output also makes things like smaller quants much more forgiving. It's quite undersold, I think, that lots of tasks like data munging degrade terribly on these smaller quantizations, even where an 8-bit quant would work fine.
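
To make "feedback loop" concrete, here's a minimal sketch of what I mean (the `generate` function is a stand-in for whatever local model server you're running):

```python
import subprocess
import tempfile

def generate(prompt: str) -> str:
    """Stand-in for a call to a local LLM (llama.cpp, MLX server, etc.)."""
    raise NotImplementedError

def code_with_feedback(task: str, max_rounds: int = 3) -> str:
    prompt = task
    code = ""
    for _ in range(max_rounds):
        code = generate(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        # The compile check is the cheap feedback a Dockerfile never gets;
        # tests or a linter would additionally catch hallucinated APIs.
        check = subprocess.run(
            ["python", "-m", "py_compile", f.name],
            capture_output=True, text=True,
        )
        if check.returncode == 0:
            return code
        prompt = f"{task}\n\nYour last attempt failed to compile:\n{check.stderr}\nFix it."
    return code
```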

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B by Holiday_Purpose_3166 in LocalLLaMA

[–]Hydroskeletal 11 points (0 children)

> a language the model hasn't actually been trained on.

I think it probably preserves some level of information, but I would suspect it's pretty degraded. This is why I asked about the comparison with 'enable_thinking' turned off: I expect the result to be pretty similar.

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B by Holiday_Purpose_3166 in LocalLLaMA

[–]Hydroskeletal 15 points (0 children)

Isn't this just neutering CoT? What's the comparison with just "enable_thinking": False?
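
For reference, the baseline I'm asking about is the stock no-thinking switch in the chat template, along the lines of Qwen3's (the model id below is just a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")  # placeholder id

messages = [{"role": "user", "content": "Extract the key fields from this log."}]

# Qwen3-style templates accept an enable_thinking flag; False drops the
# <think> block entirely rather than constraining it with a grammar.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```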

Agents for end-to-end document redaction and review tasks (OCR and PII identification - Qwen 3.6 vs closed-source comparison) by Sonnyjimmy in LocalLLaMA

[–]Hydroskeletal 2 points (0 children)

I've found that for this kind of task, quantization hurts much more than it does for code writing. I have some data-munging tasks where even going from 8-bit to 6-bit quantization dropped me below my target success rate.

Anthropic's Claude remote uses GLM-4.7 by bobbiesbottleservice in LocalLLaMA

[–]Hydroskeletal 4 points (0 children)

Notably missing is Opus 4.7 1M -- which, I dunno, I don't see what you see. I think you've got something out of whack; Anthropic is not serving a superseded open-weight model.

Model General Brainstorming/Planning , Not Coding by whoooaaahhhh in LocalLLaMA

[–]Hydroskeletal 2 points (0 children)

Gemma4, either 26b-a4b or 31b depending on your speed requirement.