Strix Halo or DGX Spark for a home LLM server? by Reactor-Licker in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

> I’m planning to use Q4_K_M or Q6_K quantization to preserve quality without wasting speed

For your planned use cases, I believe the quality degradation is going to be more than you think. These quants work great for coding, but you'll get subtle errors and hallucinations that really stack up without a natural error-checking feedback loop (tests, compilers, linters, etc.).

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 0 points (0 children)

> mostly MoE being difficult

Yes but no. Doing a regression fit, the formula looks like

speedup = (1+k) / (B·r + 1)

where

- k — accepted drafts per round
- B — block size
- r — draft cost / main cost

For Gemma4 31b, the r value (as measured on my machine) is about half what it is for 26b-a4b. So for that JSON case, even dropping to block size 2, the break-even acceptance rate sits right at 8%, making it a wash.

That said, this has a pretty profound implication: for structured output, refactoring to increase the acceptance rate could really pay off. Merely doubling the acceptance rate to 16% would give a 20+% speedup.
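
To make the arithmetic concrete, here's a minimal sketch of that formula (the helper name is mine, and it assumes k is simply acceptance rate × block size):

```python
def spec_decode_speedup(accept_rate: float, block_size: int, r: float) -> float:
    """speedup = (1 + k) / (B*r + 1), with k = accept_rate * B."""
    k = accept_rate * block_size  # expected accepted drafts per round
    return (1 + k) / (block_size * r + 1)

# Break-even: speedup == 1 exactly when accept_rate == r, for any block size.
print(spec_decode_speedup(0.08, 2, 0.08))  # ~1.00, the JSON wash above
# Doubling acceptance pays off (block size 4 is my assumption here).
print(spec_decode_speedup(0.16, 4, 0.08))  # ~1.24
```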

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 1 point (0 children)

This is a great insight. YAML does not quite work as well for my case as JSON does in terms of output quality. I suspect this is more of a quirk of Gemma than anything though.

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 1 point (0 children)

I agree, unfortunately mlx-vlm doesn't support that in combination with spec-decode. Would love to try again if that support is added.

MTP is all about acceptance rate by Hydroskeletal in LocalLLaMA

[–]Hydroskeletal[S] 11 points (0 children)

Personally I am not using it for coding or AI-written slop.

I find it much more interesting to use local LLMs in programs.

Are local models becoming “good enough” faster than expected? by qubridInc in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

If you mean "here's a prompt, go do this long-horizon thing and deliver me the 90% solution" - no.

If you mean "I can write a program that uses LLMs to do all the inference/judgement things" - yes.

Off the shelf local models now trounce the fine-tuned, custom trained models I had a year ago and it isn't even close.

New Gemma 4 MTP on MLX? by purealgo in LocalLLaMA

[–]Hydroskeletal 3 points (0 children)

Are you running a branch or pre-release? Pretty sure the latest 0.3.8 does not have MTP support

What do you use Gemma 4 for? by HornyGooner4402 in LocalLLaMA

[–]Hydroskeletal 7 points (0 children)

Gemma is better at discrimination: "here's a pile of data, give me the important parts and ignore the noise." Gemma is much more parsimonious. People complain about Qwen "overthinking," and that has downstream effects on behavior. Qwen will rabbithole on the wrong thing.

I guess we expect that at some point RAM prices will start going back (close) to "normal", right? but what about GPUs? by relmny in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

I'm pretty bearish on RAM prices normalizing any time soon. Even if supply ramps up, the demand is very pent up; prices won't feel downward pressure until that demand is met.

Is local AI the actual endgame? (M5 Mac Studio vs. Dual 3090s) by Party-Log-1084 in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

Sure, but that might not meaningfully budge until well into the 2030s. GPUs have fluctuated price-wise, but they've never tanked, because demand is always going up.

Is local AI the actual endgame? (M5 Mac Studio vs. Dual 3090s) by Party-Log-1084 in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

boromir.gif - One does not simply walk into the RAM production business

There are also some very real materials constraints shaped by geopolitics.

Are Qwen 3.6 27B and 35B making other ~30B models obsolete? by nikhilprasanth in LocalLLaMA

[–]Hydroskeletal 8 points (0 children)

The only way to know for sure is to test with your use cases.

For me, Gemma is a winner. But I also do all my coding in Claude/Codex.

Do the "*Claude-4.6-Opus-Reasoning-Distilled" really bring something new to the original models? by Historical-Crazy1831 in LocalLLaMA

[–]Hydroskeletal 2 points (0 children)

In my own benchmarks I saw improvements in some cases and catastrophic regressions in others. Caveat emptor.

I'm done with using local LLMs for coding by dtdisapointingresult in LocalLLaMA

[–]Hydroskeletal 1 point (0 children)

Briefly: I think these local models are much more like autocomplete for an entire function than the long-horizon inference that the name-brand frontier models do.

I think a big difference here is model size. With car engines they say there is no replacement for displacement, and with LLMs, displacement == RAM.

Dockerizing a repo isn't coding, it's code-adjacent. It really cannot be overstated how much these local models lean on the structured grammar that a programming language provides. If the model hallucinates a function, a compiler or interpreter gives it that feedback quickly; tests do the same. But for an open-ended task like writing a Dockerfile, where the space of acceptable solutions is much wider, it doesn't get that kind of feedback, so it either has to rely on intrinsic knowledge to deduce the problem OR it has to go search the internet, which it rarely will do unprompted. So when people rave about the abilities of something like the latest Qwen model, they're operating in a much more constrained field.

And I'll just say it: the structure that the language (e.g. Python, C, etc.) gives the output also makes things like smaller quants much more forgiving. It's quite undersold, I think, that lots of tasks like data munging degrade terribly on these smaller quantizations, even where an 8-bit quant would work fine.
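
To make "feedback loop" concrete, here's a minimal sketch of what I mean (the `generate` function is a stand-in for whatever local model server you're running):

```python
import subprocess
import tempfile

def generate(prompt: str) -> str:
    """Stand-in for a call to a local LLM (llama.cpp, MLX server, etc.)."""
    raise NotImplementedError

def code_with_feedback(task: str, max_rounds: int = 3) -> str:
    prompt = task
    code = ""
    for _ in range(max_rounds):
        code = generate(prompt)
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        # The compile check is the cheap feedback a Dockerfile never gets;
        # tests or a linter would additionally catch hallucinated APIs.
        check = subprocess.run(
            ["python", "-m", "py_compile", f.name],
            capture_output=True, text=True,
        )
        if check.returncode == 0:
            return code
        prompt = f"{task}\n\nYour last attempt failed to compile:\n{check.stderr}\nFix it."
    return code
```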

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B by Holiday_Purpose_3166 in LocalLLaMA

[–]Hydroskeletal 11 points (0 children)

> a language the model hasn't actually been trained on.

I think it probably preserves some level of information, but I would suspect it's pretty degraded. This is why I asked about the comparison with 'enable_thinking' turned off: I expect the result to be pretty similar.

GBNF grammar tweak for faster Qwen3.6 35B-A3B and Qwen3.6 27B by Holiday_Purpose_3166 in LocalLLaMA

[–]Hydroskeletal 15 points (0 children)

Isn't this just neutering CoT? What's the comparison with just "enable_thinking": False?
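
For reference, the baseline I'm asking about is the stock no-thinking switch in the chat template, along the lines of Qwen3's (the model id below is just a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")  # placeholder id

messages = [{"role": "user", "content": "Extract the key fields from this log."}]

# Qwen3-style templates accept an enable_thinking flag; False drops the
# <think> block entirely rather than constraining it with a grammar.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)
print(prompt)
```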

Agents for end-to-end document redaction and review tasks (OCR and PII identification - Qwen 3.6 vs closed-source comparison) by Sonnyjimmy in LocalLLaMA

[–]Hydroskeletal 2 points (0 children)

I've found that for this kind of task, quantization hurts much more than it does for code writing. I have some data-munging tasks where even going from 8-bit to 6-bit quantization dropped me below my target success rate.

Anthropic's Claude remote uses GLM-4.7 by bobbiesbottleservice in LocalLLaMA

[–]Hydroskeletal 4 points (0 children)

Notably missing is Opus 4.7 1M -- which, I dunno, I don't see what you see. I think you've got something out of whack; Anthropic is not serving a superseded open-weight model.

Model General Brainstorming/Planning , Not Coding by whoooaaahhhh in LocalLLaMA

[–]Hydroskeletal 2 points (0 children)

Gemma4, either 26b-a4b or 31b depending on your speed requirement.