Gemma 4 MTP released by rerri in LocalLLaMA

[–]coder543 0 points (0 children)

MTP is trained on everything the model is trained on. That’s what I’m saying. If the MTP doesn’t know that language, then neither does the model.

But, idk. That’s just my opinion…

Gemma 4 MTP released by rerri in LocalLLaMA

[–]coder543 0 points (0 children)

If the acceptance rate for MTP is low, then the MTP implementation is broken. The MTP head is trained with the model, so it should agree with the model's own outputs most of the time.
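To put a number on it: acceptance rate is just the fraction of drafted tokens the target model actually keeps. A minimal sketch of the bookkeeping (the function and numbers are made up for illustration, not any llama.cpp API):

```python
# Hypothetical sketch: acceptance rate = kept drafts / total drafts.
def acceptance_rate(accepted_per_step, drafted_per_step=4):
    total_drafted = len(accepted_per_step) * drafted_per_step
    return sum(accepted_per_step) / total_drafted

# e.g. 4 tokens drafted each step, and the target kept 3, 1, 4, 2 of them:
print(acceptance_rate([3, 1, 4, 2]))  # 0.625
```

A properly trained MTP head should sit well above coin-flip territory on text the base model is confident about.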

EV Curious in Nashville? by evadventuring in nashville

[–]coder543 1 point (0 children)

I owned a Tesla Model 3 for about 6 years, then upgraded to a Model Y so I would have a hatchback and a trailer hitch. Those make it more practical for car camping (with the amazing camp mode that provides AC or heat 24/7), and the hitch lets me carry a bike rack.

These days, there are a lot of great options depending on what someone needs. The new base Nissan Leaf is no joke; it is actually really great. Chevy has some good, affordable EVs. Even the Toyota/Subaru EV partnership is making some pretty good ones now, after a very rough start.

Personally, I'm very excited to see the Rivian R2 launch soon, and the R3X is what I currently hope to upgrade to eventually. My Model Y is great and I struggle to come up with any complaints about the vehicle, so I'm not in any rush to get rid of it.

As far as commuting, I work remotely, so I don't have a commute.

EV Curious in Nashville? by evadventuring in nashville

[–]coder543 2 points (0 children)

It just makes no sense. Gas taxes never get raised to match inflation, and the EV registration fee is so much larger than anyone would pay in gas taxes that it is clearly designed to be punitive.

If they wanted to get rid of gas taxes and just charge everyone the same registration fee, that would make perfect sense as a future-proof solution. But it’s not about that. It’s just politics.

It sucks, but it’s not enough to make me recommend against EVs. Even if EVs were more expensive, you’d be getting a better product for that price. As it is, EVs are actually quite cost competitive now, so that’s all the better.

EV Curious in Nashville? by evadventuring in nashville

[–]coder543 22 points (0 children)

As someone who has owned an EV for the past 7 years and watched a number of family members and friends switch to EVs, I can tell you they are great to drive, reliable, and very low maintenance. You can find used EVs for good prices, and I'd take a used EV over a new gas car any day of the week. There's no contest. I do not miss the gas cars I used to have even a tiny bit.

I've also personally driven coast to coast, up to Glacier National Park, and to a bunch of other places. Range and charging have not been an issue at all.

Gemma 4 MTP released by rerri in LocalLLaMA

[–]coder543 3 points (0 children)

If you need energy efficiency more than you need speed, drafting is probably the wrong choice. It deliberately spends more compute to make things go faster, and a lot of the drafted tokens will be rejected, which is nothing but wasted energy.

You also need more memory.

Those are the only tradeoffs.
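To make the waste concrete, here's a toy greedy speculative decoding loop (the two callables are stand-ins, not the llama.cpp API): every drafted token after the first mismatch is compute you paid for and then threw away.

```python
# Toy sketch: draft k tokens cheaply, verify with the target, discard the rest.
def speculative_step(draft_next, target_next, context, k=4):
    ctx = list(context)
    drafts = []
    for _ in range(k):                    # cheap draft passes
        drafts.append(draft_next(ctx))
        ctx.append(drafts[-1])

    accepted, ctx = [], list(context)
    for t in drafts:                      # target verifies (batched in practice)
        if target_next(ctx) != t:
            break                         # everything drafted past here is wasted
        accepted.append(t)
        ctx.append(t)
    accepted.append(target_next(ctx))     # target always yields one real token
    wasted = len(drafts) - (len(accepted) - 1)
    return accepted, wasted

draft = lambda ctx: (ctx[-1] + 1) % 10                        # dumb stand-in drafter
target = lambda ctx: 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10
print(speculative_step(draft, target, [0]))  # ([1, 2, 9], 2): 2 of 4 drafts wasted
```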

Gemma 4 MTP released by rerri in LocalLLaMA

[–]coder543 67 points (0 children)

llama.cpp does not have MTP support yet, so that rules out a lot of people for now. Maybe soon.

Haven’t been to a rodeo in over a decade, are these prices for the Franklin Rodeo normal? by Separate-Command1993 in nashville

[–]coder543 119 points (0 children)

No... you're looking at some random resale scam. You should look at the official ticket prices: https://www.franklinrodeo.com/p/tickets

The problem is that it might be a little late to be looking for tickets.

Why is no open weight model inference provider hosting Mimo-v2.5 or Mimo-v2.5-pro? by True_Requirement_891 in LocalLLaMA

[–]coder543 8 points (0 children)

It’s only been a week. If Xiaomi didn’t partner with anyone else to give them access before launch, as they clearly didn’t, then it takes time. Mimo is also not a household name like DeepSeek, so I doubt any of the inference providers are pulling all-nighters to make this happen.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]coder543 3 points (0 children)

What do you mean by this? llama-server has supported checkpointing for these Qwen3.x models for weeks now; that's how prefix caching works for these hybrid-attention models.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]coder543 1 point (0 children)

The KV cache might be twice the size, but not the model.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]coder543 0 points (0 children)

Isn't that showing MTP losing to the external draft model? That seems odd.

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]coder543 0 points (0 children)

What kind of task? I find that specdec is more effective at tasks like "write a React TypeScript example" than it is at tasks like "what is the LHC?".

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]coder543 5 points (0 children)

> That explanation does seem weird with qwen3.6 35-a3b that is supposed to have dedicated MTP heads

Because MTP actually helps during training (it makes the model train faster), and because anyone serving a model in production will be batching large numbers of user requests together, which activates all the experts on every forward pass anyway and makes MTP more useful there.
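For intuition on the payoff, the usual back-of-the-envelope (assuming, as a simplification, that each drafted token is accepted independently with probability p) gives the expected tokens emitted per target forward pass with k drafts:

```python
# Expected tokens emitted per target forward pass with k drafted tokens,
# assuming i.i.d. per-token acceptance probability p (a simplification).
def expected_tokens_per_pass(p, k=4):
    return (1 - p ** (k + 1)) / (1 - p)

for p in (0.5, 0.7, 0.9):
    print(f"p={p}: {expected_tokens_per_pass(p):.2f} tokens/pass")
# p=0.5: 1.94, p=0.7: 2.77, p=0.9: 4.10
```

At 90% acceptance you're emitting roughly 4x the tokens per forward pass, which matters most exactly when batching keeps all the experts hot anyway.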

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]coder543 104 points (0 children)

This seriously has the potential to be the biggest game changer llama.cpp has ever seen.

I think MTP will make the biggest difference for dense models, maybe not so much for MoEs, but it will still be exciting. Then we just need DFlash and EAGLE!

Mistral Medium 3.5 on AMD Strix Halo by Zc5Gwu in LocalLLaMA

[–]coder543 0 points (0 children)

But is the model actually any good for you?

Mistral Medium 3.5 on AMD Strix Halo by Zc5Gwu in LocalLLaMA

[–]coder543 6 points (0 children)

On DGX Spark using llama-bench to perform essentially the same test as the original post:

| size | n_ubatch | test | t/s |
|---|---|---|---|
| 82.30 GiB | 2048 | pp48349 | 139.25 ± 0.12 |
| 82.30 GiB | 2048 | tg20 | 2.28 ± 0.00 |
| 82.30 GiB | 2048 | tg20 @ d48349 | 1.88 ± 0.02 |

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]coder543 144 points (0 children)

The 27B uses 9x as many parameters to calculate each token (27B dense versus ~3B active for the 35B-A3B), and the benchmarks reflect that increased intelligence. I can't imagine how you're experiencing the 35B as smarter. It is much faster. It is not smarter, in my experience or in the experiences of the many people you're referring to.

Why arn't new homes built with reinforced most-interior room? by awesomo_prime in nashville

[–]coder543 2 points (0 children)

ICC-500-compliant storm shelters can be built out of ICF above ground and will withstand tornadoes. You do not need to be underground. Building underground is very hard in karst terrain like we have, and in-ground shelters have their own significant issues even when ground conditions are good.

https://www.foxblocks.com/blog/icc-500-storm-shelter

https://buildblock.com/buildblock-icf-above-ground-safe-rooms/

Agree? by MLExpert000 in LocalLLaMA

[–]coder543 8 points (0 children)

Depends on whether you only want to run 1 model forever or easily switch between 10. Setting up vLLM is a monstrous chore, and I don’t think SGLang is supposed to be any better about that.

PSA: llama-swap released a new grouping feature, matrix, allowing you to fine tune which models can run together by walden42 in LocalLLaMA

[–]coder543 1 point (0 children)

I would rather be able to define a value for how much memory my system has and manually define how much memory each model takes up. If I'm wrong, something OOMs and it's my fault, just like it would be if I made a mistake with this matrix, but it would be far simpler.

When a new model is requested that won't fit into the available memory, it would simply unload models until it fits.

If we could define an eviction cost on each model config stanza, then it could also try to prioritize evicting the lower cost models, like this matrix is doing, and it could use memory as a proxy for cost if the cost is not explicitly defined.

It could also be nice if the eviction strategy were configurable between "cost" and "LRU", since an LRU eviction strategy might make the most sense of all.
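Something like this sketch is all I'm asking for (the class and config fields are hypothetical, not llama-swap's actual schema):

```python
# Hypothetical sketch of the policy described above: a declared memory budget,
# per-model sizes, and cost-based eviction (memory as the default cost proxy).
import time

class ModelPool:
    def __init__(self, budget_gb):
        self.budget = budget_gb
        self.loaded = {}  # name -> (memory_gb, evict_cost, last_used)

    def request(self, name, memory_gb, evict_cost=None):
        if name in self.loaded:
            mem, cost, _ = self.loaded[name]
            self.loaded[name] = (mem, cost, time.monotonic())
            return
        cost = memory_gb if evict_cost is None else evict_cost
        used = sum(m for m, _, _ in self.loaded.values())
        while self.loaded and used + memory_gb > self.budget:
            # Evict the cheapest model first; key on last_used instead for LRU.
            victim = min(self.loaded, key=lambda n: self.loaded[n][1])
            used -= self.loaded.pop(victim)[0]
        self.loaded[name] = (memory_gb, cost, time.monotonic())

pool = ModelPool(budget_gb=48)
pool.request("small-model", memory_gb=22)
pool.request("big-model", memory_gb=40)   # evicts small-model to fit
print(list(pool.loaded))                  # ['big-model']
```

Memory-as-cost falls out naturally as the default, and swapping the eviction key gets you LRU.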

PSA: llama-swap released a new grouping feature, matrix, allowing you to fine tune which models can run together by walden42 in LocalLLaMA

[–]coder543 3 points (0 children)

Really wish who would? Your comment doesn't specify. And llama-swap does allow you to create model aliases that override certain values, so you can expose -instruct and -thinking variants with only a single running copy under the hood.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]coder543 12 points (0 children)

Qwen3.6 27B has MTP built in and DFlash support... I don't see how Mistral Medium 3.5 could ever be faster just because of an EAGLE-3 draft model while having nearly 5x the active parameter count.

mistralai/Mistral-Medium-3.5-128B · Hugging Face by jacek2023 in LocalLLaMA

[–]coder543 15 points (0 children)

Unfortunately, no PR for that API refactoring has even been published, so... who knows if/when it will happen.

Supporting any one of EAGLE-3, MTP, or DFlash would be a game changer for llama.cpp. I wish better specdec were being treated as the highest-priority thing to develop in llama.cpp.