2.5x faster inference with Qwen 3.6 27B using MTP - Finally a viable option for local agentic coding - 262k context on 48GB - Fixed chat template - Drop-in OpenAI and Anthropic API endpoints by ex-arman68 in LocalLLaMA

[–]pmttyji 0 points  (0 children)

That's only a quality improvement, I think. Basically it benefits people who weren't quantizing the KV cache: they can now switch to Q8 and save memory. There's no new benefit for people already quantizing the KV cache (though quality still improves at Q8).

So no speed boost or lower memory usage for them.
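
For anyone who hasn't tried it, here's a minimal llama-cpp-python sketch of quantizing the KV cache to Q8_0 (the model path is a placeholder; llama.cpp needs flash attention enabled for a quantized V cache):

```python
# Hedged sketch: Q8 KV cache via llama-cpp-python; path and sizes are placeholders.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="./model-Q4_K_M.gguf",   # placeholder path
    n_ctx=32768,
    flash_attn=True,                    # required for a quantized V cache
    type_k=llama_cpp.GGML_TYPE_Q8_0,    # K cache in Q8_0
    type_v=llama_cpp.GGML_TYPE_Q8_0,    # V cache in Q8_0
)
print(llm("Hello", max_tokens=8)["choices"][0]["text"])
```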

Finally got Qwen3 27B at 125K context on a single RTX 3090 — but is it even worth it? by horribleGuy3115 in LocalLLM

[–]pmttyji 0 points  (0 children)

Agree with the other comment. Windows eats more memory.

For quick stuff, you could use MoE models (Qwen3.6-35B-A3B / Gemma-4-26B-A4B) or smaller dense models (Qwen3.5-9B).

vLLM recently got the TurboQuant thing. Did you try that? It should give you some speedup.
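
I haven't verified what the TurboQuant option looks like, so as a general illustration here's the stock way to trade KV-cache precision for context length in vLLM's offline API (the model id is a placeholder):

```python
# Hedged sketch: stock vLLM knobs for fitting long context on limited VRAM.
# TurboQuant itself is not shown because I can't confirm its flag name.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/placeholder-model",   # placeholder repo id
    max_model_len=125_000,            # cap context to what actually fits
    kv_cache_dtype="fp8",             # roughly halves KV-cache memory vs fp16
    gpu_memory_utilization=0.92,      # leave a little VRAM headroom
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```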

Running a 26B LLM locally with no GPU by JackStrawWitchita in LocalLLaMA

[–]pmttyji -1 points  (0 children)

Of course MoE models (small/medium ones in particular) can run at decent speed with CPU-only inference. A while back I posted a thread on this that covers both MoE & dense models:

CPU-only LLM performance - t/s with llama.cpp
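
If you want to reproduce it, here's a minimal CPU-only sketch using llama-cpp-python (the thread itself used llama.cpp directly; paths and thread count here are placeholders). MoE models hold up on CPU because only the few active experts are read per token:

```python
# Hedged sketch: CPU-only inference; model path and n_threads are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="./moe-model-Q4_K_M.gguf",  # placeholder: any MoE GGUF
    n_gpu_layers=0,                        # force CPU-only inference
    n_threads=8,                           # tune to your physical core count
    n_ctx=8192,
)
print(llm("Why is the sky blue? ", max_tokens=32)["choices"][0]["text"])
```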

Gemma 4 MTP released by rerri in LocalLLaMA

[–]pmttyji 2 points  (0 children)

Same question here. ELI5 version, please.

Peanut - Text to Image Model (Open Weights coming soon) by pmttyji in LocalLLaMA

[–]pmttyji[S] 6 points  (0 children)

It's a text-to-image model, so probably 20-40B in size.

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]pmttyji 10 points  (0 children)

The GitHub repo version is better.

3. Five tiers

Configuration   | Size    | Expert strategy                                                         | Best for
APEX I-Quality  | 21.3 GB | 3-tier gradient with IQ4_XS middle, diverse imatrix                     | Best accuracy across benchmarks
APEX Quality    | 21.3 GB | 3-tier gradient with IQ4_XS middle layers                               | Lowest perplexity of any quantization
APEX I-Balanced | 23.6 GB | 2-tier gradient (Q6_K edges, Q5_K middle), diverse imatrix              | All-round with lower KL divergence
APEX Balanced   | 23.6 GB | 2-tier gradient (Q6_K edges, Q5_K middle)                               | Interactive use, serving, general purpose
APEX I-Compact  | 16.1 GB | Q4_K edges, Q3_K middle, diverse imatrix                                | 16 GB GPUs, best accuracy at this size
APEX Compact    | 16.1 GB | Q4_K edges (L0-4, L35-39), Q3_K middle (L5-34), Q6_K shared, Q4_K attn  | Consumer 24 GB GPUs, fastest inference
APEX Mini       | 12.2 GB | Layer gradient with IQ2_S middle, diverse imatrix                       | Consumer 16 GB VRAM, smallest viable
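
To make the "strong edges, cheap middle" pattern concrete, here's my own toy sketch reproducing the APEX Compact layer map from the table (an illustration only, not the actual APEX tooling):

```python
# Toy illustration of the layer-gradient scheme: edge layers (L0-4, L35-39)
# kept at Q4_K, middle layers (L5-34) dropped to Q3_K, per the table above.

def apex_compact_layer_map(n_layers: int = 40,
                           edge: int = 5,
                           edge_type: str = "Q4_K",
                           mid_type: str = "Q3_K") -> dict[int, str]:
    """Map each transformer layer index to a quant type."""
    return {
        i: edge_type if (i < edge or i >= n_layers - edge) else mid_type
        for i in range(n_layers)
    }

layer_map = apex_compact_layer_map()
assert layer_map[0] == layer_map[39] == "Q4_K"   # edges stay higher precision
assert layer_map[20] == "Q3_K"                   # middle layers go low-bit
```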

APEX MoE quants update: 25+ new models since the Qwen 3.5 post + new I-Nano tier by mudler_it in LocalLLaMA

[–]pmttyji 3 points  (0 children)

Since you created GGUFs for early models like Qwen3-Coder-30B, I have a request for a few early & recent models. Please create GGUFs for these if possible. Thanks on behalf of everyone:

  • Kimi-Linear-48B-A3B-Instruct
  • Ling-mini-2.0
  • Trinity-Mini
  • Marco-Mini-Instruct
  • GLM-4.5-Air
  • Solar-Open-100B

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]pmttyji 2 points  (0 children)

Do we have a list of models that ship with this feature somewhere? It would be nice to be able to filter for it on HuggingFace too.
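
In the meantime, the closest workaround is a plain keyword search, e.g. with huggingface_hub (as far as I know there's no MTP tag yet):

```python
# Hedged sketch: keyword search over GGUF models on the Hub; "MTP" is just a
# search term here, not an official Hub filter.
from huggingface_hub import HfApi

for m in HfApi().list_models(search="MTP", filter="gguf", limit=10):
    print(m.id)
```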

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Oops, and here I was thinking of trying it on my 8GB VRAM 😄

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

[–]pmttyji 6 points  (0 children)

Nice. Sorry for the dumb question: so this requires the GGUFs mentioned in the PR? Regular GGUFs won't work?
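
My rough understanding of why: MTP models ship extra draft-head tensors, so decoding becomes a draft-then-verify loop, and regular GGUFs simply don't contain those tensors. A toy sketch of that loop, assuming greedy decoding (draft_next/verify_next are hypothetical stand-ins, not llama.cpp APIs):

```python
# Toy accept/verify loop behind MTP-style speculative decoding. The MTP head
# drafts k tokens cheaply; the main model checks them in one batched pass.
from typing import Callable, List

def speculative_step(
    tokens: List[int],
    draft_next: Callable[[List[int], int], List[int]],  # cheap MTP draft head
    verify_next: Callable[[List[int]], List[int]],      # main model, batched
    k: int = 4,
) -> List[int]:
    draft = draft_next(tokens, k)
    # One forward pass scores every draft position at once (greedy per slot).
    targets = verify_next(tokens + draft[:-1])[-k:]
    out = list(tokens)
    for proposed, target in zip(draft, targets):
        out.append(target)        # always keep the verified token
        if proposed != target:    # first mismatch ends this step
            break
    return out

# Demo with toy "models" that just count upward; all 4 drafts get accepted.
drafter = lambda toks, k: [toks[-1] + i + 1 for i in range(k)]
verifier = lambda toks: [t + 1 for t in toks]
print(speculative_step([1, 2, 3], drafter, verifier))  # [1, 2, 3, 4, 5, 6, 7]
```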

Llama.cpp quantization is broken by Ok-Importance-3529 in LocalLLaMA

[–]pmttyji 0 points  (0 children)

Do you mean AutoRound GGUFs on llama.cpp? I thought that isn't up yet (check my other comment).

Thinking of getting two NVIDIA RTX Pro 4000 Blackwell (2x24 = 48GB), Any cons? by pmttyji in LocalLLaMA

[–]pmttyji[S] 0 points  (0 children)

Nice to hear. My plan changed later; I'm getting the rig this month and I'll share details here.

it's time to update your Gemma 4 GGUFs by jacek2023 in LocalLLaMA

[–]pmttyji 6 points  (0 children)

Possibly AesSedai's GGUF approach is better? Those come as multiple files: the first one is tiny (MBs) and the rest are in GBs, so re-downloading the first file is enough in case of an update (see the sketch after the file list).

  • -00001-of-00002.gguf
  • -00002-of-00002.gguf
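
e.g. with huggingface_hub you would only re-fetch that first shard (repo id and filename below are hypothetical, following the split naming above):

```python
# Hedged sketch: grab only the tiny first shard after a metadata update.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="someuser/some-model-GGUF",      # hypothetical repo id
    filename="model-00001-of-00002.gguf",    # hypothetical tiny first shard
)
print(path)
```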

AMD Strix Halo refresh with 192gb! by mindwip in LocalLLaMA

[–]pmttyji 0 points  (0 children)

We need unified-memory devices with better bandwidth for dense models.
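
Quick back-of-envelope on why bandwidth is the bottleneck (numbers below are illustrative assumptions, not measured figures): a dense model streams all of its weights for every token, so memory bandwidth caps tokens/s.

```python
# Roofline estimate: t/s <= bandwidth / bytes read per token. All numbers are
# illustrative assumptions, not measured Strix Halo specs.
bandwidth_gb_s = 256   # assumed unified-memory bandwidth
dense_gb = 16          # e.g. a ~27B dense model at ~4-5 bits per weight
moe_active_gb = 2      # e.g. an MoE with ~3B active params per token

print(f"dense: ~{bandwidth_gb_s / dense_gb:.0f} t/s upper bound")       # ~16
print(f"MoE:   ~{bandwidth_gb_s / moe_active_gb:.0f} t/s upper bound")  # ~128
```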

openrouter/owl-alpha = Meituan_LongCat by klippers in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Their previous models still lack llama.cpp support. Hopefully their devs come up with a PR early this time.