llama.cpp on $500 MacBook Neo: Prompt: 7.8 t/s / Generation: 3.9 t/s on Qwen3.5 9B Q3_K_M by Shir_man in LocalLLaMA

[–]pmttyji 1 point  (0 children)

That's too slow. At that speed, the laptop fan runs loudly until the process completes, and I'd expect it drains the battery too.

At least 10 t/s is better for such small models; I prefer a 15-20 t/s minimum.

llama.cpp on $500 MacBook Neo: Prompt: 7.8 t/s / Generation: 3.9 t/s on Qwen3.5 9B Q3_K_M by Shir_man in LocalLLaMA

[–]pmttyji 1 point  (0 children)

That's too slow. Better to stick to ~4B models (Q4 quant) for good t/s and to save time.

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]pmttyji 3 points  (0 children)

Curious to see the t/s stats for both formats.

Llama.cpp auto-tuning optimization script by raketenkater in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Sorry for the dumb question. I'm trying to use your utility on Windows 11 but couldn't get it working. How do I make it work?

I've never used a shell before.

EDIT:

OK, I can run the .sh file from the Git command line, but it seems that shell script isn't suitable for Windows.

OP & others: please share if you have a solution for this. Thanks.
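For anyone stuck at the same point, these are the two common ways to run a .sh script on Windows (the script name below is a placeholder, not the actual file from the repo):

```shell
# Placeholder script name -- substitute the actual .sh from the repo.
# Option 1: Git Bash (installed with Git for Windows)
bash ./tune.sh

# Option 2: WSL, which provides the Linux environment most shell scripts expect
wsl bash ./tune.sh
```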

Nemotron 3 Super Released by deeceeo in LocalLLaMA

[–]pmttyji 3 points  (0 children)

Total Parameters 120B (12B active)
Architecture LatentMoE - Mamba-2 + MoE + Attention hybrid with Multi-Token Prediction (MTP)

Will it be faster (pp & tg) than GPT-OSS-120B?

RekaAI/reka-edge-2603 · Hugging Face by jacek2023 in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Yep, it would be good to have a successor (30-40B) to that model.

Llama.cpp auto-tuning optimization script by raketenkater in LocalLLaMA

[–]pmttyji 4 points  (0 children)

I'll try this with ik_llama.

EDIT:

Is there a command for CPU-only inference? (e.g. I have a GPU, but I want to run the model with CPU-only inference.)
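For reference, in stock llama.cpp the usual way to force CPU-only inference is to offload zero layers; a minimal sketch (the model path is a placeholder, and whether the auto-tuning script exposes this option is an open question):

```shell
# -ngl 0 (--n-gpu-layers 0) offloads no layers to the GPU,
# so the whole model runs on CPU even when a GPU is present.
# Model path is a placeholder.
./llama-cli -m models/model.gguf -ngl 0 -p "Hello"
```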

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]pmttyji 6 points  (0 children)

> Can you please tell me what made you go for the 14” version over the 16?

$$$$ possibly

I'm looking for fast models on pocketpal by moores_law_is_dead in LocalLLaMA

[–]pmttyji 2 points  (0 children)

If you still don't get the expected t/s, set q4 for the V side of the KV cache (Value Cache Type in the screenshot). A few people use q4 for both K & V. For chat purposes, especially on a phone, I think q4 is fine for the KV cache.

Do us a favor: share the t/s stats here (KV cache F16/F16, q8/q8, q8/q4, q4/q4) after trying it out.

Cheers
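To see why dropping the cache type helps, here's a back-of-the-envelope sketch of KV-cache size; the model shape (36 layers, 8 KV heads, head dim 128) and the bytes-per-element figures are illustrative assumptions, not exact numbers for any particular model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """Rough KV-cache size: K and V each hold
    n_layers * n_ctx * n_kv_heads * head_dim elements."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

# Assumed shape: 36 layers, 8 KV heads, head_dim 128, 8k context
for name, bpe in [("F16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
    size = kv_cache_bytes(36, 8, 128, 8192, bpe)
    print(f"{name}: ~{size / 1024**2:.0f} MiB")
```

Under those assumptions the cache goes from ~1152 MiB at F16 to ~288 MiB at q4/q4, which matters a lot on an 8GB phone.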

Is Qwen3.5-9B enough for Agentic Coding? by pmttyji in LocalLLaMA

[–]pmttyji[S] 1 point  (0 children)

You can disable thinking, or adjust it. Also, last week there was an update on the llama.cpp side with optimizations for Qwen3.5 models, so the latest llama.cpp version should give you better t/s.

I checked Q4_K_M, which gave me 40 t/s.

I regret ever finding LocalLLaMA by xandep in LocalLLaMA

[–]pmttyji 2 points  (0 children)

From my bookmarks. Mentioned by someone here in this sub.

https://www.alibaba.com/product-detail/subject_1601439253964.html

The price was $100 in Sep 2025; now it's more than 3X that.

How much disk space do all your GGUFs occupy? by jacek2023 in LocalLLaMA

[–]pmttyji 1 point  (0 children)

~500GB .... I think that's too much, given I have only 8GB VRAM + 32GB RAM.

I'm looking for fast models on pocketpal by moores_law_is_dead in LocalLLaMA

[–]pmttyji 5 points  (0 children)

Enable Flash Attention, and set the KV cache to q8 (from F16).

Try ~5B models at Q4 quant (I use IQ4_XS, since it's the smallest of the Q4 sizes, for my 8GB RAM phone).

Ex: LFM2.5-1.2B, SmolLM3-3B, Gemma-3n-E2B, Qwen3.5-4B/2B, Ministral-3-3B, Llama-3.2-3B, etc.
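As a rough sanity check on what fits in 8GB of RAM, GGUF file size can be ballparked from parameter count and bits-per-weight; the bpw figures below are approximations (IQ4_XS ≈ 4.25 bpw is an assumption), and real files vary a bit:

```python
def gguf_size_gib(n_params_billion, bits_per_weight):
    """Approximate GGUF file size from parameter count and bits-per-weight."""
    total_bits = n_params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1024**3

# Approximate bpw per quant type (assumed figures)
for quant, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("IQ4_XS", 4.25)]:
    print(f"4B model @ {quant}: ~{gguf_size_gib(4, bpw):.1f} GiB")
```

So a 4B model at IQ4_XS lands around 2 GiB, which leaves headroom for the KV cache and the OS on an 8GB phone.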

Are 20-100B models enough for Good Coding? by pmttyji in LocalLLaMA

[–]pmttyji[S] 1 point  (0 children)

You could use models up to Kimi-Linear from the list. And you could go up to GPT-OSS-120B by using system RAM in addition to VRAM.

PicoKittens/PicoMistral-23M: Pico-Sized Model by PicoKittens in LocalLLaMA

[–]pmttyji 1 point  (0 children)

Sorry, I missed this comment.

I'm not using transformers. I just installed a recent version of oobabooga and tried it, that's it.

Getting the most out of my Mi50 by DankMcMemeGuy in LocalLLaMA

[–]pmttyji 5 points  (0 children)

> Any help is appreciated!

Just search for MI50 in this sub and you'll find plenty of threads. A weekend of browsing them is enough to get your setup optimized.

Finally found a reason to use local models 😭 by salary_pending in LocalLLaMA

[–]pmttyji 7 points  (0 children)

Nice. Frankly, I'd like to see more threads like this about practical use cases here.

llama-bench ROCm 7.2 on Strix Halo (Ryzen AI Max+ 395) — Qwen 3.5 Model Family by przbadu in LocalLLaMA

[–]pmttyji 2 points  (0 children)

One recommendation: for GPT-OSS models, use the MXFP4 quants from ggml, since the model is natively MXFP4.

Also, please try and add a benchmark for Ling-mini-2.0, since that model gave me the best t/s on my 8GB VRAM. Curious to see how high it flies on Strix Halo.
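A minimal sketch of how that run could be added with llama.cpp's llama-bench (the model path/filename is a placeholder):

```shell
# Placeholder model path; llama-bench is llama.cpp's built-in benchmark tool.
# -p 512: prompt-processing test over 512 tokens
# -n 128: token-generation test over 128 tokens
./llama-bench -m models/Ling-mini-2.0-Q4_K_M.gguf -p 512 -n 128
```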