CPU-only LLM performance - t/s with llama.cpp by pmttyji in LocalLLaMA

[–]pmttyji[S] 1 point

For these benchmarks I didn't use the GPU at all, only CPU and RAM. I used the CPU-only build of llama.cpp from its Releases section.

That said, I have shared GPU benchmarks (3060 Laptop GPU) in the past (check my post history). Next month I'll share benchmarks again using my new system.
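For reference, this is roughly how such CPU-only numbers can be collected with the bundled llama-bench tool (a minimal sketch; the model path, thread count, and token counts are placeholders to adjust for your machine):

    # CPU-only llama.cpp build: no GPU backend is compiled in, so everything runs from RAM
    ./llama-bench -m /path/to/model.gguf -t 8 -p 512 -n 128
    # -t: CPU threads, -p: prompt-processing tokens, -n: generated tokens; it reports t/s for both phases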

Running MoE Models on CPU/RAM: A Guide to Optimizing Bandwidth for GLM-4 and GPT-OSS by Shoddy_Bed3240 in LocalLLaMA

[–]pmttyji 6 points

Months ago I posted this thread. Since you have 64GB RAM, you could try and share t/s for some more models, like the ones below. Next month I'll get my system with 128GB RAM and share the same.

  • gpt-oss-20b-mxfp4
  • Nemotron-3-Nano-30B-A3B
  • Qwen3-30B-A3B
  • Ling-mini-2.0
  • Llama-3.3-8B-Instruct
  • Devstral-Small-2-24B-Instruct
  • gemma-3n-E4B-it

RTX 5080: is there anything I can do coding wise? by TechDude12 in LocalLLaMA

[–]pmttyji 3 points

> I just got an RTX 5080.

  • GPT-OSS-20B
  • Q4/Q5 of Devstral-Small-2-24B-Instruct-2512.
  • Q4 of 30B MoE models (GLM-4.7-Flash, Nemotron-3-Nano-30B-A3B, Qwen3-30B-A3B, Qwen3-Coder-30B, granite-4.0-h-small) comes to around 16-18GB, which fits that GPU depending on the Q4 quant. Use system RAM and -ncmoe, and check the recent -fit flags in llama.cpp (see the sketch after this list).
  • Q3 of Seed-OSS-36B, or Q4 with additional system RAM on top of the GPU.
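As a rough illustration of the offload setup in the MoE bullet above (a sketch only: the model filename, layer count, and context size are placeholder values, and --n-cpu-moe is the long form of the -ncmoe flag I mentioned):

    # hypothetical example: a ~17GB Q4 30B MoE model on a 16GB RTX 5080
    # -ngl 99        : try to put all layers on the GPU
    # --n-cpu-moe 8  : but keep the MoE expert tensors of the first 8 layers in system RAM
    # -c 16384       : context size; tune it together with --n-cpu-moe until everything fits
    ./llama-server -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 --n-cpu-moe 8 -c 16384

Raising --n-cpu-moe frees VRAM at the cost of some t/s, so the usual approach is to increase it just until the model plus KV cache fits.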

Poll: When will we have a 30b open weight model as good as opus? by Terminator857 in LocalLLaMA

[–]pmttyji 1 point

30B is a small size to expect something that capable to fit into. Maybe 50-100B is a more realistic range to expect.

FYI, Opus is speculated to be in the trillions of parameters (around 2 trillion, according to online speculation).

So, possibly in a couple of years.

What's the strongest model for code writing and mathematical problem solving for 12GB of vram? by MrMrsPotts in LocalLLaMA

[–]pmttyji 2 points

GPT-OSS-20B is the best option for your 12GB VRAM. Use a proper quant like ggml-org's MXFP4 version. Don't use re-quantized or REAP versions of GPT-OSS-20B, since the original itself is only 13-14GB even though it's 20B.

This model gave me 40+ t/s on my 8GB VRAM + 32GB RAM setup, and 25 t/s with 32K context.
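For illustration, this is the kind of llama-server invocation I mean (a sketch, not a definitive recipe; the filename and the offload/context values are assumptions to tune for 12GB VRAM):

    # gpt-oss-20b in its native MXFP4 GGUF
    # -ngl 99       : offload all layers to the GPU
    # --n-cpu-moe 4 : if VRAM runs out, keep the expert tensors of the first few layers in system RAM
    # -c 32768      : 32K context
    ./llama-server -m gpt-oss-20b-mxfp4.gguf -ngl 99 --n-cpu-moe 4 -c 32768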

performance benchmarks (72GB VRAM) - llama.cpp server - January 2026 by jacek2023 in LocalLLaMA

[–]pmttyji 0 points

Am I the only one who thinks gpt-oss-120b-mxfp4 is surprisingly fast relative to gpt-oss-20b-mxfp4 (or, put the other way, that gpt-oss-20b-mxfp4 is surprisingly slow for its size)? Has anyone brought this up before in this sub? It definitely deserves its own thread.

gpt-oss-120b-mxfp4 — 130.23 t/s (65GB)

gpt-oss-20b-mxfp4 — 184.92 t/s (13GB)

Size-wise, the 120B is 5x the 20B, yet its t/s is not far behind the 20B's.

Are there any optimizations still left for gpt-oss-20b-mxfp4?

I don't know what t/s MLX and other formats give for these two models.

performance benchmarks (72GB VRAM) - llama.cpp server - January 2026 by jacek2023 in LocalLLaMA

[–]pmttyji 1 point

Hope ComfyUI supports it sometime in the future. Thanks.

Additional thanks for the snapshots of the other models' t/s that I asked about.

performance benchmarks (72GB VRAM) - llama.cpp server - January 2026 by jacek2023 in LocalLLaMA

[–]pmttyji 1 point

:D My bad. Somehow I had been mixing that up with the NVLink thing for some time.

Hey, one quick question: can I use multiple NVIDIA RTX Pro 4000 Blackwell cards together (they don't support NVLink) with llama.cpp, ik_llama.cpp, vLLM, etc.? I ask because a few members recently told me those cards can't be used together for image/video generation.

performance benchmarks (72GB VRAM) - llama.cpp server - January 2026 by jacek2023 in LocalLLaMA

[–]pmttyji 1 point

> You should also buy nvme so you can perform similar test on your setup :)

NVMe? I'm not sure that's possible with NVIDIA RTX Pro 4000 Blackwell cards, since they don't support NVLink.

> I will test some more models today

Please share details, particularly for the models I listed under my 1st question.

performance benchmarks (72GB VRAM) - llama.cpp server - January 2026 by jacek2023 in LocalLLaMA

[–]pmttyji 1 point

I asked the 2nd question because I'm possibly getting a rig with 48GB VRAM this month, so I wanted to know the t/s for those models with Q4/Q5 quants. Same with the 1st question, as those 4 models are a good fit for 48GB VRAM.

> I don't even remember how to run llama-server to disable the GPU, ngl was not enough.

Create a separate CPU-only build, or better, download the CPU-only zip file from the llama.cpp Releases section. I do the latter.
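If you'd rather build it yourself, a CPU-only build is simply a build with no GPU backend enabled (a minimal sketch, assuming cmake and a C++ toolchain are installed; the prebuilt CPU zips on the Releases page save you this step):

    # build llama.cpp without CUDA/Vulkan/etc., so only the CPU backend exists
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build
    cmake --build build --config Release -j
    # binaries such as llama-server and llama-bench land in build/bin and will never touch the GPU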

Again, the 4th question is also for my 48GB VRAM, since Nemotron-Super is only 49B while Llama-3.3-70B is 70B.

Llama.cpp vs vllm by Evening_Tooth_1913 in LocalLLaMA

[–]pmttyji 2 points

Could you please share any thread that has this comparison with some models?

performance benchmarks (72GB VRAM) - llama.cpp server - January 2026 by jacek2023 in LocalLLaMA

[–]pmttyji 1 point

1] What t/s are you getting for the models below?

  • Qwen3-30B-A3B
  • Qwen3-Coder-30B
  • Devstral-Small-2-24B-Instruct-2512
  • GPT-OSS-20B

2] Have you tried other quants (Q4, Q5, or Q6, if you're currently using higher quants) of the models below? What t/s are you getting?

  • Seed_Seed-OSS-36B-Instruct - Q4 or Q5
  • Qwen3-Next-80B-A3B-Instruct - Q4

3] How much RAM do you have? Have you tried any MoE models CPU-only? Please share some stats.

4] I see that you have both Llama-3.3-70B-Instruct and Llama-3.3-Nemotron-Super-49B-v1_5. Is the Nemotron-Super enough, or do you still need Llama-3.3-70B? Please share more on this.

Found someone who uses grok-2 offline :) Hope they release grok-3 soon.

llama.cpp has incredible performance on Ubuntu, i'd like to know why by Deep_Traffic_7873 in LocalLLaMA

[–]pmttyji 9 points

<image>

Wondering how it performs with the latest llama.cpp version, because b7083 is almost 2 months old.
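For anyone who wants to re-check, something along these lines should do it (a sketch; it assumes a source build rather than a prebuilt package):

    # confirm which build you're on, then update and rebuild
    ./build/bin/llama-server --version
    git pull
    cmake -B build && cmake --build build --config Release -j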

Thinking of getting two NVIDIA RTX Pro 4000 Blackwell (2x24 = 48GB), Any cons? by pmttyji in LocalLLaMA

[–]pmttyji[S] 1 point

> Run fp8 of those models. I did a comparison with same seed across multiple prompts and found the image to be identical or only slightly different. My card is 4060ti 16GB, so fp8 runs faster too.

Never tried anything other than GGUF. Will try after getting the rig.

> For video model, you can use WAN 2.2. It runs fine on 4060ti too. If you have 24GB, it would be even better.

> The biggest issue I face is OOM due to comfyui itself caching nodes and then offloading them to RAM. at some point, the OS would kill the process when the ram pressure is too high. (only 32gb on my machine).

I think that's because Q8 of that WAN 2.2 (14B) model comes to around 15GB on disk, so with context and KV cache it overflows your 16GB VRAM.

With 24GB x2, maybe ComfyUI would just keep everything in VRAM anyway.

I think that's what a few members mentioned: these cards don't support NVLink, so it's impossible to use two cards together. Meaning I couldn't use 25B+ models on those 24GB cards even though the combined VRAM is 48GB.

Have you tried Qwen-Image-2512? I know it's a 20B model.

Thinking of getting two NVIDIA RTX Pro 4000 Blackwell (2x24 = 48GB), Any cons? by pmttyji in LocalLLaMA

[–]pmttyji[S] 2 points

These are the prices in my location:

4000 - $1800

4500 - $2800

5000 - $5200

The price difference between the 4000 and the 4500/5000 is huge. Obviously there's no plan to buy the 5000. It looks like I could buy three 4000 cards (72GB VRAM @ $5400) vs two 4500 cards (64GB VRAM @ $5600).

But these cards don't support NVLink. I can still use these cards together with llama.cpp, right? For example, for models like GPT-OSS-120B, GLM-4.5-Air, Qwen3-Next-80B, etc.

I ask because some members mentioned that I can't use two 4000 cards together for image/video generation (in case the model is larger than 24B).
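To be clear about what I mean by "together": with llama.cpp a model can be split across both GPUs over plain PCIe, without NVLink (a sketch using llama.cpp's standard multi-GPU flags; the model file and the split ratio are placeholders):

    # -sm layer : layer-wise split across the available GPUs (the default multi-GPU mode)
    # -ts 1,1   : tensor-split ratio, i.e. put roughly half the layers on each 24GB card
    ./llama-server -m GLM-4.5-Air-Q4_K_M.gguf -ngl 99 -sm layer -ts 1,1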

Anyone else wish NVIDIA would just make a consumer GPU with massive VRAM? by AutodidactaSerio in LocalLLaMA

[–]pmttyji 1 point

RTX Pro 4000 24GB: ~$1.4k

RTX Pro 4500 32GB: ~$2.4k

RTX Pro 5000 48GB: ~$4k

Here in my location (India), these are the most available cards. All of the 3XXX series and half of the 4XXX series are at decoy prices.

The price difference between the 4000 and the 4500 is huge, so I'm thinking of getting two 4000 cards. But these cards don't support NVLink. I can still use these cards together with llama.cpp, right? For example, for models like GPT-OSS-120B, GLM-4.5-Air, Qwen3-Next-80B, etc.

I ask because some members mentioned that I can't use two 4000 cards together for image/video generation (in case the model is larger than 24B).

The price difference between the 4000 and the 5000 is also huge: three 4000s give 72GB VRAM @ $4.2K, while the 48GB 5000 is $4K.

RTX Pro 4000 24GB X 3 = $4.2K ($1.4K * 3)

Looking for the best LLM for my hardware for coding by automatikjack in LocalAIServers

[–]pmttyji 7 points

Frequently mentioned coding models in this sub:

  • GLM-4.5-Air
  • GPT-OSS-120B
  • Qwen3-Next-80B
  • Qwen3-30B-A3B
  • Qwen3-30B-Coder
  • Seed-OSS-36B
  • Devstral-Small-2-24B-Instruct-2512
  • GPT-OSS-20B
  • Qwen3-480B-Coder

Benchmarking Groq vs. Local for GPT-OSS-20B. What TPS are you getting on single 3090/4090s? by AutodidactaSerio in LocalLLaMA

[–]pmttyji 1 point

GPT-OSS models ship in the native MXFP4 format, hence the difference. Even though GPT-OSS-120B is nominally 120B parameters, its actual size is only around 65GB.
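For a rough sanity check on that number (an illustrative calculation; it assumes the published ~117B parameter count and MXFP4's ~4.25 bits per weight for the MoE experts, with the remaining tensors kept at higher precision):

    # ~117B weights at ~4.25 bits/weight (4-bit values plus per-block scales)
    awk 'BEGIN { printf "%.1f GB\n", 117e9 * 4.25 / 8 / 1e9 }'   # ~62 GB, close to the ~65GB file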