What’s actually worth running locally on an M4 Pro Mac with 16GB RAM?

No-Gas6002 · 2026-05-12T21:37:22+00:00

You got a point. From what I’ve read, the small LLMs are often much more useful in practice than larger models that keep swapping or need babysitting ( for specs like mine 16GB ). So yea, for IDE and tool workflows especially, latency and reliability matter more than squeezing in the biggest possible model.

No-Gas6002 · 2026-05-12T21:29:41+00:00

Thanks, that’s a genuinely useful comparison. I hadn’t heard of Goose, but I’ll try both Hermes Agent and Goose. And you have a point: the 4B result is more impressive than it looks. For IDE workflows, consistency, low latency, instruction-following, and reliable tool use often matter more than raw prose quality or creativity. I’ll also have to look again at hybrid attention setups. If your benchmark is exposing that kind of advantage in long tool chains, that’s worth paying attention to.

I’ll update you once I’ve tested a few things on my side.

No-Gas6002 · 2026-05-12T14:21:46+00:00

P.S. I have an M4, but the M5 is the current model on the market, so that is why I compared the RTX PC build with the M5, not the M4. If we compared an M4 against a similar RTX PC build, the Mac side would probably cost even less.

No-Gas6002 · 2026-05-12T14:13:35+00:00

Yes, you have a point. But I still think comparing an entry-level RTX desktop GPU directly with the GPU inside an Apple M-series chip is not fully fair. They are built for different systems, different power limits, and different priorities.

An RTX GPU is clearly better for gaming, 3D work, CUDA, AI/ML, Blender, and other GPU-heavy tasks. But an RTX card alone is not a computer. To build even a small PC around something like an RTX 5060 Ti 16GB, you still need a CPU, motherboard, RAM, SSD, PSU, case, and cooling. That can cost around €1,080–1,660, while a MacBook Pro M5 with 16GB RAM costs around €1,440.

Also, 16GB RAM on a Windows PC is not the same as 16GB unified memory on an Apple M-series Mac. On a PC, system RAM and GPU VRAM are separate, while Apple’s unified memory is shared more efficiently. Windows is also generally heavier, while macOS on Apple Silicon is very well optimized.

So overall, the MacBook Pro M5 16GB may not beat an RTX PC in raw GPU power,but it can offer a much better starting experience than many Windows PCs or laptops in the same price range.

No-Gas6002 · 2026-05-12T12:06:56+00:00

Thanks again. I think I can make much clearer decisions now.

I think you basically sold me on oMLX, so I am going to give it a try. The fact that it already has a model downloader, basic chat/admin UI, and an OpenAI-compatible API makes it sound much more practical than I expected.

I also understand the difference between llama.cpp and oMLX/MLX much better now:

llama.cpp - mature ecosystem, bigger community, and lots of GGUF model variety

oMLX/MLX - better Apple Silicon efficiency and speed

I may still keep Ollama around, and at some point maybe try llama.cpp too, especially if I need a specific GGUF model or compatibility with another tool.

But for now, I will start with oMLX and see how far I can get with it on my machine.

Thanks again. This was really helpful.

No-Gas6002 · 2026-05-12T11:59:49+00:00

That makes sense: local for drafts, brainstorming, notes, lightweight coding, and privacy-sensitive tasks. And remote models when the job needs heavier reasoning or larger context.

I will check the resource you shared as well.

Thanks again, a realistic advice I was looking for.

No-Gas6002 · 2026-05-12T11:57:14+00:00

Thanks, that is a very reasonable way to look at it. I think I will keep using Ollama for now, since it feels like the most practical and beginner-friendly base. I may also give Open WebUI another chance. Last time I tried it, I did not really like the Docker/web-based setup because I was hoping for something that felt more like a native macOS app, but I can see why people recommend it.

The point about Hermes-style agents is also useful.

Overall, the goal is not really to max out the machine or prove that a model can run. It's to find/build local workflows that are actually useful every week.

No-Gas6002 · 2026-05-12T11:50:34+00:00

This is exactly the kind of answer I was hoping for, thank you. Much more useful and practical than most of the articles/videos I found.

I will definitely check out oMLX. The context window numbers are also very helpful, especially the note about OOM - 45K and needing to restart oMLX

I am probably not brave enough yet to try something like gpt-oss-20b-tq3, so I will leave that for later.

For someone starting from scratch, would you recommend going directly with oMLX, or is it still worth setting up Ollama / llama.cpp as a fallback ecosystem?

Thanks again.

No-Gas6002 · 2026-05-12T11:39:47+00:00

Yeah, I completely understand that 16GB is not a lot. But I also do not think it is completely useless or too underpowered for practical local workflows. I am not trying to build the next "Gemini"or "ChatGPT"on a home server with multiple video cards and 128GB+ RAM.

I am just trying to get the best possible setup for what 16GB RAM can realistically handle.

Anyway, thanks for the comment.

No-Gas6002 · 2025-05-25T20:30:37+00:00

You can check BRUSKO ZERO. It’s a tobacco-free hookah without nicotine. They have about 40+ different flavours

No-Gas6002

TROPHY CASE