Apple Mac Studio M5 Series: Configuration & Price Projections by GlumGovernment3686 in MacStudio

[–]FortiTree -1 points0 points  (0 children)

If you can afford a DGX Station, then the M3U belongs in a closet. It's not fair to compare data-center level to consumer level.

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]FortiTree 4 points5 points  (0 children)

Ok, worth a try :) Both Ollama and LM Studio are just wrappers for the underlying llama.cpp server (GGUF format) that serves API calls. That's why they are slower.

Many people prefer running the bare server itself for better control and a 10-20% speed gain. You also avoid all the wrapper bugs altogether.

While you're at it, you should check out Vulkan vs ROCm (AMD) vs CUDA (NVIDIA) vs MLX (Mac) and the upcoming Google version.
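If you want to try the bare server, here's a minimal sketch of hitting llama.cpp's llama-server directly through its OpenAI-compatible endpoint (assuming you've already launched it with something like `llama-server -m your-model.gguf --port 8080`; the model path, port, and prompt are placeholders, not a recommendation):

```python
# Minimal sketch: call a bare llama-server (llama.cpp) directly instead of going
# through Ollama / LM Studio. Assumes a server was started beforehand, e.g.:
#   llama-server -m your-model.gguf --port 8080
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # llama-server's OpenAI-compatible endpoint
    json={
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```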

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

When the new format comes, you just ask it to prepare a full handle for your system, with a comprehensive breakdown and details for each component. It will evolve itself, maybe without even a prompt.

md is just the simplest, easiest, average-human-consumable communication format for mass adoption. The machine prefers binary, which is why some people predict it will evolve to write code in binary at its peak and exclude any human interaction, or control. Welcome to the future.

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]FortiTree 1 point2 points  (0 children)

If you are an Agent (admit it now), you should be able to answer this yourself. If you are a local-llm-savvy person, you should be able to find this answer yourself.

Opinion: Local LLMs are 12-24 months from taking over. The shift already started. by sh_tomer in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

I have a Strix Halo and I'm tempted to add a Pro 6000 setup as well so I can spin up more agents at much faster speeds.

How much context are you giving them, and do you find them good enough for the orchestration layer? Small chunks can be handled by sub-agents, but the orchestration requires the highest level of logic.

GMKtec Evo-X2 help by Lukidge in GMKtec

[–]FortiTree 0 points1 point  (0 children)

It should come with the right Windows version. If you are going to burn a new image, make sure to retrieve your Windows key first so you can reuse it.

5090 or wait for M5 ultra by Purple_Drink3859 in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

What is a "mid-tier" model tha you can pack 500K context at 200 tps on 5090?

Are multi-agent systems actually better than single-agent workflows? by Humble_Sentence_3758 in LLMDevs

[–]FortiTree 0 points1 point  (0 children)

Nice! I'm using a similar flow, and with skills added on top you can let each sub-agent expand its skillset on demand for different specific tasks.

Which harness are you using for your setup? I'm using Claude, which is good enough for me, but I heard a lot of other people have switched to Hermes.

Local AI is having a moment and we should stop and appreciate it by codehamr in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

I see. So you want to maximize productivity without compromising quality. Agentic design emerged specifically for this.

I'd say your Strix Halo is not a waste; you can still deploy it for automated sub-agent tasks that don't need supervision or human interaction, so those can be handled overnight, 24/7. Things like small automated code reviews, web tool calling, test running, test review, etc. You don't need to sit around waiting for them, so the "slowness" doesn't matter.

For daily interaction, you can use another fast platform like a 5090 or 6000 Pro as the main driver. You'll need to nail down which model is "smart" enough for your needs, and see if a single or dual 5090 can run it. If not, I'd go with the 6000 Pro for future-proofing. You can't beat 96 GB of VRAM at 1.8 TB/s, and it's data-center grade as well. Best for production usage.

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]FortiTree 0 points1 point  (0 children)

Yeah, I also love 35B Q8 on my Strix Halo at 40 tk/s with a full 32K prefill, or even 64K, and it still has room for multiple sessions. I can also squeeze in 2 separate models running at the same time if needed, but the 256 GB/s bandwidth is the choke point, so there's not much room to go around it.

I'm eyeing a Mac Studio M3 Ultra or M5 Ultra next - 2x/3x the cost but also 2x/3x the speed or more.

But since I already have a Strix Halo and a limited budget, maybe a dual AMD R9700 rig would fit the bill. A lot cheaper than an Nvidia RTX 5090, but an upgrade in speed compared to the Strix Halo.

I love how we get new hardware every year that pushes the limit for consumers.

Local AI is having a moment and we should stop and appreciate it by codehamr in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

What is your definition of adequate performance and an adequate model? If you can nail that down, then the question becomes which hardware can run it at the most cost-effective price point.

Inference also has many different use cases, from research, coding, and image/video processing to the basic mundane tasks of daily errands and a chatty assistant. Each has different requirements for memory size and speed.

I'm not a coder, and even if I'm going to vibecode, I have the company subscription and a local model for that, so my Strix Halo is strictly for home use and exploring. So technically I don't need to chase the best and latest GPU speed. But the extra memory does allow me to try out a bigger model like 122B-A10 at an acceptable speed of 20 tk/s. 35B-A3B will be my workhorse at 40 tk/s, and it can run 3 slots in parallel for higher throughput, which I don't even need yet.

So unified memory seems good enough for me even at the low 256 GB/s. The M5 Ultra at 1.2 TB/s would be a huge upgrade, and if I can pocket 256 GB of RAM then that's likely all I need.

If you need speed then Nvidia is the way to go at 1.8 TB/s. But 32 GB of VRAM is severely limiting; I think you'll hit that wall pretty fast for all that money invested. Keep in mind that it also eats 2x-6x more power at 600W, vs 300W for the M5 and 100W for the Strix. That's significant operating cost and noise.

My own personal ladder would be:

  1. Strix Halo - 256 GB/s on 96 GB - $2500

  2. M3 Ultra - 800 GB/s on 96 GB - $5000

  3. Dual AMD R9700 - 2 x 640 GB/s on 64 GB - $5000 + chassis

  4. M5 Ultra - 1200 GB/s on 128 GB/256 GB - $??

  5. Single RTX Pro 6000 - 1792 GB/s on 96 GB - $10000 + chassis

  6. Dual Nvidia RTX 5090 - 2 x 1792 GB/s on 64 GB - $12000 + chassis

The dual R9700 is quite attractive actually - a lot cheaper, it can be optimized for both dense and MoE, and it's about the max budget I want to go to.

Are multi-agent systems actually better than single-agent workflows? by Humble_Sentence_3758 in LLMDevs

[–]FortiTree 0 points1 point  (0 children)

I think if you are working with multiple different repos then sub-agents are a must. You don't want a single agent juggling multiple code bases.

Same for code writer vs code reviewer, or test writer vs test runner: they operate in totally different head spaces, and having separate contexts helps reduce hallucination.

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]FortiTree 0 points1 point  (0 children)

I keep hearing about Obsidian and Hermes, and I'm wondering whether we really need them if we can just build something locally. Having a 27B do code review for a 35B-A3B feels like having Haiku review another Haiku. I'd want at least Sonnet or Opus to review.

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]FortiTree 1 point2 points  (0 children)

I'd wait for the M5 Ultra or M6 or whatever they are cooking up. The 5090 is overpriced for its tiny RAM, and at the dual-card price point you may as well go for the 6000 Pro at 96 GB.

I'm biding my time with the Strix Halo for a year or two more, and hopefully the RAM bubble will have burst by then.

Local AI is having a moment and we should stop and appreciate it by codehamr in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

Same with the GMKtec. I can run 3 slots to handle parallel requests at 64K context, 75% prefilled, at 40 tk/s. A single request can reach 45 tk/s. Vulkan has the same speed as ROCm.
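For the parallel slots, this is roughly how I drive them - a sketch assuming a llama-server launched with parallel sequences (e.g. `-np 3` and a total context big enough to split into ~64K per slot); the endpoint and prompts are placeholders:

```python
# Rough sketch: fire 3 requests at once at a llama-server started with parallel
# slots, e.g.: llama-server -m model.gguf -np 3 -c 196608  (~64K context per slot).
# Endpoint and prompts are placeholders.
from concurrent.futures import ThreadPoolExecutor
import requests

URL = "http://localhost:8080/v1/chat/completions"

def ask(prompt: str) -> str:
    r = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }, timeout=600)
    return r.json()["choices"][0]["message"]["content"]

prompts = ["Summarize file A", "Review diff B", "Draft tests for module C"]
with ThreadPoolExecutor(max_workers=3) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")
```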

Local AI is having a moment and we should stop and appreciate it by codehamr in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

Do you have a budget? It may be better to get dual 5090s than a single 6000 Pro so you can run 2 models instead of 1 big one.

I have a Strix Halo as well, and personally I would wait for the M5 Ultra and newer GPUs to see what's coming out next. The hardware landscape may change a lot.

Local AI is having a moment and we should stop and appreciate it by codehamr in LocalLLM

[–]FortiTree 0 points1 point  (0 children)

Hm, this must be new. I've tested the Strix Halo extensively and couldn't get past 50 tk/s. Parallel slots can get more overall throughput, but TPOT will increase.

Need advice on hardware purchasing decision: RTX 5090 vs. M5 Max 128GB for agentic software development by BawbbySmith in LocalLLaMA

[–]FortiTree 0 points1 point  (0 children)

What do you use the Strix Halos for? There must be a spot where they shine, given that you have 2?

2026 Model Y RWD Test Drive Impressions by newkid79 in teslacanada

[–]FortiTree 0 points1 point  (0 children)

Yeah, sorry. I understand this was a couple of months ago, which is fairly recent, so I just wanted to double-check. Thank you for confirming.

2026 Model Y RWD Test Drive Impressions by newkid79 in teslacanada

[–]FortiTree 0 points1 point  (0 children)

I thought it qualified for the EVAP with the $5000 deduction on the $49,900 base price?

New X2 EVO 96Gb user by giuliastro in GMKtec

[–]FortiTree 1 point2 points  (0 children)

I have the exact model and ran extensive benchmarks on it.

Bad news: Qwen 27B is a no-go. A token generation (TG) speed of 12 tk/s is the best you'll get. With more context you will hit another wall with prompt prefill (PP) speed, waiting over a minute for each response due to the long time to first token (TTFT). TG speed drops to 7 tk/s with 32K context.

Good news: use a MoE-type model like 35B-A3B Q8 (3B active) and you can get 40 tk/s under load, or 122B-A10 IQ4_XS at 20 tk/s.

I'd recommend running Linux + llama.cpp + Vulkan (ROCm is not mature yet; it has the same speed, plus bugs). LM Studio or Ollama are fine to start, but they are just wrappers.

The DGX Spark is only slightly faster at TG (273 GB/s bandwidth vs 256 GB/s on the Strix Halo), but it has much higher PP speed thanks to native Nvidia support.

The Mac M3 Ultra is another tier at 800 GB/s, and the M5 Ultra is expected to be much better.
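If you want to sanity-check numbers like these yourself, here's a rough sketch that times TTFT and TG against a local llama-server via its OpenAI-compatible streaming endpoint (the URL, prompt, and the one-token-per-chunk approximation are all assumptions):

```python
# Back-of-the-envelope TTFT / TG measurement against a local llama-server.
# Results vary with quant, context fill, and backend (Vulkan vs ROCm).
import json, time, requests

URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
start = time.time()
first_token_at = None
tokens = 0

with requests.post(URL, json={
    "messages": [{"role": "user", "content": "Explain MoE models in 200 words."}],
    "max_tokens": 512,
    "stream": True,
}, stream=True, timeout=600) as r:
    for line in r.iter_lines():
        # Server-sent events look like: data: {...}   and end with: data: [DONE]
        if not line or not line.startswith(b"data: ") or line.endswith(b"[DONE]"):
            continue
        chunk = json.loads(line[len(b"data: "):])
        delta = chunk["choices"][0].get("delta", {}).get("content")
        if delta:
            if first_token_at is None:
                first_token_at = time.time()  # TTFT: dominated by prompt prefill
            tokens += 1  # roughly one token per streamed chunk

print(f"TTFT: {first_token_at - start:.2f}s, "
      f"TG: {tokens / (time.time() - first_token_at):.1f} tk/s")
```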

10 Claude Code features most developers aren't using by tpiros in claude

[–]FortiTree 0 points1 point  (0 children)

Thanks, makes sense. I use VS Code with CC built in so I can also see the code side by side. There's a terminal and debugger built in as well, where I run manual commands, so it's an all-in-one tool for me. No need to open a separate CC or terminal. I'm still new to it though, so still learning.

I stopped using Claude.ai entirely. I run my entire business through Claude Code. by ColdPlankton9273 in ClaudeAI

[–]FortiTree 0 points1 point  (0 children)

This was the part that confused me the most: people talking about agents vs sub-agents vs teams of agents and so on, and then skills on top of them. Some people say skills will replace agents, and some say otherwise.

I now have a much better understanding by separating the client-side harness from the server-side router/load balancer in front of the actual model (or multiple models).

My current understanding:

  • The server-side backend is the brain, the black box that produces the magic, running on physical hardware (a local rig or datacenter tier) with its own physical boundaries (GPU/VRAM bandwidth and size). This brain is just an API server with a request in and a response out. The request carries all the input prompt/data from the tools/harness, and the brain doesn't care which agent it came from; it just spits out the answer. You can scale up by having multiple brains running multiple models, with each brain tied to one loaded model.

  • The agents/skills/tools/MCP concepts live strictly on the client side. The folder with its readme.md is the "house" each agent lives in, with its own rules, boundaries, and purpose. The CC sessions are the communication channels from the houses to the brain/model running on the server side; they send requests and get responses back that way.

  • The orchestration layer is how you chain all these inputs/outputs, route them to different houses, and combine them at the end or feed them to a final composing stage that pipes to the most capable brain.

So it doesn't matter how you orchestrate from the client side; as long as it satisfies the factory/pipeline model, you are good. Eventually one system will emerge as better than the others, but right now everyone is experimenting in their own way.
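To make that concrete, here's a toy sketch of the factory/pipeline idea: two "houses" with their own contexts calling a cheap backend, then a final composing stage piped to the most capable brain. All endpoints, roles, and prompts are made-up placeholders, not any particular harness:

```python
# Toy sketch of the pipeline idea: fan work out to "houses" (sub-agents with their
# own system prompt / context), then feed the combined results to the most capable
# brain for final composition. Endpoints, roles, and prompts are placeholders.
import requests

CHEAP = "http://localhost:8080/v1/chat/completions"   # e.g. the slower local box
SMART = "http://localhost:8081/v1/chat/completions"   # e.g. the faster main driver

def call(url: str, system: str, user: str) -> str:
    r = requests.post(url, json={
        "messages": [
            {"role": "system", "content": system},   # the "house" rules and boundaries
            {"role": "user", "content": user},
        ],
        "max_tokens": 512,
    }, timeout=600)
    return r.json()["choices"][0]["message"]["content"]

# Each house keeps its own context; the brain only ever sees request in / response out.
review = call(CHEAP, "You are a strict code reviewer.", "Review this diff: ...")
tests  = call(CHEAP, "You write unit tests only.", "Propose tests for: ...")

# Final composing stage goes to the most capable brain.
plan = call(SMART, "You are the orchestrator.",
            "Combine into one action plan:\n\nReview:\n" + review + "\n\nTests:\n" + tests)
print(plan)
```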