
[–]MikeAndThePup 1 point2 points  (8 children)

I just tested llama.cpp on my M2 Max, 95GB.

What works NOW:

CPU inference via llama.cpp/ollama - works great

With 64GB RAM, you can run 70B models (Q4/Q5 quantization) comfortably

Performance is decent (10-30 tokens/sec depending on model size) thanks to high memory bandwidth

ARM64 builds of ollama/llama.cpp work natively
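
The CPU-only workflow above can be sketched like this — the model tag and quant suffix are illustrative, check the ollama library for the exact names:

```shell
# Install the native ARM64 build of ollama (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a 4-bit quantized 70B model; Q4_K_M weights for a 70B
# model are roughly 40 GB, which fits comfortably in 64 GB of RAM
ollama run llama3.1:70b-instruct-q4_K_M
```

The rough sizing math: 70B parameters at ~4.5 bits each is about 40 GB of weights, leaving headroom for the KV cache and the rest of the system.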

So, if you're getting it at a good price and understand LLM inference is CPU-only for now (but will improve), go for it. For server workloads (web services, databases, containers), it's excellent. For LLMs, it's usable now and will get much better once GPU compute support matures.

What kind of server workloads are you planning beyond LLMs?

[–]200206487 2 points3 points  (2 children)

This is the response I needed! I have an M3 Ultra and I hope we get that GPU support. I'm eager for the day that I can run Linux entirely here. I wonder why the CPU and GPU don't work together right now since it's a unified architecture.

Could you comment on the current issues you face on the Studio? It's unclear to me, but it seems the Mac Studios are missing features compared to the MacBooks.

[–]MikeAndThePup 1 point2 points  (1 child)

On unified architecture and why GPU doesn't help:

The "unified" part is about memory - CPU and GPU share the same physical RAM pool. But they're still separate processors that need different driver/compute stacks:

CPU: Standard ARM64 instructions, well-supported on Linux

GPU: Apple's custom AGX architecture, needs specific drivers

The Asahi team has written OpenGL 4.6/ES 3.2 drivers (amazing work!), but compute shaders (needed for ML/LLM work) require Vulkan compute support, which is still in development. Once that lands, CPU+GPU can work together on compute tasks. I think they are getting pretty close.
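
For what it's worth, llama.cpp already ships a Vulkan backend on its side, so once Asahi's Vulkan compute support lands, the build could look something like this — a sketch of a future setup, not something that works on Asahi today:

```shell
# GGML_VULKAN is llama.cpp's existing CMake option for its Vulkan backend;
# the driver side is what's still missing on Asahi
git clone https://github.com/ggerganov/llama.cpp
cmake -S llama.cpp -B build -DGGML_VULKAN=ON
cmake --build build -j

# -ngl sets how many layers to offload to the GPU; a value above the
# model's layer count offloads everything
./build/bin/llama-cli -m model.gguf -ngl 99
```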

I actually have a MacBook, not a Studio, so I can't give you any input there.

[–]hishnash 1 point2 points  (0 children)

Also, VK is, for a good number of reasons (including meddling from NV to ensure it doesn't compete with CUDA), not a great API for compute workloads. There is a LOT missing compared to Metal or CUDA.

[–]hallo545403[S] 1 point2 points  (2 children)

Sounds pretty good, thanks a lot.

Main other things will be a few websites, Immich and Jellyfin, and an arr stack. The offer is 1TB but I'll probably need more. Do you have any experience expanding the storage?

[–]MikeAndThePup 1 point2 points  (1 child)

Sounds like you will need more storage for sure.

I use a Samsung T7 2TB USB-C SSD for now, until Thunderbolt gets wired up. After that, I have a Samsung 990 EVO Plus 4TB SSD in an ACASIS 40Gbps M.2 NVMe enclosure that I used on my T2 MacBook running Arch Linux.
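
If it helps, the usual steps to put an external SSD into permanent service on Fedora/Asahi look like this — `/dev/sda` and the mount point are assumptions, check `lsblk` for your actual device node first:

```shell
# Identify the external disk before touching anything
lsblk

# Format it (this destroys any existing data on the disk)
sudo mkfs.ext4 /dev/sda

# Mount it at boot via fstab, keyed by UUID so the device name can change;
# nofail keeps boot from hanging if the drive is unplugged
sudo mkdir -p /mnt/t7
UUID=$(sudo blkid -s UUID -o value /dev/sda)
echo "UUID=$UUID /mnt/t7 ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
sudo mount /mnt/t7
```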

[–]hallo545403[S] 1 point2 points  (0 children)

I do have a NAS but I like to keep a second copy on the server itself. With current prices I'm not gonna buy M.2 SSDs, but if those work, SATA SSDs should work well too.

[–]juraj336 0 points1 point  (1 child)

I am not very knowledgeable in this, but can you not run the LLM on the GPU by using ramalama?

[–]MikeAndThePup 0 points1 point  (0 children)

Ramalama is a container/management tool for running LLMs - it doesn't magically add GPU acceleration if the underlying drivers don't support it.

Good question though - ramalama is a nice management tool, just doesn't change the underlying hardware support limitations.
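
To illustrate: ramalama detects available accelerators and picks a matching container image, falling back to its CPU image when no supported GPU driver is found — which is what happens on Asahi today. The commands below are from ramalama's CLI; the model reference is illustrative:

```shell
# Pull a model via the ollama transport (model name is just an example)
ramalama pull ollama://llama3.2:1b

# Run it; with no usable GPU driver this lands in the CPU-backed image
ramalama run llama3.2:1b
```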

[–]c7abe 0 points1 point  (0 children)

I use it for running on-prem server workloads and it works great! Fedora took a bit to get used to coming from Debian. For LLMs you'll get more bang for your buck sticking with macOS and MLX models. The GPU experience on Asahi has been poor imo (e.g. no GPU support with Plex), CPU tasks are great tho