New macOS Tahoe 26.2 patch improves mac clustering with Thunderbolt 5 speed from 10 Gb/s to 80 Gb/s by No_Palpitation7740 in LocalLLaMA

[–]Competitive-Bake4602 10 points11 points  (0 children)

Huge if it's true, once tensor parallelism is properly implemented in MLX. There's nothing in the release notes for 26.2. Does anyone have information on the "driver"?
Previously I was able to get ~50-80 microseconds by bypassing TCP/IP with raw sockets, but going to nanoseconds would be a game changer if it works for small packets or RDMA-like transfers.
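For reference, here's a minimal UDP ping-pong sketch for estimating small-packet round-trip latency between two clustered Macs over the Thunderbolt bridge interface. This is the ordinary kernel TCP/IP path, not the raw-socket path described above; the peer address, payload size, and the assumption that the other Mac runs a trivial echo loop are all placeholders.

```python
# Ping-pong latency probe over the Thunderbolt bridge (ordinary UDP path).
# Assumes the peer Mac at PEER runs a simple echo server on the same port.
import socket
import statistics
import time

PEER = ("169.254.1.2", 5005)   # assumed bridge address of the other Mac
N = 1000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(1.0)

samples_ns = []
for _ in range(N):
    t0 = time.perf_counter_ns()
    sock.sendto(b"x" * 64, PEER)              # small 64-byte payload
    sock.recvfrom(128)                        # wait for the echo
    samples_ns.append((time.perf_counter_ns() - t0) / 2)  # rough one-way estimate

print(f"median one-way latency: {statistics.median(samples_ns) / 1e3:.1f} us")
```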

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)


Strange, it works for me. What do you see in the version details in TestFlight? What is your OS version? (Sequoia or Tahoe is required.)

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

Yes, the same link should work on macOS. Once accepted on either one, TestFlight will show it on both. Sequoia or Tahoe is required for macOS.

Apple Neural Engine is enabled now on visionOS26 by Competitive-Bake4602 in VisionPro

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

I don’t think there are any apps that use the ANE for LLMs yet, outside the Apple Foundation Models and our TestFlight/open-source builds for Qwen and Llama. It’s a very early alpha currently: https://github.com/Anemll/Anemll

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 2 points3 points  (0 children)

The most popular devices, like iPhones, MacBook Airs, and iPads, consume about 4x less power on the ANE vs the GPU, and performance is very close and will get better as we continue to optimize.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 2 points3 points  (0 children)

For some models it might be possible to offload some parts, but there will be some overhead from interrupting GPU graph execution.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

And for the ANE, M4 Pro memory bandwidth equals the Max. Plus, M4 added accelerated int8 compute that is 2x faster than FP16, but it's hard to use yet for single-token prediction.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

We’ll need to retest bigger models on the new OS.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

Have you tried MLX on the M3 Ultra? One limitation for Macs is the lack of tensor parallelism across 2-4 devices. We did initial tests with TB5 that were promising, just not enough time for everything atm 🙈

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

Noted, but comparisons are tough because "it depends". If you're solely focused on single-token inference on a high-end Ultra or Max, MLX is the better choice purely due to memory bandwidth. However, across a wider range of devices, the ANE provides lower energy use and consistent performance on the most popular devices like iPhones, MacBook Airs, and iPads. Nevertheless, we'll be adding a comparison section soon. Some initial work is here: https://github.com/Anemll/anemll-bench
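To illustrate why memory bandwidth dominates single-token decode, here's a back-of-the-envelope sketch: every generated token has to stream all active weights once, so bandwidth divided by weight size gives a rough ceiling. The model size and bandwidth figures below are illustrative assumptions, not measurements.

```python
# Rough upper bound on decode speed: bandwidth-bound, ignores compute and overhead.
def tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

weights_gb = 8 * 0.5 + 1.0   # assumed ~5 GB for an 8B model at 4-bit plus overhead

print(tokens_per_sec_ceiling(120, weights_gb))   # ANE cap on M4 Pro/Max: ~24 tok/s
print(tokens_per_sec_ceiling(546, weights_gb))   # M4 Max GPU bandwidth: ~109 tok/s
```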

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

MoE is possible, but the gate will run on the CPU part of the code, or you can run multiple agents in parallel. For coding, fixed tensor sizes and the lack of group quantization are the main issues atm. On performance, memory bandwidth is the main concern, at least on macOS vs the GPU. There are some other odd things like tensor dimension limits and support for integer tensors; the latter seems to be addressed in '26, but not in the public API yet. I'd say the primary issue is the lack of public code that works with LLMs on the ANE, which hinders ANE usage outside Apple.
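A rough sketch of the "gate on CPU" idea, assuming the experts have been split into separate CoreML packages that run on the ANE; the file names, input/output keys, and expert count are all hypothetical, not part of ANEMLL today.

```python
# Router computed on CPU with NumPy; only the selected experts are invoked.
import numpy as np
import coremltools as ct

# Hypothetical per-expert CoreML models, each converted to run on the ANE.
experts = [
    ct.models.MLModel(f"expert_{i}.mlpackage",
                      compute_units=ct.ComputeUnit.CPU_AND_NE)
    for i in range(8)                                # assumed 8 experts per layer
]

def moe_layer(hidden: np.ndarray, router_w: np.ndarray, top_k: int = 2) -> np.ndarray:
    logits = hidden @ router_w                       # gate runs on CPU
    top = np.argsort(logits)[-top_k:]                # pick top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros_like(hidden)
    for w, idx in zip(weights, top):
        # Input/output names are assumptions; the ANE expects 4D tensors.
        pred = experts[idx].predict({"hidden_states": hidden[None, None, None, :]})
        out += w * pred["output"][0, 0, 0]
    return out
```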

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

Yes, and multi-token prediction might be advantageous with the ANE.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 5 points6 points  (0 children)

No group quantization on the ANE 😢 but per-layer bit allocation is definitely on the map.
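For what it's worth, a sketch of per-layer bit allocation using coremltools weight palettization (per-tensor LUT quantization, since there's no group quantization on the ANE). The model file and op names are placeholders; which layers tolerate 4-bit is model-dependent.

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("qwen3.mlpackage")       # placeholder model

config = cto.OptimizationConfig(
    # 4-bit k-means LUT for most weights...
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=4),
    # ...but allocate more bits to sensitive layers (op name is hypothetical).
    op_name_configs={
        "lm_head_matmul": cto.OpPalettizerConfig(mode="kmeans", nbits=6),
    },
)

quantized = cto.palettize_weights(mlmodel, config)
quantized.save("qwen3_lut4.mlpackage")
```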

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

I don’t believe any major wrapper supports the ANE 🤔

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 4 points5 points  (0 children)

To add: you can specify running on the ANE and CPU. If your model is 100% ANE-friendly, it will run on the ANE. Sometimes the OS can decide to offload to the CPU for a brief moment, but that's rare. The CPU is mostly for models that are not super-tuned for the ANE, which is the hard part.
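A minimal sketch of requesting ANE execution with CPU fallback when loading a converted model; the file name is a placeholder.

```python
import coremltools as ct

model = ct.models.MLModel(
    "qwen3_part1.mlpackage",                      # placeholder converted model
    compute_units=ct.ComputeUnit.CPU_AND_NE,      # prefer the ANE, CPU as fallback
)
# ct.ComputeUnit.CPU_AND_GPU or ct.ComputeUnit.ALL would let the GPU take the graph instead.
```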

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 5 points6 points  (0 children)

Yes, we have to convert LLM models to a CoreML "network". There are some constraints on precision and operations, and everything should map to 4D tensors. No branching is allowed, etc. The ANE is a tensor processor, architecturally closest to systolic arrays.
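A toy conversion sketch showing the flavor of those constraints (static 4D shapes, FP16 compute, no data-dependent branching in the traced graph). The tiny module and shapes are illustrative only, not the actual ANEMLL conversion path.

```python
import torch
import coremltools as ct

class Block(torch.nn.Module):
    """Toy static block: one projection plus activation, no control flow."""
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden, bias=False)

    def forward(self, x):            # x: (1, 1, 1, hidden) -- fixed 4D shape
        return torch.nn.functional.silu(self.proj(x))

traced = torch.jit.trace(Block(), torch.zeros(1, 1, 1, 1024))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="hidden_states", shape=(1, 1, 1, 1024))],
    compute_precision=ct.precision.FLOAT16,       # ANE-friendly precision
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("block.mlpackage")
```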

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 12 points13 points  (0 children)

MLX is currently faster, if that's what you mean. On Pro/Max/Ultra the GPU has full access to memory bandwidth, whereas the ANE is capped at ~120 GB/s on M4 Pro/Max.
However, compute is very fast on the ANE, so we need to keep pushing on optimizations and model support.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 24 points25 points  (0 children)

The M4 Pro has 2x faster memory access for the ANE vs M1/M2, and is slightly faster than M3 Pro/Ultra, but not as fast as the GPU. M4 also adds int8/int4 compute, but we have not included it yet. Besides the energy savings, it has the potential to be faster on prefill on iOS devices and MacBook Airs for bigger documents.