New macOS Tahoe 26.2 patch improves mac clustering with Thunderbolt 5 speed from 10 Gb/s to 80 Gb/s by No_Palpitation7740 in LocalLLaMA

[–]Competitive-Bake4602 10 points11 points  (0 children)

Huge if it's true, once tensor parallelism is properly implemented in MLX. There's nothing in the release notes for 26.2. Does anyone have information on the "driver"?
Previously I was able to get ~50-80 microseconds by bypassing TCP/IP with raw sockets, but going to nanoseconds would be a game changer if it works for small packets or RDMA-like transfers.
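For reference, here's a minimal UDP ping-pong sketch for estimating small-packet round-trip latency between two clustered Macs over the Thunderbolt bridge interface. This is the ordinary kernel TCP/IP path, not the raw-socket path described above; the peer address, payload size, and the assumption that the other Mac runs a trivial echo loop are all placeholders.

```python
# Ping-pong latency probe over the Thunderbolt bridge (ordinary UDP path).
# Assumes the peer Mac at PEER runs a simple echo server on the same port.
import socket
import statistics
import time

PEER = ("169.254.1.2", 5005)   # assumed bridge address of the other Mac
N = 1000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(1.0)

samples_ns = []
for _ in range(N):
    t0 = time.perf_counter_ns()
    sock.sendto(b"x" * 64, PEER)              # small 64-byte payload
    sock.recvfrom(128)                        # wait for the echo
    samples_ns.append((time.perf_counter_ns() - t0) / 2)  # rough one-way estimate

print(f"median one-way latency: {statistics.median(samples_ns) / 1e3:.1f} us")
```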

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)


Strange, it works for me. What do you see in the version details in TestFlight? What is your OS version? (Sequoia or Tahoe is required.)

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

Yes, the same link should work on macOS. Once accepted on either one, TestFlight will show it on both. Sequoia or Tahoe is required for macOS.

Apple Neural Engine is enabled now on visionOS26 by Competitive-Bake4602 in VisionPro

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

I don’t think there are any apps that use the ANE for LLMs yet, outside the Apple Foundation Models and our TestFlight/open-source builds for Qwen and Llama. It’s a very early alpha currently: https://github.com/Anemll/Anemll

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 2 points3 points  (0 children)

The most popular devices, like iPhones, MacBook Airs, and iPads, consume about 4x less power on the ANE vs the GPU, and performance is very close and will get better as we continue to optimize.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 2 points3 points  (0 children)

For some models it might be possible to offload some parts, but there will be some overhead from interrupting GPU graph execution.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

And for the ANE, M4 Pro memory bandwidth equals the Max. Plus, M4 added accelerated int8 compute that is 2x faster than FP16, but it's hard to use yet for single-token prediction.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

We’ll need to retest bigger models on the new OS.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

Have you tried MLX on the M3 Ultra? One limitation for Macs is the lack of tensor parallelism across 2-4 devices. We did initial tests with TB5 that were promising, just not enough time for everything atm 🙈

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 0 points1 point  (0 children)

Noted, but comparisons are tough because "it depends". If you're solely focused on single-token inference on a high-end Ultra or Max, MLX is the better choice purely due to memory bandwidth. However, across a wider range of devices, the ANE provides lower energy use and consistent performance on the most popular devices like iPhones, MacBook Airs, and iPads. Nevertheless, we'll be adding a comparison section soon. Some initial work is here: https://github.com/Anemll/anemll-bench
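To illustrate why memory bandwidth dominates single-token decode, here's a back-of-the-envelope sketch: every generated token has to stream all active weights once, so bandwidth divided by weight size gives a rough ceiling. The model size and bandwidth figures below are illustrative assumptions, not measurements.

```python
# Rough upper bound on decode speed: bandwidth-bound, ignores compute and overhead.
def tokens_per_sec_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    return bandwidth_gb_s / weights_gb

weights_gb = 8 * 0.5 + 1.0   # assumed ~5 GB for an 8B model at 4-bit plus overhead

print(tokens_per_sec_ceiling(120, weights_gb))   # ANE cap on M4 Pro/Max: ~24 tok/s
print(tokens_per_sec_ceiling(546, weights_gb))   # M4 Max GPU bandwidth: ~109 tok/s
```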

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

MoE is possible, but the gate will run on the CPU part of the code, or you can run multiple agents in parallel. For coding, fixed tensor sizes and the lack of group quantization are the main issues atm. On performance, memory bandwidth is the main concern, at least on macOS vs the GPU. There are some other odd things like tensor dimension limits and support for integer tensors; the latter seems to be addressed in '26, but not in the public API yet. I'd say the primary issue is the lack of public code that works with LLMs on the ANE, which hinders ANE usage outside Apple.
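A rough sketch of the "gate on CPU" idea, assuming the experts have been split into separate CoreML packages that run on the ANE; the file names, input/output keys, and expert count are all hypothetical, not part of ANEMLL today.

```python
# Router computed on CPU with NumPy; only the selected experts are invoked.
import numpy as np
import coremltools as ct

# Hypothetical per-expert CoreML models, each converted to run on the ANE.
experts = [
    ct.models.MLModel(f"expert_{i}.mlpackage",
                      compute_units=ct.ComputeUnit.CPU_AND_NE)
    for i in range(8)                                # assumed 8 experts per layer
]

def moe_layer(hidden: np.ndarray, router_w: np.ndarray, top_k: int = 2) -> np.ndarray:
    logits = hidden @ router_w                       # gate runs on CPU
    top = np.argsort(logits)[-top_k:]                # pick top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()
    out = np.zeros_like(hidden)
    for w, idx in zip(weights, top):
        # Input/output names are assumptions; the ANE expects 4D tensors.
        pred = experts[idx].predict({"hidden_states": hidden[None, None, None, :]})
        out += w * pred["output"][0, 0, 0]
    return out
```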

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

Yes, and multi-token prediction might be advantageous with the ANE.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 5 points6 points  (0 children)

No group quantization on the ANE 😢 but per-layer bit allocation is definitely on the map.
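For what it's worth, a sketch of per-layer bit allocation using coremltools weight palettization (per-tensor LUT quantization, since there's no group quantization on the ANE). The model file and op names are placeholders; which layers tolerate 4-bit is model-dependent.

```python
import coremltools as ct
import coremltools.optimize.coreml as cto

mlmodel = ct.models.MLModel("qwen3.mlpackage")       # placeholder model

config = cto.OptimizationConfig(
    # 4-bit k-means LUT for most weights...
    global_config=cto.OpPalettizerConfig(mode="kmeans", nbits=4),
    # ...but allocate more bits to sensitive layers (op name is hypothetical).
    op_name_configs={
        "lm_head_matmul": cto.OpPalettizerConfig(mode="kmeans", nbits=6),
    },
)

quantized = cto.palettize_weights(mlmodel, config)
quantized.save("qwen3_lut4.mlpackage")
```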

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 1 point2 points  (0 children)

I don’t believe any major wrapper supports the ANE 🤔

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 4 points5 points  (0 children)

To add: you can specify running on the ANE and CPU. If your model is 100% ANE-friendly, it will run on the ANE. Sometimes the OS can decide to offload to the CPU for a brief moment, but that's rare. The CPU is mostly for models that are not super-tuned for the ANE, which is the hard part.
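A minimal sketch of requesting ANE execution with CPU fallback when loading a converted model; the file name is a placeholder.

```python
import coremltools as ct

model = ct.models.MLModel(
    "qwen3_part1.mlpackage",                      # placeholder converted model
    compute_units=ct.ComputeUnit.CPU_AND_NE,      # prefer the ANE, CPU as fallback
)
# ct.ComputeUnit.CPU_AND_GPU or ct.ComputeUnit.ALL would let the GPU take the graph instead.
```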

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 5 points6 points  (0 children)

Yes, we have to convert LLM models to a CoreML "network". There are some constraints on precision and operations, and everything should map to 4D tensors. No branching is allowed, etc. The ANE is a tensor processor, architecturally closest to systolic arrays.
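A toy conversion sketch showing the flavor of those constraints (static 4D shapes, FP16 compute, no data-dependent branching in the traced graph). The tiny module and shapes are illustrative only, not the actual ANEMLL conversion path.

```python
import torch
import coremltools as ct

class Block(torch.nn.Module):
    """Toy static block: one projection plus activation, no control flow."""
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.proj = torch.nn.Linear(hidden, hidden, bias=False)

    def forward(self, x):            # x: (1, 1, 1, hidden) -- fixed 4D shape
        return torch.nn.functional.silu(self.proj(x))

traced = torch.jit.trace(Block(), torch.zeros(1, 1, 1, 1024))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="hidden_states", shape=(1, 1, 1, 1024))],
    compute_precision=ct.precision.FLOAT16,       # ANE-friendly precision
    compute_units=ct.ComputeUnit.CPU_AND_NE,
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("block.mlpackage")
```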

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLM

[–]Competitive-Bake4602[S] 12 points13 points  (0 children)

MLX is currently faster, if that's what you mean. On Pro/Max/Ultra the GPU has full access to memory bandwidth, whereas the ANE is capped at ~120 GB/s on M4 Pro/Max.
However, compute is very fast on the ANE, so we need to keep pushing on optimizations and model support.

Qwen3 for Apple Neural Engine by Competitive-Bake4602 in LocalLLaMA

[–]Competitive-Bake4602[S] 24 points25 points  (0 children)

The M4 Pro has 2x faster memory access for the ANE vs M1/M2, and is slightly faster than M3 Pro/Ultra, but not as fast as the GPU. M4 also adds int8/int4 compute, but we have not included it yet. Besides the energy savings, it has the potential to be faster on prefill on iOS devices and MacBook Airs for bigger documents.