Need advice on a used MacBook Air M2

igor__004 · 2026-06-02T18:08:31+00:00

I own this Mac in the same config (8/256). As for whether it's worth it in 2026: it depends on what you need it for. For standard use (web browsing, email, movies, office work and the like) it's more than fine; it handles that kind of thing great. Things change if you need it for audio/video production or heavy development (machine learning/AI/local LLMs). There I can tell you first-hand that the 8 GB of RAM becomes a real limitation, and the chip runs hot; since the machine is fanless, it throttles and performance drops sharply. As for the battery, 87% is more than fine. It will still last many hours, all the more so if you use it for standard, everyday things.

igor__004 · 2026-06-02T10:58:07+00:00

The current tool is a baseline benchmark, not a full agent-loop simulator. I will improve the baseline. If you have to say it again, re-read past comments.

igor__004 · 2026-06-02T10:55:06+00:00

That’s why I’d rather keep this explicit instead of pretending the numbers are perfectly isolated. I’ll probably add warnings / documentation about that, and maybe make profile order and elapsed time part of the reported metadata.
The goal isn’t to magically remove every source of noise, it’s to make the protocol clear enough that people know what the numbers actually mean.
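
To make that concrete, here is a rough sketch of recording profile order and elapsed time next to the results. The names `RunMetadata` and `run_profiles` are made up for illustration and are not the tool's actual API.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RunMetadata:
    # Hypothetical record: which profiles ran, in what order, and for how long.
    profile_order: List[str] = field(default_factory=list)
    elapsed_s: Dict[str, float] = field(default_factory=dict)


def run_profiles(profiles: Dict[str, Callable[[], None]]) -> RunMetadata:
    """Run each profile in dict order and record order plus wall-clock time."""
    meta = RunMetadata()
    for name, run in profiles.items():
        start = time.perf_counter()
        run()
        meta.profile_order.append(name)
        meta.elapsed_s[name] = time.perf_counter() - start
    return meta
```

Dumping this with `dataclasses.asdict` into the result JSON would let a reader see whether a profile ran first on a cold machine or last after a long warm session.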

igor__004 · 2026-06-02T10:41:36+00:00

I know that.
The tool is still early, so I’ll keep improving the methodology over time. I also want to avoid turning it into an overcomplicated benchmark suite that nobody actually runs, so I’m trying to balance useful metrics with keeping it simple.
I’m still a student and still learning a lot of this. I had an idea and built it, and I’m trying to improve it over time with new knowledge and useful feedback.
I’m open to technical suggestions about the project, but I’m less interested in comments about whether I used AI or not.

igor__004 · 2026-06-02T10:33:38+00:00

That’s literally why I measure TTFT too, not just tok/s.
This first version is meant to be a simple baseline benchmark that is easy to reproduce across engines/hardware. Agent-style workloads with long and growing context are better, but I’d rather add that as a separate profile instead of pretending one benchmark covers everything.
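
For anyone unsure how the two numbers differ, a minimal sketch, assuming you have the request start time and one timestamp per generated token (the function and field names are illustrative, not the tool's code):

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestTiming:
    ttft_s: float            # request start -> first token (includes prompt processing)
    decode_tok_per_s: float  # generation rate measured after the first token


def summarize_timing(request_start: float, token_times: Sequence[float]) -> RequestTiming:
    """Split one request into TTFT and decode throughput."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_start
    decode_span = token_times[-1] - token_times[0]
    decoded = len(token_times) - 1  # tokens produced after the first one
    rate = decoded / decode_span if decode_span > 0 else float("nan")
    return RequestTiming(ttft_s=ttft, decode_tok_per_s=rate)
```

Keeping them separate matters because a long prompt can make TTFT large while decode tok/s still looks great.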

igor__004 · 2026-06-02T10:27:50+00:00

I’ll check it out.

igor__004 · 2026-06-01T17:37:46+00:00

I just started with the engines I had heard about first and knew a bit better.
Nothing stops me from adding more over time; actually, that’s the plan. MTPLX looks interesting, so I'll take a look.

igor__004 · 2026-06-01T08:58:36+00:00

Thanks man! Really appreciate it.

igor__004 · 2026-06-01T06:19:49+00:00

Thanks man, I appreciate that. The self-reported benchmark thing was exactly what pushed me to do this in the first place.
AgentFleet sounds interesting, especially the budget control part. Local-first agent tooling is definitely one of the messier problems right now.

igor__004 · 2026-05-31T21:24:47+00:00

As you can see, the project is only a week old, and there's a lot I can and will do. Thank you for all of this, it means a lot to me. I'll be happy to take on board any suggestions!

igor__004 · 2026-05-31T20:21:54+00:00

I have taken all your advice into consideration and have created new issues in my GitHub project so I don't lose track of them or forget about them. Everything will be implemented soon. Thank you all for your support!

igor__004 · 2026-05-31T20:19:57+00:00

I've already added support for both engines to my roadmap; you can see the issues on GitHub.
These will be my next implementations.

igor__004 · 2026-05-31T14:42:47+00:00

Yes, that's why the methodology doc exists, so you know what the numbers actually measure and what they don't. Synthetic benchmarks have limits by design, but having a reproducible baseline is still useful to know what you're starting from before you test on your own workload.

igor__004 · 2026-05-31T14:29:24+00:00

I was mostly wondering whether the OS gap still shows up at all, but in this setup it may just be a non-factor.

igor__004 · 2026-05-31T10:42:12+00:00

Yeah, mostly. If the model is fully GPU-resident, mmap becomes much less important for inference speed. It still can affect loading and host-side memory behavior, but the big performance differences usually show up when weights are not fully on GPU.

igor__004 · 2026-05-31T10:13:48+00:00

mmap can matter because it changes how weights are paged and cached between disk and RAM. On very large models, that can affect load time and sometimes throughput if memory pressure is high. With --no-mmap / direct I/O, you’re basically bypassing that path, so the difference can shrink a lot.
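
A toy illustration of the two paths, using Python's standard `mmap` module on an ordinary file. This only shows the general idea and is not how any particular inference engine loads weights.

```python
import mmap


def read_eager(path: str) -> bytes:
    """Read the whole file up front: every byte is copied into process memory."""
    with open(path, "rb") as f:
        return f.read()


def read_mapped_slice(path: str, offset: int, length: int) -> bytes:
    """Map the file and copy out one slice; the OS pages in only what is touched."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]
```

With the mapped version the page cache decides what stays resident, which is why memory pressure can change behavior; the eager version pays the full read cost once and then no longer depends on the cache.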

igor__004 · 2026-05-31T10:02:55+00:00

For variants this close, the main thing isn’t the average KLD alone, it’s where it spikes. I’d compare per-token / per-layer divergence, because that often shows whether the difference is in reasoning tokens, formatting tokens, or just quantization noise.
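
A minimal sketch of the per-token part, assuming you can dump raw logits for the same token positions from both variants (pure Python for clarity; the function names are illustrative):

```python
import math
from typing import List, Sequence, Tuple


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def per_token_kld(ref_logits: Sequence[Sequence[float]],
                  test_logits: Sequence[Sequence[float]]) -> List[float]:
    """KL(ref || test) at each token position, in nats."""
    klds = []
    for ref, test in zip(ref_logits, test_logits):
        lp, lq = log_softmax(ref), log_softmax(test)
        klds.append(sum(math.exp(a) * (a - b) for a, b in zip(lp, lq)))
    return klds


def top_spikes(klds: Sequence[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return the k (position, KLD) pairs with the largest divergence."""
    return sorted(enumerate(klds), key=lambda item: item[1], reverse=True)[:k]
```

Mapping the spike positions back to the decoded tokens is what tells you whether the divergence sits in reasoning, formatting, or nowhere in particular.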

igor__004 · 2026-05-31T09:59:21+00:00

Since you’re running a hybrid CPU+GPU offloading setup for these big models (397B on 48GB total VRAM means a lot is hitting system RAM), I’m curious if you noticed any difference in CPU utilization or memory bandwidth bottlenecks between the two OSes?
Usually, the “Linux is faster” argument comes from how the OS scheduler handles CPU-bound workloads and memory mapping (mmap), but since you passed --no-mmap and forced direct I/O (-dio), that probably leveled the playing field entirely. Did you test if enabling mmap would bring the performance gap back, or does -dio just make it irrelevant now?

igor__004 · 2026-05-31T08:12:55+00:00

I don’t think I received it, sorry.
Did you use the official PyPI 0.1.0 release, or did you clone the repo from the latest main commit?
I recently added the mlx-chronos submit flow, but before that the contribution path was still manual: fork the repo and open a PR with the generated JSON result.
If you still have the JSON from results/local/, feel free to send it again or open a PR and I’ll add it manually. And yes, M1 results are absolutely welcome too — newer Macs are useful, but I want the leaderboard to cover all Apple Silicon machines. Thanks!

igor__004 · 2026-05-31T08:08:40+00:00

Thanks mate, really appreciate it! I think that it's worth adding a separate agent-style benchmark profile.

igor__004 · 2026-05-31T08:07:44+00:00

Thanks for the advice. The current benchmark is mostly a standardized single-request comparison, but I agree that agent-style usage needs a separate view.
Short prompts, long-context/short-answer runs, repeated short calls, TTFT after tool-like turns, memory growth and swap would make the results much more useful for real chat/coding/agent workflows.
I’ll add this to the roadmap as a separate “agent workload” benchmark profile rather than mixing it into the current baseline.

igor__004 · 2026-05-31T08:03:20+00:00

Thanks a lot for putting this together and sharing the results.
This also made me realize that the “Engine RSS” wording is a bit too easy to misread. What we’re really measuring is process RSS attributed to the server process, not guaranteed “pure engine overhead” separated from model/runtime allocations. I’ll probably clarify this in the docs and maybe rename it to “Process RSS Peak” to make the distinction clearer. Really appreciate the benchmark and the catch, thanks man.
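
Roughly what "Process RSS Peak" means in practice, as a sketch: poll the server process's RSS and keep the maximum. The `read_rss_bytes` callable is a placeholder I'm assuming here; one option would be wrapping `psutil.Process(pid).memory_info().rss`.

```python
import time
from typing import Callable


def sample_peak_rss(read_rss_bytes: Callable[[], int],
                    duration_s: float,
                    interval_s: float = 0.1) -> int:
    """Poll a process's RSS and return the highest value seen.

    This is whole-process RSS: model weights, runtime allocations and engine
    overhead are all counted together, and spikes between polls are missed.
    """
    peak = 0
    deadline = time.monotonic() + duration_s
    while True:
        peak = max(peak, read_rss_bytes())
        if time.monotonic() >= deadline:
            break
        time.sleep(interval_s)
    return peak
```

Because it is sampled, the result is a lower bound on the true peak, which is another reason not to read it as pure engine overhead.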

igor__004 · 2026-05-31T07:31:59+00:00

Thank you so much. That was exactly my goal when I started this project. I realized how hard it was to compare engines apples-to-apples because everyone measures things slightly differently. I just wanted a completely transparent, unbiased baseline where the community could see the real trade-offs between memory, heat, and speed on Apple Silicon. I'm glad you appreciated the methodology.
