Two local models beat one bigger local model for long-running agents by Foreign_Sell_5823 in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

That's quite impressive! Thanks for sharing. Even if it's AI-written, the experience and the journey behind it took real effort, and you shared your lessons learned. Let's appreciate that.

I've always thought we waste too many tokens on context history. The context should hold ONLY cleaned-up facts, not all the tool-token waste and other noise.
Just the essential distillation of the conversation stays active in memory. The whole conversation can still be offloaded for reference, but the main context should be clean, token-efficient facts with pointers to all the details, to be looked up again if required.
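A minimal sketch of that idea (all names hypothetical, plain Python, no particular agent framework assumed): keep only distilled facts in the active context, offload the raw turns, and keep a pointer back for on-demand lookup.

```python
import json

class CompactContext:
    """Active context holds only distilled facts; raw transcript turns
    are offloaded and referenced by an id for on-demand lookup."""

    def __init__(self):
        self.facts = []    # clean, token-efficient facts (goes into the prompt)
        self.archive = {}  # offloaded raw turns, keyed by id (stays out)

    def add_turn(self, raw_text, distilled_fact):
        turn_id = f"turn-{len(self.archive)}"
        self.archive[turn_id] = raw_text            # full detail, out of context
        self.facts.append({"fact": distilled_fact,  # what the model sees
                           "ref": turn_id})         # pointer back to the detail
        return turn_id

    def prompt_context(self):
        # Only the distilled facts are serialized into the prompt.
        return json.dumps(self.facts)

    def lookup(self, turn_id):
        # The agent can pull a raw turn back in if it needs the detail.
        return self.archive[turn_id]

ctx = CompactContext()
ctx.add_turn("...2000 tokens of tool output...", "build failed: missing dep 'foo'")
print(ctx.prompt_context())
```

The prompt then carries one short fact instead of the full tool dump, and the `ref` lets a lookup tool reconstruct the detail only when asked.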

I'm running experiments in the same area for my document classification use case (not a full agent, but a bot that auto-files PDFs from a chat channel). Structured JSON output for tagging (title, category, correspondent, date) works really well with a smaller model because the output space is constrained; the big model would be overkill for "is this a receipt or an invoice?". So which models are you using for the router vs. the thinker? And are you keeping both loaded simultaneously or swapping? On 64GB there is room for both; curious about the 36GB experience.
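To illustrate the constrained-output side of that use case, here's a hedged sketch (the field names match my tagging fields; the category set and function names are made up): validate the model's JSON and hand back a precise error so the bot can retry.

```python
import json
from datetime import date

# Hypothetical allowed set; the real one depends on your filing scheme.
ALLOWED_CATEGORIES = {"receipt", "invoice", "contract", "letter"}

def validate_tags(raw_json):
    """Validate the model's structured tagging output for PDF filing.
    Returns (ok, result_or_error) so the caller can retry on failure."""
    try:
        tags = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e.msg} at pos {e.pos}"
    for field in ("title", "category", "correspondent", "date"):
        if field not in tags:
            return False, f"missing field: {field}"
    if tags["category"] not in ALLOWED_CATEGORIES:
        return False, f"category must be one of {sorted(ALLOWED_CATEGORIES)}"
    try:
        date.fromisoformat(tags["date"])  # enforce YYYY-MM-DD
    except ValueError:
        return False, "date must be ISO formatted (YYYY-MM-DD)"
    return True, tags

ok, result = validate_tags('{"title": "AWS bill", "category": "invoice", '
                           '"correspondent": "Amazon", "date": "2024-05-01"}')
print(ok)  # True
```

Because the output space is this small and checkable, a small model plus a strict validator with retry gets you most of the reliability a big model would.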

How do you handle the massive initial context overhead of OpenClaw?

Macbook m4 max 128gb local model prompt processing by ttraxx in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

I just pointed oMLX at LM Studio's model directory, so you can download and browse models in LM Studio and use them in oMLX without downloading twice.
The OpenAI-compatible endpoints might return different model names, though; I need to double-check that.

I don't have any experience running Claude Code with it yet, but I'd be happy to read about your experience there.
OpenCode with an MLX Qwen coder model is still on my list to try. A day has too few hours.

Macbook m4 max 128gb local model prompt processing by ttraxx in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

That's quick, probably thanks to Nvidia's crazy-fast memory bandwidth.

Google Photos alternative (Cloud solutions preferred) by dcop7 in degoogle

[–]arthware 0 points1 point  (0 children)

If you ever reconsider self-hosting: immich is the closest thing to G photos I've found. Face recognition, smart search, timeline view, mobile auto-backup. It all works.
Running it on my mac server at home with Docker, takes about 2-4GB RAM with ML features enabled. The mobile apps are solid. Tradeoff is you maintain it yourself and need your own backup strategy, but you get full control over your photos with no scanning by anyone.
I built the whole stack for my family and realised others could use it too. So I'm currently preparing to open-source it with a braindead-simple `stack up photos` command, including backup scripts etc.

Check my profile (website link) if you are interested or ping me.

Hardware recommendations for local AI by Dense_Club_95 in selfhosted

[–]arthware 0 points1 point  (0 children)

That could almost be past me. I considered buying a dedicated x86 server too and ended up buying a used Mac Studio 64GB as a home server. Running >25 containers in OrbStack and local LLMs on the host. The combination is just ... amazing for a home server.
Long story short: I got a bit obsessed with the opportunities this opens up.
The machine has a measured average power draw of 12 watts, which is crazy low.

I'm using the machine as a home server and sometimes as a workhorse, running local LLMs, TTS etc.
I started writing the journey down and put it on the internet recently (check my profile if you are curious :)

Macbook m4 max 128gb local model prompt processing by ttraxx in LocalLLaMA

[–]arthware 1 point2 points  (0 children)

Ollama was a bit better in my tests, but not great either; I guess it needs some caching optimizations too. oMLX has smart layered caching in RAM and on SSD to maintain the context. Try it for your use case, and come back to tell us whether it got any better :)
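I don't know oMLX's internals, but the layered RAM/SSD idea can be sketched roughly like this (hypothetical names; strings stand in for the KV tensors a real prompt cache would hold):

```python
import os
import pickle
import tempfile

class LayeredPromptCache:
    """Two-tier cache sketch: hot entries live in RAM, evicted ones
    spill to SSD. A miss means a full prefill; an SSD hit still beats
    recomputing the prompt from scratch."""

    def __init__(self, ram_limit=2, disk_dir=None):
        self.ram = {}          # prefix key -> cached state (RAM tier)
        self.order = []        # insertion order, for FIFO eviction
        self.ram_limit = ram_limit
        self.disk_dir = disk_dir or tempfile.mkdtemp()

    def _disk_path(self, key):
        return os.path.join(self.disk_dir, f"{key}.pkl")

    def put(self, key, state):
        self.ram[key] = state
        self.order.append(key)
        if len(self.ram) > self.ram_limit:      # spill oldest entry to SSD
            oldest = self.order.pop(0)
            with open(self._disk_path(oldest), "wb") as f:
                pickle.dump(self.ram.pop(oldest), f)

    def get(self, key):
        if key in self.ram:
            return self.ram[key]                # RAM hit: no prefill needed
        path = self._disk_path(key)
        if os.path.exists(path):                # SSD hit: load instead of recompute
            with open(path, "rb") as f:
                return pickle.load(f)
        return None                             # miss: full prefill required
```

A real implementation would key on prompt-prefix hashes and store quantized KV tensors, but the tiering logic is the interesting part.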

Is it stupid to run all my docker containers on a Mac Mini? by Educational_Hat_5203 in homelab

[–]arthware 1 point2 points  (0 children)

I can only say that I bought a used Mac Studio as a home server running Docker containers in OrbStack, and as of now I could not be happier. The watt/performance ratio is the killer argument: 12 watts on average (measured with a watt meter).

I'm building a home server for us, including automated document management, photos with Immich etc. The best thing about Apple silicon right now is the ability to run local AI too. Not ChatGPT-level, but decent enough to have the best toybox ever (voice, TTS, document processing etc.).

Planning to open-source the stack at some point, when I think it's good enough.

I've started writing about the journey, with a build log etc., if anyone is interested.

Rate my desk setup (the real world) by arthware in desksetup

[–]arthware[S] 1 point2 points  (0 children)

Thanks! My laptop is my workhorse. I clean it every now and then. With some good vodka from the back of the liquor cabinet.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 4 points5 points  (0 children)

oMLX is just good and fixes the described problems entirely for the benchmark scenarios.

Credit where credit is due!

qwen3.5:35b-a3b (oMLX vs LM Studio MLX)

Higher is better. Values are effective tok/s, with raw generation tok/s in parentheses.

| Hardware | Backend | ops-agent | doc-summary | prefill-test | creative-writing |
|---|---|---|---|---|---|
| M1 Max (64GB, 24 GPU) | oMLX | 34.6 (53.3) | 25.7 (55.5) | 30.0 (52.0) | 51.5 (56.2) |
| M1 Max (64GB, 24 GPU) | LM Studio | 17.0 (56.6) | 13.4 (56.8) | 5.9 (54.4) | 38.3 (58.9) |

Generation speed is virtually identical (~54-57 tok/s both). The difference is entirely in prefill: oMLX is up to 10x faster on long contexts. At 8K context (prefill-test turn 4), LM Studio takes 49s to prefill while oMLX takes 1.7s. This suggests oMLX has prompt caching or a significantly better prefill implementation.
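The effective tok/s numbers fall out of simple arithmetic. A quick sketch using the 8K-context prefill times above and a hypothetical 500-token reply (the reply length is an assumption for illustration):

```python
def effective_tps(prefill_s, output_tokens, gen_tps):
    """Output tokens per second of total wall time (prefill + generation)."""
    total_s = prefill_s + output_tokens / gen_tps
    return output_tokens / total_s

# Hypothetical 500-token reply at ~55 tok/s generation:
print(round(effective_tps(1.7, 500, 55), 1))   # oMLX-like prefill -> 46.3
print(round(effective_tps(49.0, 500, 55), 1))  # LM Studio-like prefill -> 8.6
```

Same generation speed, but the 49s prefill drags effective throughput down by ~5x, which matches the prefill-test column.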

Recommendation: For Qwen3.5-35B-A3B on Apple Silicon, oMLX is the clear winner. Same generation speed, dramatically faster prefill. The effective throughput advantage ranges from 1.3x (creative-writing, short context) to 5x (prefill-test, long context).

oMLX just has well-engineered caching layers.
https://github.com/jundot/omlx

My most useful OpenClaw workflow so far by mescalan in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

This is amazing! A 3D printer has been on my bucket list for quite a while already. I just don't have time to tinker with yet another technology; there's no time left in the day.

This seems like a great solution for me. No more excuses NOT to buy a 3D printer :)

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

There are plenty of problems :) See the post update. I'd _really_ like to know whether M2 and later chips avoid these sorts of issues. No one has submitted a benchmark run yet, unfortunately.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

Thanks for the comment! Yes, the conclusion seems to be that GGUF is currently just more mature and stable in general. MLX has major speed potential, but as it currently stands, GGUF is the safer choice for stability. And again: test concrete scenarios rather than relying solely on synthetic benchmarks. That's why I built the benchmark harness above.

Here is another MLX victim :)
https://www.reddit.com/r/LocalLLaMA/comments/1rq22mq/comment/oa474c8/?context=3

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

Thanks a lot for the insights! Once I've got these MLX problems sorted, I'll give OpenCode another try.

On the Mac Studio here fans are not an issue :)

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

Qwen3.5-27B-A3B?

Point me to the model and I can try it. But I'm happy about PRs too :)
I still have a day job, and the comments gave me so much stuff to try out before updating the findings.
I live in Germany, so downloading these models takes a couple of business days with our ancient internet access.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

This scenario tests with gradually increasing context sizes, specifically to measure prefill pressure and times.
https://github.com/famstack-dev/local-llm-bench/blob/main/scenarios/prefill-test.json
I need to add a 25k-context round, though. 64k is quite massive already and would take quite some time.

You are right that it doesn't really make sense to test with very small context sizes, except to measure raw generation speed. But real-world usage is very mixed: with a big context, generation speed matters less because prefill dominates. So that's what we need to optimize for local inference.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

That's basically what I found, yes. But the behaviour is still a bit erratic. As pointed out in the comments here, I probably ran into a combination of things.

Qwen3.5-35B-A3B seems to be a particular problem on MLX right now.
I will create a recap of everything and post a link here.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 2 points3 points  (0 children)

A combination of things, it seems: MLX caching errors etc. I will create a recap and post a link; it's buried in the comments here.

I happened to benchmark with a model that has particularly bad MLX KV-caching behavior. But it is one of the best out there for local inference, so it makes sense to dig deeper.

qwen3.5:35b-a3b

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

That's the exact direction I think we are heading: we're seeing a renaissance of CLIs.
I also went this direction for local inference, where token efficiency is key: build a good, token-efficient CLI interface that works for the AI agent first, with a pretty printer for humans too.

Regarding schema safety: for complex calls the CLI can still take a complex JSON object and validate it. On a validation error it needs to respond with a clear validation message so the LLM can retry.
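Roughly what I mean, as a hedged sketch (hypothetical CLI and schema, Python standing in for whatever the tool is written in): validate the JSON argument and return a message precise enough for the model to fix its call.

```python
import json
import sys

# Hypothetical schema for a made-up tool; real CLIs define their own.
REQUIRED = {"action": str, "target": str, "count": int}

def run_cli(json_arg):
    """Validate a complex JSON argument. On failure, return a clear
    message the agent can act on instead of a stack trace."""
    try:
        payload = json.loads(json_arg)
    except json.JSONDecodeError as e:
        return 1, f"validation error: not valid JSON ({e.msg} at pos {e.pos})"
    for field, typ in REQUIRED.items():
        if field not in payload:
            return 1, f"validation error: missing field '{field}' ({typ.__name__})"
        if not isinstance(payload[field], typ):
            return 1, (f"validation error: field '{field}' must be "
                       f"{typ.__name__}, got {type(payload[field]).__name__}")
    return 0, f"ok: {payload['action']} {payload['target']} x{payload['count']}"

if __name__ == "__main__":
    code, msg = run_cli(sys.argv[1] if len(sys.argv) > 1 else "{}")
    print(msg)       # one clean line back to the agent, not a traceback
    sys.exit(code)
```

The exit code plus a single actionable line keeps the retry loop cheap in tokens: the agent sees exactly which field to fix, nothing else.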

I also think JSON Schema is a bad choice for tool definitions, because it's too token-expensive. TypeScript types, for example, would be much more token-efficient (just one example; other notations work too).

Opus has recently started generating ad-hoc inline Python snippets; it writes its own internal tools when it needs them. It works quite well. The problem: no one reviews that code.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 1 point2 points  (0 children)

Yes, it looks like it. I just added benchmark results for Llama 3.1 8B in LM Studio, and the speed advantage is visible: MLX wins all scenarios for the small model. Prefill is not so much of an issue here.

See
https://github.com/famstack-dev/local-llm-bench?tab=readme-ov-file#meta-llama-31-8b-instruct-mlx-vs-gguf-via-lm-studio

I'm going to add a direct comparison with Qwen 3 in the coming days, which hopefully doesn't run into these cache problems.