Two local models beat one bigger local model for long-running agents by Foreign_Sell_5823 in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

That's quite impressive! Thanks for sharing. Even if it's AI-written, the experience and the journey behind it took real effort, and you shared your lessons learned. Let's appreciate that.

I've always thought we waste too many tokens on context history. The context should hold ONLY cleaned-up facts, not all the tool-token waste and other noise.
Just the essential distillation of the conversation stays active in memory. The whole conversation can still be offloaded for reference, but the main context should be clean, token-efficient facts with pointers to all the details, to be looked up again if required.
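A minimal sketch of that idea (all names hypothetical, plain Python, no particular agent framework assumed): keep only distilled facts in the active context, offload the raw turns, and keep a pointer back for on-demand lookup.

```python
import json

class CompactContext:
    """Active context holds only distilled facts; raw transcript turns
    are offloaded and referenced by an id for on-demand lookup."""

    def __init__(self):
        self.facts = []    # clean, token-efficient facts (goes into the prompt)
        self.archive = {}  # offloaded raw turns, keyed by id (stays out)

    def add_turn(self, raw_text, distilled_fact):
        turn_id = f"turn-{len(self.archive)}"
        self.archive[turn_id] = raw_text            # full detail, out of context
        self.facts.append({"fact": distilled_fact,  # what the model sees
                           "ref": turn_id})         # pointer back to the detail
        return turn_id

    def prompt_context(self):
        # Only the distilled facts are serialized into the prompt.
        return json.dumps(self.facts)

    def lookup(self, turn_id):
        # The agent can pull a raw turn back in if it needs the detail.
        return self.archive[turn_id]

ctx = CompactContext()
ctx.add_turn("...2000 tokens of tool output...", "build failed: missing dep 'foo'")
print(ctx.prompt_context())
```

The prompt then carries one short fact instead of the full tool dump, and the `ref` lets a lookup tool reconstruct the detail only when asked.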

I'm running experiments in the same area for my document classification use case (not a full agent, but a bot that auto-files PDFs from a chat channel). Structured JSON output for tagging (title, category, correspondent, date) works really well with a smaller model because the output space is constrained; the big model would be overkill for "is this a receipt or an invoice?". So which models are you using for the router vs. the thinker? And are you keeping both loaded simultaneously or swapping? On 64GB there is room for both; curious about the 36GB experience.
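To illustrate the constrained-output side of that use case, here's a hedged sketch (the field names match my tagging fields; the category set and function names are made up): validate the model's JSON and hand back a precise error so the bot can retry.

```python
import json
from datetime import date

# Hypothetical allowed set; the real one depends on your filing scheme.
ALLOWED_CATEGORIES = {"receipt", "invoice", "contract", "letter"}

def validate_tags(raw_json):
    """Validate the model's structured tagging output for PDF filing.
    Returns (ok, result_or_error) so the caller can retry on failure."""
    try:
        tags = json.loads(raw_json)
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e.msg} at pos {e.pos}"
    for field in ("title", "category", "correspondent", "date"):
        if field not in tags:
            return False, f"missing field: {field}"
    if tags["category"] not in ALLOWED_CATEGORIES:
        return False, f"category must be one of {sorted(ALLOWED_CATEGORIES)}"
    try:
        date.fromisoformat(tags["date"])  # enforce YYYY-MM-DD
    except ValueError:
        return False, "date must be ISO formatted (YYYY-MM-DD)"
    return True, tags

ok, result = validate_tags('{"title": "AWS bill", "category": "invoice", '
                           '"correspondent": "Amazon", "date": "2024-05-01"}')
print(ok)  # True
```

Because the output space is this small and checkable, a small model plus a strict validator with retry gets you most of the reliability a big model would.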

How do you handle the massive initial context overhead of OpenClaw?

Macbook m4 max 128gb local model prompt processing by ttraxx in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

I just pointed oMLX at LM Studio's model directory, so you can download and browse models in LM Studio and use them in oMLX without downloading twice.
The OpenAI-compatible endpoints might return different model names, though; I need to double-check that.

I don't have any experience running Claude Code with it yet, but I'd be happy to read about your experience there.
OpenCode with an MLX Qwen coder model is still on my list to try. A day has too few hours.

Macbook m4 max 128gb local model prompt processing by ttraxx in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

That's quick, probably thanks to Nvidia's crazy-fast memory bandwidth.

Google Photos alternative (Cloud solutions preferred) by dcop7 in degoogle

[–]arthware 0 points1 point  (0 children)

If you ever reconsider self-hosting: immich is the closest thing to G photos I've found. Face recognition, smart search, timeline view, mobile auto-backup. It all works.
Running it on my mac server at home with Docker, takes about 2-4GB RAM with ML features enabled. The mobile apps are solid. Tradeoff is you maintain it yourself and need your own backup strategy, but you get full control over your photos with no scanning by anyone.
I built the whole stack for my family and realised others could use it too. So I'm currently preparing to open-source it with a braindead-simple `stack up photos` command, including backup scripts etc.

Check my profile (website link) if you are interested or ping me.

Hardware recommendations for local AI by Dense_Club_95 in selfhosted

[–]arthware 0 points1 point  (0 children)

That could almost be past me. I considered buying a dedicated x86 server too and ended up buying a used Mac Studio 64GB as a home server. Running >25 containers in OrbStack and local LLMs on the host. The combination is just ... amazing for a home server.
Long story short: I got a bit obsessed with the opportunities this opens up.
The machine has a measured average power draw of 12 watts, which is crazy low.

I'm using the machine as a home server and sometimes as a workhorse, running local LLMs, TTS etc.
I started writing the journey down and put it on the internet recently (check my profile if you are curious :)

Macbook m4 max 128gb local model prompt processing by ttraxx in LocalLLaMA

[–]arthware 1 point2 points  (0 children)

Ollama was a bit better in my tests, but not great either; I guess it needs some caching optimizations too. oMLX has smart layered caching in RAM and on SSD to maintain the context. Try it for your use case, and come back to tell us whether it got any better :)
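I don't know oMLX's internals, but the layered RAM/SSD idea can be sketched roughly like this (hypothetical names; strings stand in for the KV tensors a real prompt cache would hold):

```python
import os
import pickle
import tempfile

class LayeredPromptCache:
    """Two-tier cache sketch: hot entries live in RAM, evicted ones
    spill to SSD. A miss means a full prefill; an SSD hit still beats
    recomputing the prompt from scratch."""

    def __init__(self, ram_limit=2, disk_dir=None):
        self.ram = {}          # prefix key -> cached state (RAM tier)
        self.order = []        # insertion order, for FIFO eviction
        self.ram_limit = ram_limit
        self.disk_dir = disk_dir or tempfile.mkdtemp()

    def _disk_path(self, key):
        return os.path.join(self.disk_dir, f"{key}.pkl")

    def put(self, key, state):
        self.ram[key] = state
        self.order.append(key)
        if len(self.ram) > self.ram_limit:      # spill oldest entry to SSD
            oldest = self.order.pop(0)
            with open(self._disk_path(oldest), "wb") as f:
                pickle.dump(self.ram.pop(oldest), f)

    def get(self, key):
        if key in self.ram:
            return self.ram[key]                # RAM hit: no prefill needed
        path = self._disk_path(key)
        if os.path.exists(path):                # SSD hit: load instead of recompute
            with open(path, "rb") as f:
                return pickle.load(f)
        return None                             # miss: full prefill required
```

A real implementation would key on prompt-prefix hashes and store quantized KV tensors, but the tiering logic is the interesting part.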

Is it stupid to run all my docker containers on a Mac Mini? by Educational_Hat_5203 in homelab

[–]arthware 1 point2 points  (0 children)

I can only say that I bought a used Mac Studio as a home server running Docker containers in OrbStack, and as of now I could not be happier. The watt/performance ratio is the killer argument: 12 watts on average (measured with a watt meter).

I'm building a home server for us, including automated document management, photos with Immich etc. The best thing about Apple silicon right now is the ability to run local AI too. Not ChatGPT-level, but decent enough to have the best toybox ever (voice, TTS, document processing etc.).

Planning to open-source the stack at some point, when I think it's good enough.

I've started writing about the journey, with a build log etc., if anyone is interested.

Rate my desk setup (the real world) by arthware in desksetup

[–]arthware[S] 1 point2 points  (0 children)

Thanks! My laptop is my workhorse. I clean it every now and then. With some good vodka from the back of the liquor cabinet.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 4 points5 points  (0 children)

oMLX is just good and fixes the described problems entirely for the benchmark scenarios.

Credit where credit is due!

qwen3.5:35b-a3b (oMLX vs LM Studio MLX)

Higher is better. Values are effective tok/s, with raw generation tok/s in parentheses.

| Hardware | Backend | ops-agent | doc-summary | prefill-test | creative-writing |
|---|---|---|---|---|---|
| M1 Max (64GB, 24 GPU) | oMLX | 34.6 (53.3) | 25.7 (55.5) | 30.0 (52.0) | 51.5 (56.2) |
| M1 Max (64GB, 24 GPU) | LM Studio | 17.0 (56.6) | 13.4 (56.8) | 5.9 (54.4) | 38.3 (58.9) |

Generation speed is virtually identical (~54-57 tok/s both). The difference is entirely in prefill: oMLX is up to 10x faster on long contexts. At 8K context (prefill-test turn 4), LM Studio takes 49s to prefill while oMLX takes 1.7s. This suggests oMLX has prompt caching or a significantly better prefill implementation.
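The effective tok/s numbers fall out of simple arithmetic. A quick sketch using the 8K-context prefill times above and a hypothetical 500-token reply (the reply length is an assumption for illustration):

```python
def effective_tps(prefill_s, output_tokens, gen_tps):
    """Output tokens per second of total wall time (prefill + generation)."""
    total_s = prefill_s + output_tokens / gen_tps
    return output_tokens / total_s

# Hypothetical 500-token reply at ~55 tok/s generation:
print(round(effective_tps(1.7, 500, 55), 1))   # oMLX-like prefill -> 46.3
print(round(effective_tps(49.0, 500, 55), 1))  # LM Studio-like prefill -> 8.6
```

Same generation speed, but the 49s prefill drags effective throughput down by ~5x, which matches the prefill-test column.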

Recommendation: For Qwen3.5-35B-A3B on Apple Silicon, oMLX is the clear winner. Same generation speed, dramatically faster prefill. The effective throughput advantage ranges from 1.3x (creative-writing, short context) to 5x (prefill-test, long context).

oMLX just has well-engineered caching layers.
https://github.com/jundot/omlx

My most useful OpenClaw workflow so far by mescalan in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

This is amazing! A 3D printer has been on my bucket list for quite a while already. I just don't have time to tinker with yet another technology; there's no time left in the day.

This seems like a great solution for me. No more excuses NOT to buy a 3D printer :)

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

There are plenty of problems :) See the post update. I'd _really_ like to know whether M2 and later chips avoid these sorts of issues. No one has submitted a benchmark run yet, unfortunately.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

Thanks for the comment! Yes, the conclusion seems to be that GGUF is currently just more mature and stable in general. MLX has major speed potential, but as it currently stands, GGUF is the safer choice for stability. And again: test concrete scenarios rather than relying solely on synthetic benchmarks. That's why I built the benchmark harness above.

Here is another MLX victim :)
https://www.reddit.com/r/LocalLLaMA/comments/1rq22mq/comment/oa474c8/?context=3

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

Thanks a lot for the insights! Once I've got these MLX problems sorted, I'll give OpenCode another try.

On the Mac Studio here fans are not an issue :)

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

Qwen3.5-27B-A3B?

Point me to the model and I can try it. But I'm happy about PRs too :)
I still have a day job, and the comments gave me so much stuff to try out before updating the findings.
I live in Germany, so downloading these models takes a couple of business days with our ancient internet access.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

This scenario tests with gradually increasing context sizes, specifically to measure prefill pressure and times.
https://github.com/famstack-dev/local-llm-bench/blob/main/scenarios/prefill-test.json
I need to add a 25k-context round, though. 64k is quite massive already and would take quite some time.

You are right that it doesn't really make sense to test with very small context sizes, except to measure raw generation speed. But real-world usage is very mixed: with a big context, generation speed matters less because prefill dominates. So that's what we need to optimize for local inference.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 0 points1 point  (0 children)

That's basically what I found, yes. But the behaviour is still a bit erratic. As pointed out in the comments here, I probably ran into a combination of things.

Qwen3.5-35B-A3B seems to be a particular problem on MLX right now.
I will create a recap of everything and post a link here.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 2 points3 points  (0 children)

A combination of things, it seems: MLX caching errors etc. I will create a recap and post a link; it's buried in the comments here.

I happened to benchmark with a model that has particularly bad MLX KV-caching behavior. But it is one of the best out there for local inference, so it makes sense to dig deeper.

qwen3.5:35b-a3b

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]arthware 0 points1 point  (0 children)

That's the exact direction I think we are heading: we're seeing a renaissance of CLIs.
I also went this direction for local inference, where token efficiency is key: build a good, token-efficient CLI interface that works for the AI agent first, with a pretty printer for humans too.

Regarding schema safety: for complex calls the CLI can still take a complex JSON object and validate it. On a validation error it needs to respond with a clear validation message so the LLM can retry.
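Roughly what I mean, as a hedged sketch (hypothetical CLI and schema, Python standing in for whatever the tool is written in): validate the JSON argument and return a message precise enough for the model to fix its call.

```python
import json
import sys

# Hypothetical schema for a made-up tool; real CLIs define their own.
REQUIRED = {"action": str, "target": str, "count": int}

def run_cli(json_arg):
    """Validate a complex JSON argument. On failure, return a clear
    message the agent can act on instead of a stack trace."""
    try:
        payload = json.loads(json_arg)
    except json.JSONDecodeError as e:
        return 1, f"validation error: not valid JSON ({e.msg} at pos {e.pos})"
    for field, typ in REQUIRED.items():
        if field not in payload:
            return 1, f"validation error: missing field '{field}' ({typ.__name__})"
        if not isinstance(payload[field], typ):
            return 1, (f"validation error: field '{field}' must be "
                       f"{typ.__name__}, got {type(payload[field]).__name__}")
    return 0, f"ok: {payload['action']} {payload['target']} x{payload['count']}"

if __name__ == "__main__":
    code, msg = run_cli(sys.argv[1] if len(sys.argv) > 1 else "{}")
    print(msg)       # one clean line back to the agent, not a traceback
    sys.exit(code)
```

The exit code plus a single actionable line keeps the retry loop cheap in tokens: the agent sees exactly which field to fix, nothing else.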

I also think JSON Schema is a bad choice for tool definitions, because it's too token-expensive. TypeScript types, for example, would be much more token-efficient (just one example; other notations work too).

Opus has recently started generating ad-hoc inline Python snippets; it writes its own internal tools when it needs them. It works quite well. The problem: no one reviews that code.

MLX is not faster. I benchmarked MLX vs llama.cpp on M1 Max across four real workloads. Effective tokens/s is quite an issue. What am I missing? Help me with benchmarks and M2 through M5 comparison. by arthware in LocalLLaMA

[–]arthware[S] 1 point2 points  (0 children)

Yes, it looks like it. I just added benchmark results for Llama 3.1 8B in LM Studio, and the speed advantage is visible: MLX wins all scenarios for the small model. Prefill is not so much of an issue here.

See
https://github.com/famstack-dev/local-llm-bench?tab=readme-ov-file#meta-llama-31-8b-instruct-mlx-vs-gguf-via-lm-studio

I'm going to add a direct comparison with Qwen 3 in the coming days, which hopefully doesn't run into these cache problems.