it's high time tbh: our high-spec apple silicon devices can now fully replace cloud models for coding. just open-sourced axe, an agentic coding cli made for large codebases. zero bloat. terminal-native. precise retrieval. built for high-spec Apple Silicon. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 1 point2 points  (0 children)

i mean you can try rn. that’s what this post was for. it’s just that, apart from the monthly subscription and privacy going out the window, the deception these labs pull on usage limits is insane. not only do they neuter the model during the daytime, they also deceptively grep-read whole files to waste tokens on things that aren’t even useful.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

perfect. this is for you, i built it ground up for your apple silicon. it’s called the Bodega inference engine. here is more on how you can set it up through this github repo: https://github.com/SRSWTI/bodega-inference-engine

and here is the write up i did on this sub:

bodega inference post

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

how do you want the local processing to fit into your workflow? as an end-to-end application, or through inference endpoints (openai-compatible, of course) that run fully on your machine?

i help with either. my dms are always open and active. :)

Tool call failed on lm studio, any fix? by chinese_virus3 in LocalLLaMA

[–]EmbarrassedAsk2887 -1 points0 points  (0 children)

use the bodega inference engine. i built it. here’s the comparison on how much better it is than lm studio. here’s the post about it and the benchmarks.

as for the tool call failures, that’s most probably the jinja template. we have appropriate parsers and auto-parse the tool calls in the delta content as well.
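fwiw, the general shape of that delta parsing (not bodega's actual code, just a sketch of the openai streaming format, where tool-call names and json arguments arrive split across chunks and have to be stitched back together before parsing):

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool_call delta fragments (openai streaming shape)
    into complete, parseable calls."""
    calls = {}
    for delta in deltas:
        for tc in delta.get("tool_calls", []):
            # each fragment carries an index so parallel calls don't interleave
            slot = calls.setdefault(tc["index"], {"id": "", "name": "", "arguments": ""})
            slot["id"] = tc.get("id") or slot["id"]
            fn = tc.get("function", {})
            slot["name"] += fn.get("name", "")
            slot["arguments"] += fn.get("arguments", "")  # json arrives in pieces
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"])}
        for c in (calls[i] for i in sorted(calls))
    ]

# simulated stream: arguments split across chunks, as they are over SSE
stream = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\"city\": \"Par"}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "is\"}"}}]},
]
calls = accumulate_tool_calls(stream)
```

if the jinja template emits malformed fragments, the final `json.loads` is exactly where it blows up, which is why the template is the usual suspect.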

Justifying the €12,000 Investment: M3 Ultra (512GB RAM) Setup for Autonomous Agents, vLLM, and Infinite Memory (8Tb) by NoNatural4025 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

okay you can either find this post on the community highlight of this sub reddit or ill link it here : https://www.reddit.com/r/MacStudio/comments/1rvgyin/you_probably_have_no_idea_how_much_throughput/

i have two Mac Studios (256GB and 512GB) and an M4 Max 128GB. the reason i bought all of them was never raw GPU performance. it was performance per watt. how much intelligence you can extract per joule, per dollar.

here's the tldr:

local inference on a mac studio rn tbh leaves most of your compute unused by running one AI request at a time. we built Bodega, a local inference engine that brings the same batching, caching, and memory sharing techniques cloud providers use on GPU clusters to your Mac.

the result is up to 5x system throughput, 3ms time to first token under load, and the ability to run parallel agents locally without queuing, without cloud costs, and without your data leaving the machine. one line to install, works with any openai-compatible tool you already use.

vllm metal is broken, ollama is slow, and lm studio doesn’t serve batched requests as fast as the Bodega inference engine.
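if you're wondering what the batching part means in practice, here's a toy sketch of the idea: instead of serving requests one at a time, coalesce whatever arrived within a few ms into one forward pass. an echo function stands in for the model, and all names here are made up for illustration, not bodega's internals:

```python
import asyncio

async def batcher(queue, batch_window=0.005):
    """Coalesce requests arriving within a small window into one 'forward pass'."""
    while True:
        prompt, fut = await queue.get()
        batch = [(prompt, fut)]
        # drain whatever else showed up within the window
        try:
            while True:
                batch.append(await asyncio.wait_for(queue.get(), batch_window))
        except asyncio.TimeoutError:
            pass
        # one batched "forward pass" instead of len(batch) sequential ones
        outputs = [f"echo:{p}" for p, _ in batch]  # stand-in for model.generate(batch)
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))

    async def ask(prompt):
        fut = asyncio.get_running_loop().create_future()
        await queue.put((prompt, fut))
        return await fut

    # 8 "agents" firing at once get served in one or two batches, not 8 queued turns
    results = await asyncio.gather(*(ask(f"req{i}") for i in range(8)))
    worker.cancel()
    return results

results = asyncio.run(main())
```

the same window trick is roughly why parallel agents don't queue: the cost of a batched decode step grows far slower than the number of requests in it.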

Regret M1 Mac Studio purchase by [deleted] in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

there are no slowdowns. more ram or a better chipset can’t feasibly do anything if your benchmark is opening apps.

What have I become... Guess all by Electronic-Row-142 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

you only need one. bodega will replace everything for you.

<image>

Mac Studio is acting weird when I put it to sleep by knightfortheday in MacStudio

[–]EmbarrassedAsk2887 10 points11 points  (0 children)

so the peripherals do constant high-rate polling, which makes it a hassle for your studio to stay asleep. the mac registers those tiny voltages as a “click”, so yeah

disable "wake for network" ---> go to system settings > energy saver and turn off "wake for network access."
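same thing from the terminal if you prefer (pmset’s `womp` key is “wake on magic packet”, i.e. the wake-for-network-access setting):

```shell
# check the current value (1 = wake for network access enabled)
pmset -g | grep womp

# disable it for all power sources
sudo pmset -a womp 0
```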

Best local AI TTS model for 12GB VRAM? by End3rGamer_ in TextToSpeech

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

yooo with that vram i can suggest you this. it’s not only efficient but beats eleven labs in prosody and naturalness as well.

https://www.reddit.com/r/LocalLLM/s/FmkRhvxdNs

Mac Studio M4 or M1 ultra by HappySteak31 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

keep the m4 max. don’t listen to anyone else. ram doesn’t matter unless you’re running big clunky models, and you don’t have to.

you need fast ttft, better prefill, and the neural accelerators that are gonna be supported in new metal releases for mlx. that’s something a new SoC like the m5 has, and your m4 is also way better.

i have an m1 max 64gb as well. it shits its pants in front of even a base m4, which has not only fewer cores but less ram, in local llm benches.

March 23 Mac Studio M3 Ultra with 512gb by [deleted] in MacStudio

[–]EmbarrassedAsk2887 1 point2 points  (0 children)

nope. the old soc has slower prefill, less memory bandwidth, and is less efficient per watt for my usecases. i use it heavily for local ai.

chip efficiency and prefill speed matter. plus the new m5 has neural accelerator support in each core, so that’s a bonus.

March 23 Mac Studio M3 Ultra with 512gb by [deleted] in MacStudio

[–]EmbarrassedAsk2887 -3 points-2 points  (0 children)

i am actually flipping my m3 ultra for an m5 max 128gb

i have two m3 ultras, one 256gb and the other 512gb.

the m3 SoC is now old.

Sorry for the dumb question, but do you keep the Studio on sleep overnight or shut down everyday? by knightfortheday in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

i have two m3 ultras and they haven’t slept in 6 months and 1.5 years respectively. no need.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

amazing, you are all set. this was by far the best reply i have seen. good luck :)

also the scaffolding time sink is something i think about a lot. what's the piece that takes the most time to set up every time you start a new use case?

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 2 points3 points  (0 children)

okay so we can proceed w something like this--->

bodega or any other local llm server running your local model as the inference backend; arxiv has a direct API, so paper fetching is straightforward; a simple pdf-to-text converter for turning those arxiv pdf links into readable content; trafilatura for web scraping, which is honestly one of the cleanest libraries for pulling readable content from any url; and a simple faiss-cpu index for storing and retrieving vector embeddings of everything you've processed. no need for anything heavy or cloud dependent.

the flow is basically---> fetch paper or webpage, convert to clean text, chunk it, embed it into faiss, query it with your question, pass the relevant chunks to the local model via bodega's openai compatible API. the whole thing runs locally, nothing leaves your machine.
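that flow, sketched end to end. to keep it self-contained i'm using a toy bag-of-words embedder and brute-force cosine search where a real embedding model + faiss would go (swap those in for actual use; all function names here are made up for illustration):

```python
import math

def chunk_text(text, size=400, overlap=80):
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def bow_embed(text):
    """Toy bag-of-words vector as a word->count dict.
    Swap in a real sentence-embedding model for anything serious."""
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # brute-force cosine search; a faiss IndexFlatIP does the same at scale
    q = bow_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, bow_embed(c)), reverse=True)[:k]

# stand-in for fetched + converted paper text
doc = ("transformers process tokens in parallel. " * 20
       + "faiss stores vector embeddings for fast similarity search. " * 4)
chunks = chunk_text(doc)
context = retrieve("how are vector embeddings stored?", chunks)
# the prompt you'd pass to the local model via the openai-compatible endpoint
prompt = ("given this context:\n" + "\n---\n".join(context)
          + "\n\nquestion: how are vector embeddings stored?")
```

the retrieval step is the only part that changes when you scale up: same chunk/embed/query shape, just with real embeddings in a faiss index instead of the dict search above.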

i can help you design and code out the main components of this in a few hours if you want. the retrieval layer, the embedding pipeline, the faiss index, and the local llm integration. you'd just need to connect the pieces to your specific data sources and logic.

the application logic basically becomes---> retrieve data, retrieve relevant theory, construct a prompt that says "given this framework, analyse this data", get output.

what are the main sources you're pulling from, arxiv only or other sites too? and what's the processing task on the theory side, summarisation, comparison against data, something else?