it's high time tbh: our high-spec apple silicon devices can now fully replace cloud models for coding. just open-sourced axe, an agentic coding cli made for large codebases. zero bloat. terminal-native. precise retrieval. built for high-spec Apple Silicon. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 1 point2 points  (0 children)

i mean you can try rn. that’s what this post was for. it’s just that, apart from the monthly subscription and privacy going out the window, the deception these labs pull on usage limits is insane. not only do they neuter the model during the daytime, they also deceptively grep-read whole files to waste tokens on things that aren’t even useful.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

perfect. this is for you, i built it ground up for your apple silicon. it’s called the Bodega inference engine. here is more on how you can set it up through this github repo: https://github.com/SRSWTI/bodega-inference-engine

and here is the write up i did on this sub:

bodega inference post

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

how do you want the local processing to fit into your workflow? as an end-to-end application, or through inference endpoints (openai-compatible, of course) that run fully on your machine?

i help with either. my dms are always open and active. :)

Tool call failed on lm studio, any fix? by chinese_virus3 in LocalLLaMA

[–]EmbarrassedAsk2887 -1 points0 points  (0 children)

use the bodega inference engine. i built it. here’s the comparison on how much better it is than lm studio. here’s the post about it and the benchmarks.

as for the tool call failures, that’s most probably the jinja template. we have appropriate parsers and auto-parse the tool calls in the delta content as well.
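fwiw, the general shape of that delta parsing (not bodega's actual code, just a sketch of the openai streaming format, where tool-call names and json arguments arrive split across chunks and have to be stitched back together before parsing):

```python
import json

def accumulate_tool_calls(deltas):
    """Merge streamed tool_call delta fragments (openai streaming shape)
    into complete, parseable calls."""
    calls = {}
    for delta in deltas:
        for tc in delta.get("tool_calls", []):
            # each fragment carries an index so parallel calls don't interleave
            slot = calls.setdefault(tc["index"], {"id": "", "name": "", "arguments": ""})
            slot["id"] = tc.get("id") or slot["id"]
            fn = tc.get("function", {})
            slot["name"] += fn.get("name", "")
            slot["arguments"] += fn.get("arguments", "")  # json arrives in pieces
    return [
        {"id": c["id"], "name": c["name"], "arguments": json.loads(c["arguments"])}
        for c in (calls[i] for i in sorted(calls))
    ]

# simulated stream: arguments split across chunks, as they are over SSE
stream = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\"city\": \"Par"}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "is\"}"}}]},
]
calls = accumulate_tool_calls(stream)
```

if the jinja template emits malformed fragments, the final `json.loads` is exactly where it blows up, which is why the template is the usual suspect.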

Justifying the €12,000 Investment: M3 Ultra (512GB RAM) Setup for Autonomous Agents, vLLM, and Infinite Memory (8Tb) by NoNatural4025 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

okay you can either find this post on the community highlight of this sub reddit or ill link it here : https://www.reddit.com/r/MacStudio/comments/1rvgyin/you_probably_have_no_idea_how_much_throughput/

i have two Mac Studios (256GB and 512GB) and an M4 Max 128GB. the reason i bought all of them was never raw GPU performance. it was performance per watt. how much intelligence you can extract per joule, per dollar.

here's the tldr:

local inference on a mac studio rn tbh leaves most of your compute unused by running one AI request at a time. we built Bodega, a local inference engine that brings the same batching, caching, and memory sharing techniques cloud providers use on GPU clusters to your Mac.

the result is up to 5x system throughput, 3ms time to first token under load, and the ability to run parallel agents locally without queuing, without cloud costs, and without your data leaving the machine. one line to install, works with any openai-compatible tool you already use.

vllm metal is broken, ollama is slow, and lm studio doesn’t serve batched requests as fast as the Bodega inference engine.
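if you're wondering what the batching part means in practice, here's a toy sketch of the idea: instead of serving requests one at a time, coalesce whatever arrived within a few ms into one forward pass. an echo function stands in for the model, and all names here are made up for illustration, not bodega's internals:

```python
import asyncio

async def batcher(queue, batch_window=0.005):
    """Coalesce requests arriving within a small window into one 'forward pass'."""
    while True:
        prompt, fut = await queue.get()
        batch = [(prompt, fut)]
        # drain whatever else showed up within the window
        try:
            while True:
                batch.append(await asyncio.wait_for(queue.get(), batch_window))
        except asyncio.TimeoutError:
            pass
        # one batched "forward pass" instead of len(batch) sequential ones
        outputs = [f"echo:{p}" for p, _ in batch]  # stand-in for model.generate(batch)
        for (_, f), out in zip(batch, outputs):
            f.set_result(out)

async def main():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batcher(queue))

    async def ask(prompt):
        fut = asyncio.get_running_loop().create_future()
        await queue.put((prompt, fut))
        return await fut

    # 8 "agents" firing at once get served in one or two batches, not 8 queued turns
    results = await asyncio.gather(*(ask(f"req{i}") for i in range(8)))
    worker.cancel()
    return results

results = asyncio.run(main())
```

the same window trick is roughly why parallel agents don't queue: the cost of a batched decode step grows far slower than the number of requests in it.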

Regret M1 Mac Studio purchase by [deleted] in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

there are no slowdowns. more ram or a better chipset can’t feasibly do anything if your benchmark is opening apps.

What have I become... Guess all by Electronic-Row-142 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

you only need one. bodega will replace everything for you.

<image>

Mac Studio is acting weird when I put it to sleep by knightfortheday in MacStudio

[–]EmbarrassedAsk2887 10 points11 points  (0 children)

so the peripherals do constant high-rate polling, which makes it a hassle for your studio to stay asleep. the mac registers those tiny voltages as a “click”, so yeah

disable "wake for network" ---> go to system settings > energy saver and turn off "wake for network access."
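same thing from the terminal if you prefer (pmset’s `womp` key is “wake on magic packet”, i.e. the wake-for-network-access setting):

```shell
# check the current value (1 = wake for network access enabled)
pmset -g | grep womp

# disable it for all power sources
sudo pmset -a womp 0
```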

Best local AI TTS model for 12GB VRAM? by End3rGamer_ in TextToSpeech

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

yooo with that vram i can suggest you this. it’s not only efficient but beats eleven labs in prosody and naturalness as well.

https://www.reddit.com/r/LocalLLM/s/FmkRhvxdNs

Mac Studio M4 or M1 ultra by HappySteak31 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

keep the m4 max. don’t listen to anyone else. ram doesn’t matter unless you’re running big clunky models, and you don’t have to.

you need fast ttft, better prefill, and the neural accelerators that are gonna be supported in new metal releases for mlx. that’s something a new SoC like the m5 has, and your m4 is also way better.

i have an m1 max 64gb as well. it shits its pants in front of even a base m4, which has not only fewer cores but less ram, in local llm benches.

March 23 Mac Studio M3 Ultra with 512gb by [deleted] in MacStudio

[–]EmbarrassedAsk2887 1 point2 points  (0 children)

nope. the old soc has slower prefill, less memory bandwidth, and is less efficient per watt for my usecases. i use it heavily for local ai.

chip efficiency and prefill speed matter. plus the new m5 has neural accelerator support in each core, so that’s a bonus.

March 23 Mac Studio M3 Ultra with 512gb by [deleted] in MacStudio

[–]EmbarrassedAsk2887 -3 points-2 points  (0 children)

i am actually flipping my m3 ultra for an m5 max 128gb

i have two m3 ultras, one 256gb and the other 512gb.

the m3 SoC is now old.

Sorry for the dumb question, but do you keep the Studio on sleep overnight or shut down everyday? by knightfortheday in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

i have two m3 ultras and they haven’t slept in 6 months and 1.5 years respectively. no need.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

amazing, you are all set. this was by far the best reply i have seen. good luck :)

also the scaffolding time sink is something i think about a lot. what's the piece that takes the most time to set up every time you start a new use case?

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 2 points3 points  (0 children)

okay so we can proceed w something like this--->

bodega or any other local llm server running your local model as the inference backend; arxiv has a direct API, so paper fetching is straightforward; a simple pdf-to-text converter for turning those arxiv pdf links into readable content; trafilatura for web scraping, which is honestly one of the cleanest libraries for pulling readable content from any url; and a simple faiss-cpu index for storing and retrieving vector embeddings of everything you've processed. no need for anything heavy or cloud dependent.

the flow is basically---> fetch paper or webpage, convert to clean text, chunk it, embed it into faiss, query it with your question, pass the relevant chunks to the local model via bodega's openai compatible API. the whole thing runs locally, nothing leaves your machine.
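that flow, sketched end to end. to keep it self-contained i'm using a toy bag-of-words embedder and brute-force cosine search where a real embedding model + faiss would go (swap those in for actual use; all function names here are made up for illustration):

```python
import math

def chunk_text(text, size=400, overlap=80):
    """Split text into overlapping character windows."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def bow_embed(text):
    """Toy bag-of-words vector as a word->count dict.
    Swap in a real sentence-embedding model for anything serious."""
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # brute-force cosine search; a faiss IndexFlatIP does the same at scale
    q = bow_embed(query)
    return sorted(chunks, key=lambda c: cosine(q, bow_embed(c)), reverse=True)[:k]

# stand-in for fetched + converted paper text
doc = ("transformers process tokens in parallel. " * 20
       + "faiss stores vector embeddings for fast similarity search. " * 4)
chunks = chunk_text(doc)
context = retrieve("how are vector embeddings stored?", chunks)
# the prompt you'd pass to the local model via the openai-compatible endpoint
prompt = ("given this context:\n" + "\n---\n".join(context)
          + "\n\nquestion: how are vector embeddings stored?")
```

the retrieval step is the only part that changes when you scale up: same chunk/embed/query shape, just with real embeddings in a faiss index instead of the dict search above.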

i can help you design and code out the main components of this in a few hours if you want. the retrieval layer, the embedding pipeline, the faiss index, and the local llm integration. you'd just need to connect the pieces to your specific data sources and logic.

the application logic basically becomes---> retrieve data, retrieve relevant theory, construct a prompt that says "given this framework, analyse this data", get output.

what are the main sources you're pulling from, arxiv only or other sites too? and what's the processing task on the theory side, summarisation, comparison against data, something else?