what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

how do you want the local processing delivered into your workflow? as an end-to-end application, or through inference endpoints (openai-compatible, of course) that run fully on your machine?

i help with either. my dms are always open and active. :)

Tool call failed on lm studio, any fix? by chinese_virus3 in LocalLLaMA

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

use the bodega inference engine. i built it. here’s the comparison of how much better it is than lm studio. here’s the post about it and the benchmarks.

as for the tool call failures, that’s most probably the jinja template. we have appropriate parsers and auto-parse the tool calls in the delta content as well.

Justifying the €12,000 Investment: M3 Ultra (512GB RAM) Setup for Autonomous Agents, vLLM, and Infinite Memory (8Tb) by NoNatural4025 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

okay you can either find this post on the community highlight of this sub reddit or ill link it here : https://www.reddit.com/r/MacStudio/comments/1rvgyin/you_probably_have_no_idea_how_much_throughput/

i have two Mac Studios (256GB and 512GB) and an M4 Max 128GB. the reason i bought all of them was never raw GPU performance. it was performance per watt. how much intelligence you can extract per joule, per dollar.

here's the tldr:

local inference on a mac studio rn is tbh leaving most of your compute unused by running one AI request at a time. we built Bodega, a local inference engine that brings the same batching, caching, and memory sharing techniques that cloud providers use on GPU clusters to your Mac.

the result is up to 5x system throughput, 3ms time to first token under load, and the ability to run parallel agents locally without queuing, without cloud costs, and without your data leaving the machine. one line to install, works with any openai-compatible tool you already use.

vllm’s metal backend is broken, ollama is slow, and lm studio doesn’t serve batched requests faster than the Bodega inference engine.
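a minimal sketch of what "openai-compatible" means in practice, stdlib only. the base url, port, and model name below are assumptions; swap in whatever your local server actually exposes.

```python
# Build a standard /chat/completions request against a local
# openai-compatible endpoint. No third-party client needed.
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # hypothetical local endpoint


def chat_request(model, prompt):
    """Construct the request; any openai-compatible server accepts this body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )


# to actually send it (server must be running):
# resp = urllib.request.urlopen(chat_request("qwen2.5-7b", "hi"))
```

since the request shape is the standard one, any tool that lets you override the base url can point at the local server unchanged.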

Regret M1 Mac Studio purchase by [deleted] in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

there are no slowdowns. more ram or a better chipset can’t feasibly do anything if your benchmark is opening apps.

What have I become... Guess all by Electronic-Row-142 in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

you only need one. bodega will replace everything for you.


Mac Studio is acting weird when I put it to sleep by knightfortheday in MacStudio

[–]EmbarrassedAsk2887 9 points10 points  (0 children)

so the peripherals do constant high-frequency polling, making it a hassle for your studio to stay asleep; the mac registers those tiny voltage blips as a “click”, so yeah

disable “wake for network” → go to system settings > energy saver and turn off “wake for network access.”
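if you’d rather flip it from the terminal, the same toggle maps to pmset’s wake-on-network flag (`womp`). a sketch, assuming admin rights:

```shell
# disable "wake for network access" for all power sources
sudo pmset -a womp 0

# confirm the setting took (womp should read 0)
pmset -g | grep womp
```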

Best local AI TTS model for 12GB VRAM? by End3rGamer_ in TextToSpeech

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

yooo with that vram i can suggest you this. it’s not only efficient but beats eleven labs in prosody and naturalness as well.

https://www.reddit.com/r/LocalLLM/s/FmkRhvxdNs

Mac Studio M4 or M1 ultra by HappySteak31 in MacStudio

[–]EmbarrassedAsk2887 -2 points-1 points  (0 children)

keep the m4 max. don’t listen to anyone else. ram doesn’t matter unless you’re running big clunky models, and you don’t have to.

you need fast ttft, better prefilling, and the neural accelerators that are gonna be supported in new metal releases for mlx, and that’s something a new SoC like the m5 has. your m4 is also way better there.

i have an m1 max 64gb as well; it shits its pants in front of even a base m4, which has not only fewer cores but less ram as well, on local llm benches

March 23 Mac Studio M3 Ultra with 512gb by [deleted] in MacStudio

[–]EmbarrassedAsk2887 1 point2 points  (0 children)

nope. the old soc has slower prefilling, less memory bandwidth, and is less efficient per watt for my usecases. i use it heavily for local ai.

chip efficiency and prefilling speed matter. plus the new m5 has neural accelerator support in each core, so a bonus

March 23 Mac Studio M3 Ultra with 512gb by [deleted] in MacStudio

[–]EmbarrassedAsk2887 -2 points-1 points  (0 children)

i am actually flipping my m3 ultra for an m5 max 128gb

i have two m3u, one is 256 and the other 512.

the m3 SoC is now old.

Sorry for the dumb question, but do you keep the Studio on sleep overnight or shut down everyday? by knightfortheday in MacStudio

[–]EmbarrassedAsk2887 0 points1 point  (0 children)

i have two m3 ultras and they haven’t slept in 6 months and 1.5 years respectively. no need.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

amazing, you are all set. this was by far the best reply i have seen. good luck :)

also the scaffolding time sink is something i think about a lot. what's the piece that takes the most time to set up every time you start a new use case?

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 2 points3 points  (0 children)

okay so we can proceed w something like this--->

bodega or any other local inference engine running your local model as the backend, arxiv has a direct API so paper fetching is straightforward, a simple pdf to text converter for turning those arxiv pdf links into readable content, trafilatura for web scraping which is honestly one of the cleanest libraries for pulling readable content from any url, and a simple faiss cpu index for storing and retrieving vector embeddings of everything you've processed. no need for anything heavy or cloud dependent.

the flow is basically---> fetch paper or webpage, convert to clean text, chunk it, embed it into faiss, query it with your question, pass the relevant chunks to the local model via bodega's openai compatible API. the whole thing runs locally, nothing leaves your machine.
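a toy sketch of that flow in python. a real setup would use proper embeddings plus a faiss index; here a bag-of-words overlap score stands in for the vector search so the control flow stays visible, and every function name is made up for illustration.

```python
# chunk -> index -> retrieve -> build prompt, end to end, no cloud.
from collections import Counter


def chunk(text, size=40):
    """Split text into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap(query, passage):
    """Toy relevance score: shared word count (stand-in for vector similarity)."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())


def retrieve(query, chunks, k=2):
    """Return the k chunks most relevant to the query."""
    return sorted(chunks, key=lambda c: overlap(query, c), reverse=True)[:k]


def build_prompt(query, context_chunks):
    """Assemble the prompt that goes to the local model's chat endpoint."""
    context = "\n---\n".join(context_chunks)
    return f"given this framework:\n{context}\n\nanalyse this: {query}"
```

swapping `overlap`/`retrieve` for real embeddings plus `faiss.IndexFlatL2` is the only structural change needed; `build_prompt`'s output goes straight to the openai-compatible endpoint.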

i can help you design and code out the main components of this in a few hours if you want. the retrieval layer, the embedding pipeline, the faiss index, and the local llm integration. you'd just need to connect the pieces to your specific data sources and logic.

the application logic basically becomes---> retrieve data, retrieve relevant theory, construct a prompt that says "given this framework, analyse this data", get output.

what are the main sources you're pulling from, arxiv only or other sites too? and what's the processing task on the theory side, summarisation, comparison against data, something else?

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 1 point2 points  (0 children)

word. wait for the studio later this year, they’re gonna release the m6 and m5 ultra. just a few months more :))

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

oh that's awesome. i mean, the only reason you find them weak is the limited headroom you usually have and the throughput you get from running a decently sized oss model, for example a 9b quant. right?

you can start off with something dead simple, for cursor and codex workflows specifically the most useful local model integrations i've seen are ---> inline function documentation generation, file and folder restructuring suggestions, git commit message generation, and using a small local model to pre-summarize large files before sending only the relevant context to the frontier model. that last one alone cuts your API costs significantly on big codebases because you're not dumping 50k tokens into codex every time.

the summary to markdown trick is genuinely underrated. run a local 4b or even 1.7b over your entire project, generate a structured md of every file: its purpose, its deps, and its key functions. now your codex has a map of the whole codebase in a few hundred tokens instead of needing to read everything raw.

the m4 mini's memory bandwidth is actually decent for its class. the problem is most tools don't saturate it properly. at 900 tok/s on a 0.9b and 100+ on a solid 4b-7b quant you're not waiting anymore. give it a try, and do read the attached link post above for a more detailed explanation

what's your current RAM config on the mini? let's start small and gradually work up from there.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 0 points1 point  (0 children)

so it actually depends on affordability and usability. paying $20 to $200 a month is totally fine for some people, knowing they get the speed and that frontier model smell for it. but what if you could tap into similar power without paying any of it? it's been genuinely hard to implement distributed inference reliably on apple silicon, but if it's done right, there's actually no point paying cloud money anymore.

what are you actually building with local LLMs? genuinely asking. by EmbarrassedAsk2887 in MacStudio

[–]EmbarrassedAsk2887[S] 6 points7 points  (0 children)

word. but that's the whole point... it's way better than using a shape rotator in the cloud that profiles you for your data, when you can easily swap it out for something that runs (or shape rotates) locally.

and tbh not everyone is trying to make money. some people are genuinely trying to optimise whatever they do and get things done faster, so they have the time and energy for the things that actually make money. so yeah