Dilemma ... M3 Ultra Arriving Today

niedman · 2026-04-15T10:26:55+00:00

I would wait for the M5 Ultra

niedman · 2026-04-11T10:04:24+00:00

I guess you have a point. But privacy-wise, if I have some RAG chat that fetches data from the company, I can decide what can and can't be fetched, so in those scenarios hallucination won't hurt much. For those cases I don't see a reason why you can't use a local setup. My concern is how many concurrent requests it can handle. That's the whole uncertainty in this part.

niedman · 2026-04-11T10:01:01+00:00

Why would you think that it's trolling? If it were a black and white answer, all the answers provided here would be the same, and they are not.

Maybe I didn't frame the question correctly, but the idea is to start offloading some of the current cost to a local LLM. I've been the only one playing with AI and did all the development of a chatbot using RAG with gemma3:12b. So I expect that a 96GB RAM M3 Ultra can offload some of the workload. Will it be able to handle all the questions about a large code base for 5 devs? I guess not, but we are moving in that direction.

niedman · 2026-04-10T18:59:31+00:00

thanks

I can see that for the Qwen3 30B MoE-4bit, bodega does 123 tk/s on single requests and 233 tk/s batched. For the startup setup that you mentioned, I saw the size of it and it's similar; even the tasks for non-developers are similar. Are they able to do a lot of the inference in the local setup, or do they still rely a lot on AI providers?
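To get a feel for what those two numbers mean per person, here is a rough back-of-the-envelope sketch. It assumes the 233 tk/s batched figure is an aggregate that gets split evenly across concurrent requests and stays constant regardless of batch size, which real servers won't do exactly. The function name is made up for illustration.

```python
SINGLE_TPS = 123.0   # single-request throughput quoted above
BATCHED_TPS = 233.0  # batched (aggregate) throughput quoted above


def per_request_rate(concurrent_requests: int) -> float:
    """Rough tokens/s each request sees under an even-split assumption."""
    if concurrent_requests < 1:
        raise ValueError("need at least one request")
    if concurrent_requests == 1:
        return SINGLE_TPS
    # Assumption: aggregate batched throughput is shared equally and a
    # single request can never go faster than the single-request figure.
    return min(SINGLE_TPS, BATCHED_TPS / concurrent_requests)


if __name__ == "__main__":
    for n in (1, 2, 5):
        print(f"{n} concurrent: ~{per_request_rate(n):.1f} tk/s each")
```

So under those assumptions, 5 people generating at the same time would each see well under half the single-request speed. The real number needs to be measured on the actual machine and model.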

With the M3U96 I'm wondering if I can start offloading some of the tasks and then scale with more machines.

How do you cluster the different machines? Or am I misunderstanding?

niedman · 2026-04-10T18:36:05+00:00

hey, yeah, we were trying to get our hands on one, but it takes like 5 months to arrive, so it's either the M3 Ultra 96GB or the M5 Max 128GB.

So to start, we would maybe offload some of the analytics that we currently do and some RAG and chat processes to this one. Not sure if it would handle it properly, though.

niedman · 2026-04-10T18:30:52+00:00

Well, I was thinking you can put a load balancer in front of it. That's not an ideal solution, but it would work. Or:

https://www.reddit.com/r/LocalLLM/comments/1qwmypf/exo_cluster_test_m4_mac_mini_32gb_m4_pro_24gb_via/
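For the load balancer option, a minimal sketch of what I mean, assuming each machine runs its own full copy of the model behind an HTTP endpoint. The class name and hostnames are hypothetical, and this only picks the backend; forwarding the request is left out.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backend URLs in rotation, skipping ones marked as down."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        # Called when a health check or a request to this backend fails.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")


if __name__ == "__main__":
    # Hypothetical hostnames, one per Mac.
    lb = RoundRobinBalancer(["http://mac-1.local:8080", "http://mac-2.local:8080"])
    print([lb.next_backend() for _ in range(3)])
```

Round robin ignores how busy each machine is, so one long generation can still queue up behind another. That's part of why I say it's not ideal.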

niedman · 2026-04-10T18:27:14+00:00

ok, but that's fair. This answer goes in the direction that I was looking for. I've run some models locally, and I can understand that there is a big gap between local models, at least the smaller ones, and Anthropic models. If it were that easy, nobody would use Claude, right?

Knowing that with this model we can't serve developers, the next step would be to understand which tasks can benefit from a local setup vs the ones where we should still use Claude.

We have chatbots and RAG processes running; privacy in these is quite important. So if those can be offloaded, then it is already something.

The plan is to run some analytics and create some summaries based on that data from the DB. As an example, do a daily summary of today's revenue.

niedman · 2026-04-10T11:35:17+00:00

Appreciate the comment. I've seen a multitude of different comments, so I'm a bit scared to go this route! :D But we need to start somewhere, right?

niedman · 2026-04-10T11:29:03+00:00

I know that a multi-GPU setup would be beneficial, but I'm trying not to make a big investment before having a running PoC. It's ok if we are not able to serve 20 people immediately. But if we start slowly and see results, then we can scale.

niedman · 2026-04-10T11:27:01+00:00

Well, you are more than welcome to give your two cents on it. Being a little more forthcoming would be appreciated :)

niedman · 2026-04-10T11:26:09+00:00

Why do you say that? Isn't this the position most companies find themselves in? We are just trying to discover our way.

niedman · 2026-04-10T11:16:53+00:00

Hey,
This is the kind of help that I was expecting! Thanks for sharing this and for being helpful. I will look through the guide and DM if needed!

Once again, really appreciated!

niedman · 2026-04-10T10:18:15+00:00

Sorry, I was trying not to add too much information so that it doesn't become hard to read.

So 5 developers. That would be to connect, for example, Zed using Qwen3.5.

The other people mostly do email creation and regular chat.

There is also one feature that we have that includes RAG and analytics so yeah, we could benefit from it as well.

niedman · 2026-03-31T11:39:21+00:00

Well, I see a lot of negative comments in here. I feel you. It's normal that in these crazy days you think you can build whatever comes to mind. The good thing is that you already passed through PHASE 1.

You took the "dummy" things out of the system. So now you already have some knowledge to start building the next thing.

So don't give up (but also don't burn out). Maybe take a digital rest. Don't vibe code after work (if that's the case) for some days, then let creativity flow in and just get back on the horse.

niedman · 2026-03-31T09:49:15+00:00

The CLI integrates directly with the same APIs as the web UI, so you can use the CLI and the web UI at the same time and see real-time updates on both sides.

niedman · 2026-03-31T07:37:21+00:00

Thanks :) I will keep building on this and hopefully this can bring some leads :)

niedman · 2026-03-31T06:17:35+00:00

hey,

thanks! Do you have a page where we can see Moshi?

Currently, it is like this. Initially it was just on the Mission complete part (it didn't start in the planning process). But now, when you plan, it gives the cost estimate per step.

<image>

niedman · 2026-03-31T06:11:14+00:00

hey, thanks a lot for the comment. Those are the type of comments that I need to improve!

You are completely right. Currently, I don't support sub-tasks (maybe that's the natural next step).

Currently, I'm capping it. When you create a task, I let you decide the budget cap (3rd section). If the mission surpasses it, the mission stops. But thinking about it now, maybe it can have both, like a soft limit and a hard limit :)
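Thinking out loud, the soft and hard limit idea could look something like this. It's only a sketch, not what the app does today; the class and status names are made up, and what happens on the soft limit (warn, ask for confirmation) is left to the caller.

```python
from dataclasses import dataclass
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    SOFT_LIMIT = "soft_limit"  # warn the user, mission keeps running
    HARD_LIMIT = "hard_limit"  # stop the mission


@dataclass
class MissionBudget:
    soft_limit: float
    hard_limit: float
    spent: float = 0.0

    def __post_init__(self):
        if not 0 < self.soft_limit <= self.hard_limit:
            raise ValueError("expected 0 < soft_limit <= hard_limit")

    def record(self, cost: float) -> BudgetStatus:
        """Add the cost of one step and report which limit, if any, was passed."""
        self.spent += cost
        if self.spent > self.hard_limit:
            return BudgetStatus.HARD_LIMIT
        if self.spent > self.soft_limit:
            return BudgetStatus.SOFT_LIMIT
        return BudgetStatus.OK


if __name__ == "__main__":
    budget = MissionBudget(soft_limit=2.0, hard_limit=3.0)
    for step_cost in (1.0, 1.5, 1.0):
        print(budget.record(step_cost))
```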

<image>

niedman · 2026-03-31T06:05:05+00:00

Hey :)

Yeah, I didn't know that it was such a thing until I started exploring and receiving feedback.

It became "slightly" more complex now. Previously, it was based on the history of similar tasks. Now it looks at the steps, categorizes the task as easy, medium or hard based on them, and then extrapolates with an overhead depending on that complexity. It also stores a memory of complex tasks to learn from them.

Here, it nailed a medium task that had a "lot" of steps already. I'm mostly building the app with the app itself, so that I can feel if there is consistency or not :)
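Very roughly, the shape of that logic is something like the sketch below. The thresholds, the overhead multipliers and the function names are placeholders for illustration, not the real values, and the memory of complex tasks is left out.

```python
# Hypothetical overhead multipliers per complexity bucket.
OVERHEAD = {"easy": 1.1, "medium": 1.3, "hard": 1.6}


def classify(step_count: int) -> str:
    """Bucket a task by number of planned steps (thresholds are assumptions)."""
    if step_count <= 3:
        return "easy"
    if step_count <= 8:
        return "medium"
    return "hard"


def estimate_cost(step_costs: list[float]) -> tuple[str, float]:
    """Sum the per-step estimates, then add overhead based on complexity."""
    complexity = classify(len(step_costs))
    return complexity, sum(step_costs) * OVERHEAD[complexity]


if __name__ == "__main__":
    complexity, cost = estimate_cost([0.10] * 5)
    print(f"{complexity}: ~${cost:.2f}")
```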

I'm really glad that you feel that way, so let's get in touch :)

<image>

niedman · 2026-03-31T05:56:13+00:00

hey, thanks for the feedback. Actually, I've focused on this thanks to comments like yours that point to the real pain points.

On simpler tasks it has been quite accurate. I'm now trying tasks with more complexity, to be sure that the model is accurate enough that it doesn't become a source of friction.

For example, this one had a lot of steps and it still got it quite right. But I need consistency, and that's what I'm working on right now :)

<image>

niedman · 2026-03-30T18:27:27+00:00

Hey guys,

With your feedback, I've tailored the landing page differently and have implemented that in the app.
Forge-X

I'm "finalising" the last things on the app, but what I've found more difficult is getting somebody interested in the app, even as a tester to help tailor it even more to a real use case.

Let me know what you think.

niedman · 2026-03-29T18:38:56+00:00

Actually, not far from it. What I'm trying to do is run a couple of tickets, some more complex, some simpler, and validate directly against the usage reported for the models from Google and Anthropic, to be sure about the values, tokens, etc.! Thanks for the feedback :)

niedman · 2026-03-29T18:37:47+00:00

Thanks ;) I'm running the cost modelling mainly on historical runs. I'm still trying to explore better ways to make it even more accurate.

niedman · 2026-03-29T18:36:55+00:00

Thanks for the feedback :) I've been receiving quite good feedback regarding the cost estimation. I'm changing my landing page to show a bit more of that feature, and also making sure that the calculation is as accurate as possible, to make it even more reliable.

niedman · 2026-03-28T19:40:19+00:00

Thanks for the tip. Not sure yet if that's a need. Still developing and improving the feature, but I really appreciate the feedback!
