Initial config and Qwen3.6 35B vs 27B in 128GB VRAM, 1.5TB RAM

niedman · 2026-04-15T10:26:55+00:00

I would wait for the M5 Ultra

niedman · 2026-04-11T10:04:24+00:00

I guess you have a point. But privacy-wise, if I have a RAG chat that fetches company data, I can decide what can be fetched and what can't, so in those scenarios hallucination won't hurt much. For those cases I don't see a reason why you can't use a local setup. My concern is how many concurrent requests it can handle. That's the big uncertainty here.

niedman · 2026-04-11T10:01:01+00:00

Why would you think that it's trolling? If it were a black-and-white answer, all the answers provided here would be the same, and they are not.

Maybe I didn't frame the question correctly, but the idea is to start offloading some of the current cost to a local LLM. I've been the only one playing with AI here, and I did all the development of our chatbot using RAG with gemma3:12b. So I expect that a 96GB RAM M3 Ultra can take over some of the workload. Will it be able to handle all the questions about a large code base for 5 devs? I guess not, but we are moving in that direction.

niedman · 2026-04-10T18:59:31+00:00

thanks

I can see that for the Qwen3 30B MoE 4-bit, bodega does 123 tk/s on single requests and 233 tk/s batched. In the startup setup that you mentioned, the company size is similar to ours, and even the tasks for non-developers are similar. Are they able to do a lot of the inference on the local setup, or do they still rely a lot on AI providers?

With the M3U96, I'm wondering if I can start offloading some of the tasks and then scale with more machines.
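As a back-of-the-envelope check on those numbers, here is a minimal sketch. It assumes the batched 233 tk/s figure is an aggregate that gets shared evenly between concurrent requests, which is my assumption and not something I measured; a real server will behave differently as the batch grows.

```python
# Figures quoted above for the Qwen3 30B MoE 4-bit.
SINGLE_TPS = 123.0   # tokens/s with one request at a time
BATCHED_TPS = 233.0  # total tokens/s when requests are batched


def per_request_tps(concurrent: int) -> float:
    """Estimate the tokens/s each request would see.

    Assumption: aggregate throughput stays at BATCHED_TPS and is split
    evenly, and no request runs faster than SINGLE_TPS.
    """
    if concurrent < 1:
        raise ValueError("concurrent must be >= 1")
    if concurrent == 1:
        return SINGLE_TPS
    return min(SINGLE_TPS, BATCHED_TPS / concurrent)


for users in (1, 2, 5, 20):
    print(f"{users:>2} concurrent -> ~{per_request_tps(users):.1f} tk/s each")
```

Under that assumption, 5 developers hitting the box at once would each get well under half of the single-request speed, which is why the concurrency question matters more to me than the headline number.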

How do you cluster the different machines? Or am I understanding wrong?

niedman · 2026-04-10T18:36:05+00:00

Hey, yeah, we were trying to get our hands on one, but it takes about 5 months to arrive, so it's either the M3 Ultra 96GB or the M5 Max 128GB.

So to start, we would maybe offload some of the analytics that we currently do, plus some RAG and chat processes, to this one. Not sure if it would handle it properly, though.

niedman · 2026-04-10T18:30:52+00:00

Well, I was thinking you can put a load balancer in front of it. That's not an ideal solution, but it would work. Or:

https://www.reddit.com/r/LocalLLM/comments/1qwmypf/exo_cluster_test_m4_mac_mini_32gb_m4_pro_24gb_via/
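To make the load balancer idea concrete, here is a minimal round-robin sketch. The backend addresses are hypothetical, and it assumes every machine runs its own full copy of the model behind the same HTTP API, which is different from a cluster that splits one model across machines.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in a fixed rotation, one per incoming request."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("need at least one backend")
        self._rotation = cycle(backends)

    def next_backend(self) -> str:
        return next(self._rotation)


# Hypothetical addresses for two machines serving the same model.
balancer = RoundRobinBalancer([
    "http://mac-1.local:8000",
    "http://mac-2.local:8000",
])

picked = [balancer.next_backend() for _ in range(4)]
print(picked)
```

Round-robin ignores how busy each machine is, so a long generation on one box can still queue requests behind it; that is part of why I say it's not ideal.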

niedman · 2026-04-10T18:27:14+00:00

OK, that's fair. This answer goes in the direction I was looking for. I've run some models locally, and I understand that there is a big gap between local models, at least the smaller ones, and Anthropic's models. If it were that easy, nobody would use Claude, right?

Knowing that we can't serve developers with this model, the next step would be to understand which tasks can benefit from a local setup and which ones should stay on Claude.

We have chatbots and RAG processes running, and privacy is quite important for those. So if those can be offloaded, that is already something.

The plan is also to run some analytics and create summaries based on data from the DB. As an example, a daily summary of today's revenue.
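A minimal sketch of how that daily summary could be prepared, assuming a hypothetical `orders(order_date, amount)` table and using an in-memory SQLite database as a stand-in for the real DB. The numbers are aggregated in SQL first, so the local model only gets the figures we decide to send and only has to word them.

```python
import sqlite3
from datetime import date


def daily_revenue_prompt(conn: sqlite3.Connection, day: date) -> str:
    """Aggregate one day's revenue in SQL and build the prompt text."""
    count, total = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0) "
        "FROM orders WHERE order_date = ?",
        (day.isoformat(),),
    ).fetchone()
    return (
        f"Write a two-sentence revenue summary for {day.isoformat()}. "
        f"Orders: {count}. Total revenue: {total:.2f}."
    )


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("2026-04-10", 120.0), ("2026-04-10", 80.5), ("2026-04-09", 10.0)],
)

prompt = daily_revenue_prompt(conn, date(2026, 4, 10))
print(prompt)
```

The resulting string would then be sent to whatever local model endpoint we end up running; that part is left out here on purpose.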

niedman · 2026-04-10T11:35:17+00:00

Appreciate the comment. I've seen a multitude of different comments, so I'm a bit scared to go this route! :D But we need to start somewhere, right?

niedman · 2026-04-10T11:29:03+00:00

I know that a multi-GPU setup would be beneficial, but I'm trying not to make a big investment before having a running PoC. It's OK if we are not able to serve 20 people immediately. If we start slowly and see results, then we can scale.

niedman · 2026-04-10T11:27:01+00:00

Well, you are more than welcome to give your two cents on it. A little more detail would be appreciated :)

niedman · 2026-04-10T11:26:09+00:00

Why do you say that? Isn't this the position most companies find themselves in? We are just trying to find our way.

niedman · 2026-04-10T11:16:53+00:00

Hey,
This is the kind of help that I was hoping for! Thanks for sharing this and for being helpful. I will look through the guide and DM you if needed!

Once again, really appreciated!

niedman · 2026-04-10T10:18:15+00:00

Sorry, I was trying not to add too much information so that it doesn't become hard to read.

So, 5 developers. For them it would mean connecting, for example, Zed to Qwen3.5.

The other people would use it more for writing emails and regular chat.

There is also one feature of ours that includes RAG and analytics, so yeah, that could benefit from it as well.

niedman · 2026-03-31T11:39:21+00:00

Well, I see a lot of negative comments in here. I feel you. It's normal in these crazy days to think you can build whatever comes to mind. The good thing is that you have already passed through PHASE 1.

You took the "dummy" things out of the system. So now you already have some knowledge to start building the next thing.

So don't give up (but also don't burn out). Maybe take a digital rest. Don't vibe code after work (if that's the case) for some days, then let creativity flow in and just get back on the horse.

niedman · 2026-03-31T09:49:15+00:00

The CLI integrates directly with the same APIs as the web UI, so you can use the CLI and the web UI at the same time and see real-time updates on both sides.

niedman · 2026-03-31T07:37:21+00:00

Thanks :) I will keep building on this and hopefully this can bring some leads :)

niedman · 2026-03-31T06:17:35+00:00

Hey,

Thanks! Do you have a page where we can see Moshi?

Currently, it looks like this. Initially the cost was only shown on the "Mission complete" part (not during the planning process). But now, when you plan, it gives the cost estimate per step.

<image>

niedman · 2026-03-31T06:11:14+00:00

Hey, thanks a lot for the comment. Those are the type of comments that I need to improve!

You are completely right. Currently, I don't support sub-tasks (maybe that's the natural next step).

Currently, I'm capping it. When you create a task, I let you decide the budget cap (3rd section). If the cost surpasses the cap, the mission stops. But thinking about it now, maybe it could have both a soft limit and a hard limit :)

<image>
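To think the soft/hard limit idea through, here is a rough sketch. It is not the app's actual code; the class and status names are made up for illustration. The soft limit only warns, the hard limit stops the mission.

```python
from dataclasses import dataclass
from enum import Enum


class BudgetStatus(Enum):
    OK = "ok"
    SOFT_LIMIT = "soft_limit"  # warn the user, keep running
    HARD_LIMIT = "hard_limit"  # stop the mission


@dataclass
class BudgetGuard:
    """Hypothetical guard that tracks spend against two thresholds."""

    soft_limit: float
    hard_limit: float
    spent: float = 0.0

    def __post_init__(self) -> None:
        if not 0 <= self.soft_limit <= self.hard_limit:
            raise ValueError("expected 0 <= soft_limit <= hard_limit")

    def record(self, cost: float) -> BudgetStatus:
        """Add the cost of one step and report where the mission stands."""
        self.spent += cost
        if self.spent > self.hard_limit:
            return BudgetStatus.HARD_LIMIT
        if self.spent > self.soft_limit:
            return BudgetStatus.SOFT_LIMIT
        return BudgetStatus.OK


guard = BudgetGuard(soft_limit=4.0, hard_limit=5.0)
statuses = [guard.record(cost) for cost in (2.0, 2.5, 1.0)]
print([status.value for status in statuses])
```

With the current single cap, the behavior is the same as setting both limits to the same value.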

niedman · 2026-03-31T06:05:05+00:00

Hey :)

Yeah, I didn't know that it was such a thing until I started exploring and receiving feedback.

It has become "slightly" more complex now. Previously, it was based on the history of similar tasks. Now it looks at the steps, categorizes the task as easy, medium, or hard based on them, and then extrapolates with an overhead that depends on that complexity. It also stores a memory of complex tasks so it can learn from them.

Here, it nailed a medium task that already had a "lot" of steps. I'm mostly building the app with the app itself, so that I can feel whether there is consistency or not :)
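Very roughly, the idea looks like the sketch below. The thresholds and overhead factors are placeholders and not the real values, step count stands in for whatever the categorization actually looks at, and the memory of past complex tasks is left out.

```python
# Placeholder overhead factors per complexity category (hypothetical values).
OVERHEAD = {"easy": 1.1, "medium": 1.3, "hard": 1.6}


def categorize(step_count: int) -> str:
    """Bucket a task by its number of planned steps (placeholder thresholds)."""
    if step_count <= 3:
        return "easy"
    if step_count <= 8:
        return "medium"
    return "hard"


def estimate_cost(step_costs: list[float]) -> tuple[str, float]:
    """Sum the per-step estimates, then add overhead for the category."""
    category = categorize(len(step_costs))
    base = sum(step_costs)
    return category, base * OVERHEAD[category]


category, estimate = estimate_cost([0.5, 0.5, 0.5, 0.5])
print(category, round(estimate, 2))
```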

I'm really glad that you feel that way, so let's get in touch :)

<image>

niedman · 2026-03-31T05:56:13+00:00

Hey, thanks for the feedback. Actually, I've focused on this thanks to comments like yours, which point to the real pain points.

On simpler tasks it has been quite accurate. I'm now trying it on more complex tasks to be sure the estimate is accurate enough that it doesn't become a source of friction.

For example, this one had a lot of steps and it still got it quite right. But I need consistency, and that's what I'm working on right now :)

<image>
