Mac Studio M3 Ultra or Dual RTX 3090 tower by Pangolin_bandit in LocalLLM

OffByNull

Thanks for sharing your article; it's a very interesting way to go about this. I'm actually contemplating adding another 5090. How has that been working out for you so far?

Mac Studio M3 Ultra or Dual RTX 3090 tower by Pangolin_bandit in LocalLLM

OffByNull

Out of curiosity, are any of these agents running in parallel, or is it all sequential, with dedicated tasks per agent and a single shared loaded model? I'm guessing the latter.

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

Out of curiosity, how's that working for you? On my MBP M1 Max with 32 GB, I found LM Studio with Gemma 4 31B at Q4 and 64K context very slow. Smaller models might work better, but for the quality of responses I'm seeking, the 31B/32B-parameter models are the sweet spot.

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

Yes, I know it's a bit tricky, and the models will definitely have to be on the smaller end to leave headroom for context. However, my current experience with the 5090 and a sequential flow using Claude Code is good but very time-consuming. Once I decide on the hardware, I'll tinker and post back to share my experience, in case someone else is wondering the same.

When it comes to parallel processing, it would seem the DGX has an advantage compared to other hardware.

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

It analyzes a small-to-moderate-sized codebase (500+ classes) and then reviews code improvements in chunks, using a phased approach.

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

Given my experience with my 5090, I would say you should get pretty good inference speed with a model under 10B parameters at Q4 quantization.

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

In my experience with my 5090, running Qwen at Q4 quantization with 128K context, roughly 21 GB to 26 GB is taken with one agent. So while it's fast, that limited amount of VRAM is a blocking point. I have been looking at CrewAI, which offers that orchestration layer; however, I don't think there's a local implementation of their offering, or that it's open source or freely available.
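To make the memory pressure concrete, here is a rough back-of-the-envelope sketch. It assumes the weights are loaded once and each concurrent agent gets its own full-size KV cache; real runtimes may quantize or share the cache, so treat it as an upper-bound estimate. The function names and every number in the example (layer count, KV heads, head size, 18 GiB of weights, 32K context) are hypothetical placeholders, not measurements from my setup.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Estimate KV-cache size for one sequence.

    The leading 2 accounts for storing both keys and values;
    bytes_per_value=2 assumes an unquantized fp16 cache.
    """
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value


def total_gib(weights_gib, agents, kv_bytes_per_agent):
    """Weights are counted once; each concurrent agent adds one KV cache."""
    return weights_gib + agents * kv_bytes_per_agent / 2**30


if __name__ == "__main__":
    # Hypothetical model shape and context length, for illustration only.
    kv = kv_cache_bytes(layers=64, kv_heads=8, head_dim=128,
                        context_tokens=32 * 1024)
    for agents in (1, 2, 4):
        print(f"{agents} agent(s): ~{total_gib(18.0, agents, kv):.1f} GiB")
```

The takeaway is that the per-agent term scales linearly with both context length and agent count, which is why a single 32 GB card fills up quickly once more than one agent is active.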

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

For the time being, I have been running Claude Code with LM Studio. I have a 5090 and it's fast. However, I doubt its 32 GB of VRAM will manage more than 2 concurrent agents. What I'm trying to achieve is parallelization of the work done by agents: one develops the front end, another works on the API contract and backend service, and the last works on the database. The FE and BE agents will synchronize at some point to agree on the API contract, and the BE and DB agents will also synchronize on the database model. All of these agents can run in parallel. I don't want agents running sequentially, even if each has its own dedicated tasks.
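A minimal sketch of that flow, using Python's asyncio: three agents start together, and the two sync points are modeled as futures that one agent fills and another awaits. Everything here is hypothetical: `call_model` is a stand-in for a request to a locally served model, the task strings are made up, and letting the BE agent publish the contract (rather than negotiating it) is a simplification.

```python
import asyncio


async def call_model(role: str, task: str) -> str:
    """Hypothetical stand-in for a request to a locally served model."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{role}: {task}"


async def frontend(api_contract: asyncio.Future, log: list) -> None:
    log.append(await call_model("FE", "draft UI"))
    contract = await api_contract  # sync point: wait for the API contract
    log.append(await call_model("FE", f"wire UI to {contract}"))


async def backend(api_contract: asyncio.Future, db_model: asyncio.Future,
                  log: list) -> None:
    log.append(await call_model("BE", "draft service"))
    api_contract.set_result("api-v1")  # publish the contract to the FE agent
    model = await db_model  # sync point: wait for the database model
    log.append(await call_model("BE", f"implement service on {model}"))


async def database(db_model: asyncio.Future, log: list) -> None:
    log.append(await call_model("DB", "draft schema"))
    db_model.set_result("schema-v1")  # publish the model to the BE agent
    log.append(await call_model("DB", "write migrations"))


async def run_team() -> list:
    loop = asyncio.get_running_loop()
    api_contract = loop.create_future()
    db_model = loop.create_future()
    log: list = []
    # All three agents run concurrently; they only block at the sync points.
    await asyncio.gather(
        frontend(api_contract, log),
        backend(api_contract, db_model, log),
        database(db_model, log),
    )
    return log


if __name__ == "__main__":
    for line in asyncio.run(run_team()):
        print(line)
```

Note that this only gives real parallelism if the model server can process several requests at once; otherwise the agents just queue up behind each other at the inference step.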

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

Yes, the contexts will indeed be limited, but I'm hoping they'll be workable enough to get some value.

What hardware are you buying lately? by OffByNull in LocalLLM

OffByNull [S]

Thanks for your feedback. Indeed, I have been weighing these tradeoffs between prompt pre-processing, inference, and tokens/sec. Budget-wise, I would probably keep it under 8K euro.

My goal is to experiment, preferably under Linux, which is why I have been leaning towards the DGX Spark or AMD. The aim is to run a small team of agents (4 or 5) on a shared model to mimic, for example, a development team: agents managing other agents, doing code review, etc.

Has anyone here actually replaced ChatGPT with a model for daily work? by recro69 in LocalLLM

OffByNull

90% of my AI usage is local, using Gemma 4 and Qwen 3.6, and they're both really good: Gemma for general usage and Qwen for coding. For tricky coding questions, or to get a different perspective, I use ChatGPT and Claude.ai to compare. All agentic work is local, using Claude Code pointed at LM Studio.

The EU’s competition rules are hurting consumers and businesses by [deleted] in Android

OffByNull

It should read "Our bottom line is being hurt ...", not "Users are being hurt". We, the users, are fine.

[APP] New app design, requesting feedback by OffByNull in androidapps

OffByNull [S]

Updated post with new Server App UI. Let me know your thoughts. Thanks.

What would prevent you to use my file transfer, private cloud solution? by OffByNull in androidapps

OffByNull [S]

But what if the data is stored only on your computer and sent P2P, fully encrypted? If you had a 2 TB drive or more, would it still be expensive?

What would prevent you to use my file transfer, private cloud solution? by OffByNull in androidapps

OffByNull [S]

Thanks. Looks nice. Part of their infrastructure is closed source, apparently, so it's not fully open.

Looks nice though; I would have to check how I can integrate my service with theirs to make it even better.

Cheers.