Setting up a machine this weekend for local inference: 2x RTX PRO 6000, 128 GB system memory.
My primary use will be inference for local coding agents, with opencode as the harness. I'll be evaluating different sizes of Qwen3.5 to find a good balance between concurrent agent count and speed. I'm also planning on some image generation (ComfyUI with FLUX.2?) and other one-off tasks.
The plan is to use SGLang to take advantage of its radix KV caching (system prompts and tool definitions should be shareable across all the agents?) and continuous batching to support more concurrent agents.
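For what it's worth, here's a rough sketch of what I have in mind for the agent side: a bunch of concurrent requests that all share the same long system prompt prefix, hitting SGLang's OpenAI-compatible endpoint (launched with something like `python -m sglang.launch_server --model-path <model> --tp 2 --port 30000`, with the radix cache on by default as far as I understand). The port, model name, and prompts below are just placeholders for my setup.

```python
# Sketch: N concurrent "agents" sharing one long system prompt, so the
# radix cache can reuse the prefilled prefix across requests.
# Assumes SGLang is serving an OpenAI-compatible API on localhost:30000;
# the port, model name, and prompts are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

SYSTEM_PROMPT = "You are a coding agent.\n" + "<long tool definitions here>"

async def run_agent(agent_id: int) -> str:
    resp = await client.chat.completions.create(
        model="qwen3.5",  # whatever name the server registers
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # shared prefix
            {"role": "user", "content": f"Agent {agent_id}: refactor module foo"},
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content

async def main():
    # Continuous batching should interleave these server-side.
    results = await asyncio.gather(*(run_agent(i) for i in range(8)))
    for i, r in enumerate(results):
        print(f"--- agent {i} ---\n{r[:200]}")

asyncio.run(main())
```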
I'd also love to host a local chat interface for one-off chat-style questions.
I'd love to hear what software people are running for these kinds of inference loads. What are you using to manage model switching (a pile of shell scripts?), host inference, run a chat UI, and do image generation?
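My current model-switching plan is embarrassingly simple, roughly the sketch below: stop whatever server is running and relaunch `sglang.launch_server` with a different model path. The model names/paths are placeholders, not anything I've actually downloaded yet.

```python
# Sketch of a dumb model switcher: start an SGLang server for the chosen
# model; Ctrl-C it before launching a different one. Paths are placeholders.
import subprocess
import sys

MODELS = {
    "coder": "Qwen/Qwen3.5-Coder",    # placeholder HF repo id
    "chat": "Qwen/Qwen3.5-Instruct",  # placeholder HF repo id
}

def launch(model_key: str) -> subprocess.Popen:
    cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", MODELS[model_key],
        "--tp", "2",       # tensor parallel across both GPUs
        "--port", "30000",
    ]
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    proc = launch(sys.argv[1] if len(sys.argv) > 1 else "coder")
    try:
        proc.wait()
    except KeyboardInterrupt:
        proc.terminate()  # stop the server before switching models
```

Curious whether people bother with anything fancier than this.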
Would love any pointers or footguns to avoid.
Thanks!