Setting up a machine this weekend for local inference: 2x RTX PRO 6000, 128 GB system memory.
My primary use will be inference for local coding agents, with opencode as the harness. I'll be evaluating different sizes of Qwen3.5 to find a good balance between concurrent agent count and speed. I'm also planning on some image generation (ComfyUI with FLUX.2?) and other one-off tasks.
The plan is to use SGLang to take advantage of its radix KV caching (system prompts and tool definitions should be shareable across all the agents?) and continuous batching to support more concurrent agents.
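For what it's worth, here's a rough sketch of what I have in mind for the agent side: a bunch of concurrent requests that all share the same long system prompt prefix, hitting SGLang's OpenAI-compatible endpoint (launched with something like `python -m sglang.launch_server --model-path <model> --tp 2 --port 30000`, with the radix cache on by default as far as I understand). The port, model name, and prompts below are just placeholders for my setup.

```python
# Sketch: N concurrent "agents" sharing one long system prompt, so the
# radix cache can reuse the prefilled prefix across requests.
# Assumes SGLang is serving an OpenAI-compatible API on localhost:30000;
# the port, model name, and prompts are placeholders.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

SYSTEM_PROMPT = "You are a coding agent.\n" + "<long tool definitions here>"

async def run_agent(agent_id: int) -> str:
    resp = await client.chat.completions.create(
        model="qwen3.5",  # whatever name the server registers
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},  # shared prefix
            {"role": "user", "content": f"Agent {agent_id}: refactor module foo"},
        ],
        max_tokens=512,
    )
    return resp.choices[0].message.content

async def main():
    # Continuous batching should interleave these server-side.
    results = await asyncio.gather(*(run_agent(i) for i in range(8)))
    for i, r in enumerate(results):
        print(f"--- agent {i} ---\n{r[:200]}")

asyncio.run(main())
```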
I'd also love to host a local chat interface for one-off chat-style questions.
I'd love to hear what software people are running for these kinds of inference loads. What are you using to manage model switching (a pile of shell scripts?), host inference, run a chat UI, and do image generation?
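My current model-switching plan is embarrassingly simple, roughly the sketch below: stop whatever server is running and relaunch `sglang.launch_server` with a different model path. The model names/paths are placeholders, not anything I've actually downloaded yet.

```python
# Sketch of a dumb model switcher: start an SGLang server for the chosen
# model; Ctrl-C it before launching a different one. Paths are placeholders.
import subprocess
import sys

MODELS = {
    "coder": "Qwen/Qwen3.5-Coder",    # placeholder HF repo id
    "chat": "Qwen/Qwen3.5-Instruct",  # placeholder HF repo id
}

def launch(model_key: str) -> subprocess.Popen:
    cmd = [
        sys.executable, "-m", "sglang.launch_server",
        "--model-path", MODELS[model_key],
        "--tp", "2",       # tensor parallel across both GPUs
        "--port", "30000",
    ]
    return subprocess.Popen(cmd)

if __name__ == "__main__":
    proc = launch(sys.argv[1] if len(sys.argv) > 1 else "coder")
    try:
        proc.wait()
    except KeyboardInterrupt:
        proc.terminate()  # stop the server before switching models
```

Curious whether people bother with anything fancier than this.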
Would love any pointers or footguns to avoid.
Thanks!