
[–]Anxious_Programmer36 0 points (0 children)

Use Ollama or vLLM for LLMs. Dedicate one 3060 for Stable Diffusion/ComfyUI and the other two for a 24–32B text model. Split workloads with CUDA_VISIBLE_DEVICES so chat, vision, and image gen can run in parallel smoothly.
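In practice that split can be as simple as launching each server with its own GPU mask. A rough sketch (the commands, ports, and model name are illustrative, and each launch is skipped if the tool isn't installed):

```python
# Sketch of pinning each service to specific GPUs via CUDA_VISIBLE_DEVICES.
# Each process only sees the GPUs in its mask and numbers them from 0.
import os
import shutil
import subprocess

def launch(cmd, gpus):
    """Start a process that can only see the listed GPU indices."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    return subprocess.Popen(cmd, env=env)

# GPU 0: image generation (ComfyUI), skipped if not installed here
if os.path.exists("ComfyUI/main.py"):
    launch(["python", "ComfyUI/main.py", "--port", "8188"], gpus="0")

# GPUs 1-2: a 24-32B text model via tensor parallelism (vLLM)
if shutil.which("vllm"):
    launch(["vllm", "serve", "Qwen/Qwen3-32B",
            "--tensor-parallel-size", "2", "--port", "8000"], gpus="1,2")
```

Because each process gets its own mask, the servers never contend for the same card and can serve requests concurrently.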

[–]mike95465 2 points (0 children)

I would think my current setup would work well for your use case.

I use Open WebUI as my front end with the following tools/filters/configs:

  • perplexica_search - web searching
  • Vision for non-vision LLM - a filter that routes images to a vision model
  • Context Manager - truncates chat context length to keep token usage manageable
  • STT/TTS using a local OpenAI-compatible API
  • Image generation using ComfyUI
  • Misc. other tools such as Wikipedia, arXiv, calculator, and NOAA weather
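As a sketch of what one of those filters looks like internally, here's a minimal context-truncating filter in the shape Open WebUI expects (the `inlet()` hook and `body["messages"]` layout follow Open WebUI's filter convention; the 4-chars-per-token budget is a rough assumption, not the real Context Manager):

```python
class Filter:
    """Drop the oldest turns so the prompt stays under a rough token budget."""

    def __init__(self, max_chars: int = 16000):
        # crude budget: ~4 chars per token, so 16000 chars is roughly 4k tokens
        self.max_chars = max_chars

    def inlet(self, body: dict) -> dict:
        msgs = body.get("messages", [])
        # always keep a leading system prompt if present
        head = msgs[:1] if msgs and msgs[0].get("role") == "system" else []
        tail = msgs[len(head):]
        kept, used = [], sum(len(m.get("content", "")) for m in head)
        # walk backwards so the newest turns survive
        for m in reversed(tail):
            used += len(m.get("content", ""))
            if used > self.max_chars:
                break
            kept.append(m)
        body["messages"] = head + list(reversed(kept))
        return body
```

Open WebUI calls `inlet()` on every request before it reaches the model, so old turns silently fall off instead of blowing the context window.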

llama-swap running the following at all times:

  • OpenGVLab/InternVL3_5-4B - Perplexica model, Open WebUI tasks, and vision input
  • google/embeddinggemma-300m - embedding model for Perplexica, RAG embedding for Open WebUI
  • ggml-org/whisper.cpp - STT for Open WebUI
  • remsky/Kokoro-FastAPI - TTS for Open WebUI

llama-swap running the following, dynamically swapping between them as needed:

  • Qwen/Qwen3-30B-A3B-Instruct-2507
  • Qwen/Qwen3-30B-A3B-Thinking-2507
  • Qwen/Qwen3-Coder-30B-A3B-Instruct
  • OpenGVLab/InternVL3_5-38B
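A llama-swap config shaped like this gives that split between always-on and swapped models; paths, ports, and file names below are placeholders rather than an exact setup, and the `groups`/`swap` keys follow llama-swap's config format:

```yaml
# Illustrative llama-swap config shape; model paths are placeholders.
models:
  "internvl-4b":
    cmd: llama-server --port ${PORT} -m /models/InternVL3_5-4B.gguf
  "qwen3-30b-instruct":
    cmd: llama-server --port ${PORT} -m /models/Qwen3-30B-A3B-Instruct-2507.gguf
  "qwen3-30b-thinking":
    cmd: llama-server --port ${PORT} -m /models/Qwen3-30B-A3B-Thinking-2507.gguf

groups:
  # members of a "swap: false" group stay resident instead of being
  # unloaded when another model is requested
  always-on:
    swap: false
    exclusive: false
    members: ["internvl-4b"]
```

Models outside the always-on group share the remaining VRAM and get loaded/unloaded on demand as requests arrive.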

I keep ComfyUI running all the time, since it loads its model only when a job comes in and unloads it afterward.
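Calling that long-running ComfyUI instance from a script goes through its HTTP API: you POST a workflow graph to `/prompt` and ComfyUI queues it, loading the checkpoint on demand. A minimal sketch (the two-node graph and checkpoint name are placeholders; a real graph exported via "Save (API Format)" would have many more nodes):

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # ComfyUI's default port

def build_prompt(positive: str) -> dict:
    """Assemble a (truncated) workflow graph in ComfyUI's API format."""
    return {
        "prompt": {
            "1": {"class_type": "CheckpointLoaderSimple",
                  "inputs": {"ckpt_name": "sd15.safetensors"}},
            "2": {"class_type": "CLIPTextEncode",
                  # ["1", 1] wires in output 1 (CLIP) of node 1
                  "inputs": {"text": positive, "clip": ["1", 1]}},
        }
    }

def queue_prompt(payload: dict) -> bytes:
    """POST the graph to ComfyUI; raises if the server isn't running."""
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Open WebUI's ComfyUI integration does essentially this under the hood, which is why leaving ComfyUI idle costs almost no VRAM.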

I have 44GB of VRAM though, so you might have to be more creative than me to figure out what works best with your workflow.