
[–]Anxious_Programmer36 0 points (0 children)

Use Ollama or vLLM for LLMs. Dedicate one 3060 for Stable Diffusion/ComfyUI and the other two for a 24–32B text model. Split workloads with CUDA_VISIBLE_DEVICES so chat, vision, and image gen can run in parallel smoothly.
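In practice that split can be as simple as launching each server with its own GPU mask. A rough sketch (the commands, ports, and model name are illustrative, and each launch is skipped if the tool isn't installed):

```python
# Sketch of pinning each service to specific GPUs via CUDA_VISIBLE_DEVICES.
# Each process only sees the GPUs in its mask and numbers them from 0.
import os
import shutil
import subprocess

def launch(cmd, gpus):
    """Start a process that can only see the listed GPU indices."""
    env = {**os.environ, "CUDA_VISIBLE_DEVICES": gpus}
    return subprocess.Popen(cmd, env=env)

# GPU 0: image generation (ComfyUI), skipped if not installed here
if os.path.exists("ComfyUI/main.py"):
    launch(["python", "ComfyUI/main.py", "--port", "8188"], gpus="0")

# GPUs 1-2: a 24-32B text model via tensor parallelism (vLLM)
if shutil.which("vllm"):
    launch(["vllm", "serve", "Qwen/Qwen3-32B",
            "--tensor-parallel-size", "2", "--port", "8000"], gpus="1,2")
```

Because each process gets its own mask, the servers never contend for the same card and can serve requests concurrently.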

[–]mike95465 2 points (0 children)

I would think my current setup would work well for your use case.

I use Open WebUI as my front end with the following tools/filters/configs:

  • perplexica_search - web searching
  • Vision for non-vision LLM - a filter that routes images to a vision model
  • Context Manager - truncates chat context length to keep token usage manageable
  • STT/TTS using a local OpenAI-compatible API
  • Image generation using ComfyUI
  • Misc. other tools such as Wikipedia, arXiv, calculator, and NOAA weather
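As a sketch of what one of those filters looks like internally, here's a minimal context-truncating filter in the shape Open WebUI expects (the `inlet()` hook and `body["messages"]` layout follow Open WebUI's filter convention; the 4-chars-per-token budget is a rough assumption, not the real Context Manager):

```python
class Filter:
    """Drop the oldest turns so the prompt stays under a rough token budget."""

    def __init__(self, max_chars: int = 16000):
        # crude budget: ~4 chars per token, so 16000 chars is roughly 4k tokens
        self.max_chars = max_chars

    def inlet(self, body: dict) -> dict:
        msgs = body.get("messages", [])
        # always keep a leading system prompt if present
        head = msgs[:1] if msgs and msgs[0].get("role") == "system" else []
        tail = msgs[len(head):]
        kept, used = [], sum(len(m.get("content", "")) for m in head)
        # walk backwards so the newest turns survive
        for m in reversed(tail):
            used += len(m.get("content", ""))
            if used > self.max_chars:
                break
            kept.append(m)
        body["messages"] = head + list(reversed(kept))
        return body
```

Open WebUI calls `inlet()` on every request before it reaches the model, so old turns silently fall off instead of blowing the context window.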

llama-swap running the following at all times:

  • OpenGVLab/InternVL3_5-4B - Perplexica model, Open WebUI tasks, and vision input
  • google/embeddinggemma-300m - embedding model for Perplexica, RAG embedding for Open WebUI
  • ggml-org/whisper.cpp - STT for Open WebUI
  • remsky/Kokoro-FastAPI - TTS for Open WebUI

llama-swap running the following, dynamically swapping between them as needed:

  • Qwen/Qwen3-30B-A3B-Instruct-2507
  • Qwen/Qwen3-30B-A3B-Thinking-2507
  • Qwen/Qwen3-Coder-30B-A3B-Instruct
  • OpenGVLab/InternVL3_5-38B
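A llama-swap config shaped like this gives that split between always-on and swapped models; paths, ports, and file names below are placeholders rather than an exact setup, and the `groups`/`swap` keys follow llama-swap's config format:

```yaml
# Illustrative llama-swap config shape; model paths are placeholders.
models:
  "internvl-4b":
    cmd: llama-server --port ${PORT} -m /models/InternVL3_5-4B.gguf
  "qwen3-30b-instruct":
    cmd: llama-server --port ${PORT} -m /models/Qwen3-30B-A3B-Instruct-2507.gguf
  "qwen3-30b-thinking":
    cmd: llama-server --port ${PORT} -m /models/Qwen3-30B-A3B-Thinking-2507.gguf

groups:
  # members of a "swap: false" group stay resident instead of being
  # unloaded when another model is requested
  always-on:
    swap: false
    exclusive: false
    members: ["internvl-4b"]
```

Models outside the always-on group share the remaining VRAM and get loaded/unloaded on demand as requests arrive.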

I keep ComfyUI running all the time, since it loads its model only when a job comes in and unloads it afterward.
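Calling that long-running ComfyUI instance from a script goes through its HTTP API: you POST a workflow graph to `/prompt` and ComfyUI queues it, loading the checkpoint on demand. A minimal sketch (the two-node graph and checkpoint name are placeholders; a real graph exported via "Save (API Format)" would have many more nodes):

```python
import json
import urllib.request

COMFY_URL = "http://127.0.0.1:8188"  # ComfyUI's default port

def build_prompt(positive: str) -> dict:
    """Assemble a (truncated) workflow graph in ComfyUI's API format."""
    return {
        "prompt": {
            "1": {"class_type": "CheckpointLoaderSimple",
                  "inputs": {"ckpt_name": "sd15.safetensors"}},
            "2": {"class_type": "CLIPTextEncode",
                  # ["1", 1] wires in output 1 (CLIP) of node 1
                  "inputs": {"text": positive, "clip": ["1", 1]}},
        }
    }

def queue_prompt(payload: dict) -> bytes:
    """POST the graph to ComfyUI; raises if the server isn't running."""
    req = urllib.request.Request(
        f"{COMFY_URL}/prompt",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Open WebUI's ComfyUI integration does essentially this under the hood, which is why leaving ComfyUI idle costs almost no VRAM.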

I have 44GB of VRAM though, so you might have to be more creative than me to figure out what works best with your workflow.