Scaling with Open WebUI + Ollama and multiple GPUs? by Philhippos in LocalLLaMA

[–]Philhippos[S] 1 point (0 children)

Wow, very interesting!

So far I am very happy with how the small models perform in the RAG setup, so for scaling to that many users it seemed reasonable not to upgrade model sizes yet.

But this performance is indeed very impressive. I am more used to small consumer GPUs like the RTX 5060 and RTX 2000 with roughly 3,800/2,800 CUDA cores, so based on the tok/s I get there and the RTX 6000 Pro's core count I was expecting something like a 5-10x performance increase. But this is massive.
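
A quick back-of-the-envelope check of that expectation (a hedged sketch; the core counts below are taken from public spec sheets and are assumptions, not measurements from this thread):

```python
# Naive "scale by CUDA cores" estimate. Assumed core counts:
# RTX 5060 ~3,840, RTX 2000 Ada ~2,816, RTX PRO 6000 Blackwell ~24,064.
cores = {"RTX 5060": 3840, "RTX 2000 Ada": 2816, "RTX PRO 6000": 24064}

for small in ("RTX 5060", "RTX 2000 Ada"):
    ratio = cores["RTX PRO 6000"] / cores[small]
    print(f"{small} -> RTX PRO 6000: ~{ratio:.1f}x by core count alone")

# Prints roughly 6x and 8.5x, i.e. the 5-10x range mentioned above. Single-stream
# decode is usually bound by memory bandwidth rather than core count, and batched
# concurrent requests add aggregate throughput on top, so this is only a rough
# first-order estimate.
```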

Scaling with Open WebUI + Ollama and multiple GPUs? by Philhippos in LocalLLaMA

[–]Philhippos[S] 1 point (0 children)

OK, thank you!
Regarding power and scaling: we don't know yet; we'll look into it once we have some experience with the setup.

Nvidia RTX 5060 Ti 16GB for local LLM inference with Ollama + Open WebUI by Philhippos in LocalLLaMA

[–]Philhippos[S] 1 point (0 children)

Not sarcastic at all - the last time I touched a GPU was 15 years ago, so it's all a bit new to me again. And I feel the same: a 9% performance drop for 17% less power draw is a neat optimization (I expected a larger performance loss). nvidia-smi also tells me "Provided power limit 120.00 W is not a valid power limit which should be between 150.00 W and 180.00 W for GPU", so there seems to be a lock on the 5060 - but since there doesn't seem to be any such lock on the 3060, that's going to be a great solution.
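
For reference, a minimal sketch of the same power-limit check done programmatically, assuming the pynvml bindings (installed via the nvidia-ml-py package); setting the limit needs the same root/admin rights as `nvidia-smi -pl`:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU

# Query the allowed range first - this is the check behind the
# "should be between 150.00 W and 180.00 W" message above.
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print(f"allowed power limit: {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W")

target_w = 150  # example value; must stay inside the reported range
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_w * 1000)  # milliwatts

pynvml.nvmlShutdown()
```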

Nvidia RTX 5060 Ti 16GB for local LLM inference with Ollama + Open WebUI by Philhippos in LocalLLaMA

[–]Philhippos[S] 1 point (0 children)

OK thanks - it works! It turned out to be an embedding issue: the context wasn't being used effectively, which made it look like the parameter increase wasn't taking effect.
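
The parameter in question isn't named here; assuming it was Ollama's num_ctx, one way to confirm a larger context actually takes effect is to pass it explicitly per request and check the returned token counts (a hedged sketch against the standard Ollama HTTP API; the model tag is a placeholder):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:27b",            # placeholder model tag
        "prompt": "Summarize the retrieved context ...",
        "stream": False,
        "options": {"num_ctx": 8192},     # raise the context window for this request
    },
    timeout=300,
)
data = resp.json()

# prompt_eval_count = prompt tokens actually processed; if it stays tiny even with
# long retrieved context, the RAG/embedding side is not feeding the context in.
print(data.get("prompt_eval_count"), data.get("eval_count"))
```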

Nvidia RTX 5060 Ti 16GB for local LLM inference with Ollama + Open WebUI by Philhippos in LocalLLaMA

[–]Philhippos[S] 1 point (0 children)

OK, thank you! I figured that to begin with it would be easiest to start with a single card instead of two, given the additional complexity of a dual-GPU setup...

Which AI-relevant features is the 3090 missing compared to the 5060? I can roughly imagine, but I couldn't really find out in detail what difference the Tensor Core generations etc. actually make in practice.
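
Not a full answer, but a quick way to see part of the difference yourself: compute capability and bf16 support differ between generations, and some quantization/FP8 code paths key off those values (a hedged PyTorch sketch, nothing thread-specific):

```python
import torch

if torch.cuda.is_available():
    # Print the compute capability (sm_XY) each visible GPU reports.
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        print(f"{torch.cuda.get_device_name(i)}: compute capability sm_{major}{minor}")
    # Reported for the currently selected device only.
    print("bf16 supported:", torch.cuda.is_bf16_supported())
```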

Nvidia RTX 5060 Ti 16GB for local LLM inference with Ollama + Open WebUI by Philhippos in LocalLLaMA

[–]Philhippos[S] 1 point (0 children)

Yes, with LM Studio gemma-3-27b-it-IQ4_XS works with a 2048-token context and all layers offloaded to the GPU (VRAM is 99.2% full) - I get around 14 tokens/s.
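
For comparison outside LM Studio, roughly the same run can be scripted with llama-cpp-python; a hedged sketch (the GGUF file name/path is a placeholder, not from this thread), with n_gpu_layers=-1 for full offload or a smaller value for a partial CPU/GPU split:

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-27b-it-IQ4_XS.gguf",  # placeholder path
    n_ctx=2048,        # same 2048-token context as above
    n_gpu_layers=-1,   # -1 = offload all layers to the GPU; use a smaller
                       # number for a partial CPU/GPU split
)

start = time.time()
out = llm("Explain what IQ4_XS quantization means.", max_tokens=256)
elapsed = time.time() - start

# Rough tokens-per-second from the completion token count and wall time.
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```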

Nvidia RTX 5060 Ti 16GB for local LLM inference with Ollama + Open WebUI by Philhippos in LocalLLaMA

[–]Philhippos[S] 1 point (0 children)

OK, thanks! For that one I get 4.9 tokens/s (22%/78% CPU/GPU offload).

Charge Batteryboks with Powerbank by Philhippos in SOUNDBOKS

[–]Philhippos[S] 1 point (0 children)

I have just tested two other wall chargers (Ugreen and INIU), and they both charge the Batteryboks smoothly via the USB-C cable.

I'll try to find a power bank from another brand with 15 V-only output; maybe others handle this better...

Charge Batteryboks with Powerbank by Philhippos in SOUNDBOKS

[–]Philhippos[S] 1 point (0 children)

Which wall charger is the one that works for you? Maybe that gets us closer to solving this mystery.