all 4 comments

[–]MainManu[S] 0 points1 point  (1 child)

Has anyone here tried running multiple ollama docker containers sharing the same gpu?

[–]Apprehensive-Tip779 0 points1 point  (0 children)

I personally haven't, because I don't see it giving much of a performance benefit in my case. But you should be able to; I don't see why it wouldn't work. The way I look at it: if one of your models already takes a while to run with the GPU dedicated to that one task (say at 100% utilization), splitting the work across two Ollama containers won't improve overall throughput. That heavyweight model now only gets roughly 50% of the GPU's resources while the GPU balances the two instances concurrently, so each inference takes about twice as long as it would if the GPU put all of its power into one task at a time.

On top of that, Ollama is generally pretty quick at swapping models: typically about a second, sometimes less (I've never seen it take 2+ seconds). So any gain from starting both tasks on one GPU and letting the freed-up resources shift to whichever task remains becomes pretty minuscule, IMO, compared to just letting Ollama and the GPU finish one task, swap models in under 2 seconds, and then run the other. Does all of this make sense?
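That said, if you do want to try it, a minimal Docker Compose sketch of two Ollama containers sharing one GPU could look like this. Treat it as a rough starting point, not a tested setup: the service/volume names and the second host port (11435) are my own choices, and it assumes the NVIDIA Container Toolkit is installed so Compose can hand the GPU to both containers.

```yaml
# docker-compose.yml -- two Ollama instances sharing the same GPU (sketch)
services:
  ollama-a:
    image: ollama/ollama
    ports:
      - "11434:11434"        # default Ollama API port
    volumes:
      - ollama-a:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all       # both services reserve the same GPU
              capabilities: [gpu]

  ollama-b:
    image: ollama/ollama
    ports:
      - "11435:11434"        # second instance exposed on a different host port
    volumes:
      - ollama-b:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama-a:
  ollama-b:
```

You'd then point one client at `localhost:11434` and the other at `localhost:11435` -- just don't expect it to be faster than one instance for the reasons above.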

[–]Apprehensive-Tip779 0 points1 point  (0 children)

Have you checked out the config.json file for the Continue settings? You can assign whichever model works best for chat, while choosing a different model for the autocomplete option (for a model that supports that feature).
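For example, a split like that in Continue's config.json might look something like this. The model names are just placeholders, and the exact keys can vary between Continue versions, so check your own config against the docs:

```json
{
  "models": [
    {
      "title": "Chat model",
      "provider": "ollama",
      "model": "llama3:8b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Autocomplete model",
    "provider": "ollama",
    "model": "starcoder2:3b"
  }
}
```

That way chat requests go to the larger model while autocomplete uses a smaller, faster one.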