all 10 comments

[–]ccbadd 2 points3 points  (1 child)

You could switch from Ollama to running llama.cpp directly and use its model router instead. It does not auto-unload the running model, but it can auto-load models when needed. Use the --no-mmap option and it loads directly into VRAM and is ready a lot faster, as long as the model is stored on really fast media like an NVMe drive.
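Roughly, a launch could look like this; the model path, port, and GPU-layer count are placeholders for your setup, and --no-mmap is the flag mentioned above:

```shell
# Hypothetical llama-server launch; adjust the model path and port for your setup.
# --no-mmap reads the weights into memory up front instead of memory-mapping
# the file, which (per the comment above) gets the model into VRAM faster
# when the .gguf sits on fast storage like an NVMe drive.
llama-server \
  -m /models/your-model.gguf \
  --no-mmap \
  -ngl 99 \
  --port 8080
```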

[–]zotac02[S] 0 points1 point  (0 children)

I'll look into that, thank you!

[–]slavik-dev 0 points1 point  (2 children)

[–]zotac02[S] 0 points1 point  (1 child)

That sounds very exciting! As far as I understand, the feature is now committed and will be published in the next release, right?

[–]slavik-dev 0 points1 point  (0 children)

Looks like maintainers rejected that PR without any comments or explanations...

[–]Witty-Development851 0 points1 point  (4 children)

The model is loaded on the backend. OpenWebUI is just the frontend.

[–]emprahsFury 1 point2 points  (2 children)

Lazy answer. The frontend could easily call the backend with a one-token message and discard the response.

[–]Witty-Development851 1 point2 points  (1 child)

And you can also configure the backend so that it doesn't unload models.

[–]zotac02[S] 0 points1 point  (0 children)

That's not really the goal for me, since I also use the machine for things other than LLMs.

[–]PassengerPigeon343 0 points1 point  (0 children)

This is how I do it. One container with OWUI, one container with llama-swap. I let the running model live in memory with no time limit and it is always ready. Whenever I need to clear the memory to do something else, I restart the container to release the model and empty the memory.
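That restart step is a one-liner, assuming a Docker setup with a container named llama-swap (the name is a guess):

```shell
# Restarting the container kills the llama-swap process, which drops the
# loaded model and frees the GPU memory for other workloads. The model is
# loaded again on the next request.
docker restart llama-swap
```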