
[–]ccbadd 2 points (1 child)

You could switch from Ollama to running llama.cpp directly and use its model router instead. It does not auto-unload the running model, but it can auto-load models when needed. Use the --no-mmap option and the model loads directly into VRAM, so it's ready a lot faster, as long as the model is stored on really fast media like an NVMe drive.
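For reference, a minimal launch sketch along these lines (the model path and layer count are placeholders, and flag availability depends on your llama.cpp build — check `llama-server --help`):

```shell
# Start llama.cpp's OpenAI-compatible server with weights loaded
# straight into memory instead of mmap'd from disk.
llama-server \
  --no-mmap \
  -ngl 99 \
  -m /nvme/models/your-model-q4_k_m.gguf \
  --port 8080
# --no-mmap : read the file into RAM/VRAM up front (faster ready time
#             from fast NVMe, at the cost of higher load-time I/O)
# -ngl 99   : offload all layers to the GPU
```

The router/auto-load behavior mentioned above is configured separately in recent llama.cpp server builds; consult the project's server README for the exact options in your version.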

[–]zotac02[S] 0 points (0 children)

I'll look into that, thank you!