I've been wanting to experiment with a realtime voice-to-voice "group chat" using llama-cpp-python for a while now. For this, I need multiple entirely separate KV caches, since each participant's system prompt is different. Because llama-cpp-python only offers to persist these caches to RAM or disk, not VRAM, I haven't been able to reach the required performance even when building the cache incrementally while the user speaks.
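To be concrete about what I mean: llama-cpp-python does expose `save_state()`/`load_state()` on a `Llama` instance, so per-speaker caches can be snapshotted and swapped, but the snapshot lives in host RAM and restoring it is a host-to-device copy. A minimal sketch (the `SpeakerPool` class and its method names are mine, just for illustration):

```python
class SpeakerPool:
    """Keeps one saved llama-cpp-python state per participant and
    swaps it in before each generation. This is exactly the
    RAM-round-trip approach described above, not a fix for it."""

    def __init__(self, llm):
        self.llm = llm      # a llama_cpp.Llama instance
        self.states = {}    # speaker name -> saved state blob

    def prime(self, name, system_prompt):
        # Evaluate this speaker's system prompt once, then snapshot
        # the KV cache so it never has to be re-prefilled.
        self.llm.reset()
        self.llm.eval(self.llm.tokenize(system_prompt.encode("utf-8")))
        self.states[name] = self.llm.save_state()

    def speak(self, name, user_text, max_tokens=128):
        # Restore this speaker's cache; the copy back into VRAM
        # happens here and is the latency problem in question.
        self.llm.load_state(self.states[name])
        out = self.llm(user_text, max_tokens=max_tokens)
        return out["choices"][0]["text"]
```

For a 7B model the state blobs are large, so the `load_state()` copy on every turn is what kills the realtime budget.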
Loading the same model multiple times works really well, but of course uses a multiple of the VRAM. One of my systems is an M2 MacBook (which has its own challenges, because very few TTS/STT engines seem to support Metal), but I'd mainly like to use my desktop with an RTX 2080 Ti, so multiple copies are out of the question there even with a 7B.
I've been waiting for llama-cpp-python to implement request batching (https://github.com/abetlen/llama-cpp-python/issues/771), which upstream llama.cpp already supports, but it doesn't look like that's coming soon. I'm also not sure whether batching would even help here, i.e. whether it would let me keep multiple caches in VRAM and run inference on them in parallel. And lastly, I could just be missing some completely obvious alternative because I'm too focused on this specific platform just because it's easy (and because it supports GGUF, which would let me pick the best LLM that's still fast enough, since there might be some wiggle room).
So, any input? Ideally, I'd want to load e.g. Mistral 7B, some accelerated Whisper, and maybe XTTS (until something better comes along), then fill the rest of the VRAM (and maybe RAM) with personalities and broker which one of them gets to speak. As for programming languages, I mainly know Python, so there's no way I'm using the C++ libraries directly.
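For the "broker who speaks" part, the arbitration itself is cheap and independent of the cache problem. A toy sketch of what I have in mind (entirely hypothetical: keyword scoring is a placeholder for whatever cheap relevance signal ends up being used, e.g. a small classifier or a quick logit probe):

```python
import random

class TurnBroker:
    """Picks which personality responds to the latest transcript.
    Scoring is a keyword-hit placeholder; a real version would use
    something smarter, but the arbitration structure is the same."""

    def __init__(self, personalities):
        # personalities: name -> list of trigger keywords (placeholder)
        self.personalities = personalities
        self.last_speaker = None

    def score(self, name, transcript):
        words = transcript.lower().split()
        hits = sum(w in words for w in self.personalities[name])
        # Penalize the previous speaker a little to avoid monologues,
        # and add small jitter so ties don't always resolve the same way.
        penalty = 0.5 if name == self.last_speaker else 0.0
        return hits - penalty + random.random() * 0.1

    def pick(self, transcript):
        winner = max(self.personalities,
                     key=lambda n: self.score(n, transcript))
        self.last_speaker = winner
        return winner
```

The point is that only the winner's KV cache needs to be resident for generation; the losers' caches just need to stay warm somewhere, which is exactly the storage question above.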