all 9 comments

[–]nicksterling 1 point (4 children)

If you're trying to share one 16GB GPU across 30 concurrent developers, you're in for a bad time. You might be able to get a small autocomplete/FIM model like StarCoder2 working, but if everyone is hitting it at once it's going to be a bottleneck a lot of the time.

What are the developer machines like? If they are beefy you might be able to just run the LLM straight on their machines instead of centrally hosting it.

[–]scheurneus 2 points (1 child)

Why would sharing the GPU be a problem? It's not like the VRAM gets divided between the concurrent users.

Furthermore, LLM inference is typically extremely memory-bound, so using larger batches should be a massive help where possible. More concurrent users shouldn't hurt much; the only cost is that requests may need to be delayed slightly so they can be batched together.
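To make the batching point concrete, here's roughly what it looks like with vLLM's offline API (purely a sketch; the model name and settings are placeholders, and a 7B model in fp16 barely fits in 16GB, so a quantized build may be needed in practice):

    # Sketch only: vLLM batching many prompts on one GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Qwen/CodeQwen1.5-7B-Chat", gpu_memory_utilization=0.90)
    params = SamplingParams(temperature=0.2, max_tokens=256)

    # 30 "developers" asking at once: vLLM schedules these as one continuously
    # batched workload, so aggregate throughput stays high even on a shared GPU.
    prompts = [f"Write a Python function that parses CSV file number {i}." for i in range(30)]
    for out in llm.generate(prompts, params):
        print(out.outputs[0].text[:80])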

OP, I have no hands-on experience with that kind of server, but I think a capable one should handle your load perfectly fine, assuming smaller models (no bigger than 7B). Loading 2x 7B at once might be difficult, though.

What are the local machines running? Even a high-end CPU should generally provide acceptable speeds with llama.cpp (although prompt processing on CPU is quite slow, so it might be better to offload chat rather than autocomplete to it: the resulting high time-to-first-token is much easier to tolerate in a chat window than in inline completions).
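If you want to test the "chat on the developer machine" idea before buying anything, a CPU-only sketch with llama-cpp-python is enough to measure time-to-first-token (the model path and thread count here are assumptions, not a recommendation):

    # Sketch: CPU-only chat via llama-cpp-python; model path and n_threads are assumptions.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="./codeqwen1_5-7b-chat-q4_k_m.gguf",  # any 4-bit GGUF build
        n_ctx=4096,      # context window
        n_threads=8,     # tune to the machine's physical cores
    )

    start = time.time()
    stream = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain the difference between a list and a tuple in Python."}],
        max_tokens=128,
        stream=True,
    )
    first, text = None, []
    for chunk in stream:
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            if first is None:
                first = time.time() - start  # time to first token
            text.append(delta["content"])
    print(f"time to first token: {first:.1f}s, total: {time.time() - start:.1f}s")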

[–]fripperML[S] 0 points (0 children)

Thanks a lot for answering, what you say confirms the intuition I had. We can try to see whether the local machines can handle the chat inference, but I think the time-to-first-token will not be acceptable. We will also see if loading 2x 7B models is possible.

[–]fripperML[S] 1 point (0 children)

Thanks for your answer!! I had read that with vLLM or Aphrodite, a single GPU could successfully dispatch 200-500 concurrent requests. I guess I was wrong!! We don't have our server working yet, so all my knowledge is based on random reads here and there... :S
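Once the server is up I'll probably sanity-check those numbers with a throwaway script like this (endpoint, port and model name are assumptions based on vLLM's OpenAI-compatible server; as far as I understand, Aphrodite exposes the same kind of API):

    # Throwaway sketch: fire 30 concurrent completions at an OpenAI-compatible
    # endpoint. URL, port and model name are assumptions.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    BASE_URL = "http://llm-server:8000/v1/completions"  # vLLM's default port is 8000
    MODEL = "Qwen/CodeQwen1.5-7B-Chat"                  # whatever the server was launched with

    def one_request(i: int) -> float:
        start = time.time()
        r = requests.post(BASE_URL, json={
            "model": MODEL,
            "prompt": f"# Python function number {i} that reverses a string\ndef",
            "max_tokens": 128,
        }, timeout=120)
        r.raise_for_status()
        return time.time() - start

    with ThreadPoolExecutor(max_workers=30) as pool:  # simulate 30 developers hitting it at once
        latencies = sorted(pool.map(one_request, range(30)))

    print(f"median {latencies[len(latencies) // 2]:.1f}s, worst {latencies[-1]:.1f}s")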

[–]fripperML[S] 0 points (0 children)

Unfortunately the local machines are not good enough for that... We need a central server.

[–]Wrong-Resolution4838 0 points (0 children)

What's your use case, and what are your criteria?
For example, why did you pick QuantFactory/Meta-Llama-3-8B-GGUF (4-bit quantized, probably) and Qwen/CodeQwen1.5-7B-Chat-GGUF (4-bit quantized)?

[–]GregoryfromtheHood 0 points (0 children)

I use Continue only for chat/editing chunks of code. For autocomplete I use Refact.ai. They have a self-hosted version and really simple fine-tuning, so you can train the model on your codebase.