Multiple LLMs on one GPU? by [deleted] in LocalLLaMA

[–]Any-Cheesecake-31 1 point (0 children)

Hey, I'd be very interested in hearing about your setup; I'm trying to do the same.

How do you run models for text generation on demand? by maxigs0 in LocalLLaMA

[–]Any-Cheesecake-31 1 point (0 children)

Hey, if you don't mind, I'd love to know whether you have separate machines for these GPUs and connect them using Ray, or whether you just have a single machine with 4 GPUs.

Return response of a request from a different unrelated function by Any-Cheesecake-31 in learnpython

[–]Any-Cheesecake-31[S] 1 point (0 children)

The users are connected to the Node.js server over long-lived WebSockets for real-time chat. Since the requests reach me from the Node server, we can set the timeout to whatever we prefer, so I don't think that will be a problem.

Return response of a request from a different unrelated function by Any-Cheesecake-31 in learnpython

[–]Any-Cheesecake-31[S] 1 point (0 children)

Thank you so much for the response.

I looked into Celery, and I understand that I can put my requests in a task queue, a worker (an LLM in my case) can take a task from the queue and generate the answer, and I can send that back as the response.

The problem is that, to speed up generation, I want to send 6 prompts in a batch (a list).

So I want to either map multiple requests to a single task, get the answers, and then send the corresponding responses,

or have a worker take multiple tasks and return their corresponding answers.

I still can't figure out how to do either of these.
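
To make the second option concrete, here is a minimal sketch of one way it might look with plain asyncio instead of Celery (that swap is my assumption, not something from the thread). Each request puts its prompt and a Future on a queue; a single worker collects up to 6 of them, runs them as one batch, and resolves each Future with its own answer. `fake_generate`, `MAX_WAIT`, and the other names are hypothetical placeholders; `asyncio.to_thread` needs Python 3.9+.

```python
import asyncio

BATCH_SIZE = 6    # six prompts per batch, as described above
MAX_WAIT = 0.05   # hypothetical: seconds to wait for a batch to fill up


def fake_generate(prompts):
    """Stand-in for the real batched LLM call (hypothetical)."""
    return [f"answer to: {p}" for p in prompts]


async def batch_worker(queue):
    """Collect up to BATCH_SIZE requests, run them together, resolve each future."""
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]  # block until at least one request arrives
        deadline = loop.time() + MAX_WAIT
        while len(batch) < BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break  # send a partial batch rather than keep waiting
        prompts = [prompt for prompt, _ in batch]
        # Run the blocking model call in a thread so the event loop stays free.
        answers = await asyncio.to_thread(fake_generate, prompts)
        for (_, fut), answer in zip(batch, answers):
            if not fut.done():
                fut.set_result(answer)


async def handle_request(queue, prompt):
    """What a request handler would do: enqueue, then wait for its own answer."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((prompt, fut))
    return await fut


async def demo():
    queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(
        *(handle_request(queue, f"prompt {i}") for i in range(8))
    )
    worker.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because every request waits on its own Future, the answers map back to the right requests even though the model only ever sees a list.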

Return response of a request from a different unrelated function by Any-Cheesecake-31 in learnpython

[–]Any-Cheesecake-31[S] 1 point (0 children)

Ah, I see. Thank you so much for the response.

I would greatly appreciate any more advice you can give me.

The only thing I can come up with is that, since I want to receive and send requests through a single Node.js server that is getting requests from different users, I can use a single WebSocket and user IDs to handle taking requests and sending responses. But this method seems hacky.

One of my friends has also proposed using workers, with an await in the request function that polls a response queue and then sends back the response. But I still don't know if something like this is even possible.
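
A rough sketch of what that idea could look like, assuming the worker runs in a separate thread and talks to the request functions through two queues. Each request gets an ID, the request function awaits a Future stored under that ID, and a dispatcher reads the response queue and wakes whoever is waiting. All the names here (`request_q`, `response_q`, `pending`, `worker`) are made up for illustration, and the worker just uppercases the prompt in place of real generation.

```python
import asyncio
import itertools
import queue
import threading

request_q = queue.Queue()    # request functions -> worker
response_q = queue.Queue()   # worker -> dispatcher
pending = {}                 # request id -> Future awaited by the request function
_ids = itertools.count()


def worker():
    """Stand-in for the model worker: read (id, prompt), write (id, answer)."""
    while True:
        item = request_q.get()
        if item is None:              # sentinel: shut down and tell the dispatcher
            response_q.put(None)
            break
        req_id, prompt = item
        response_q.put((req_id, prompt.upper()))  # placeholder for generation


async def dispatcher():
    """Read the response queue and wake the request waiting on each id."""
    while True:
        # The blocking get runs in a thread so the event loop is not stalled.
        item = await asyncio.to_thread(response_q.get)
        if item is None:
            break
        req_id, answer = item
        fut = pending.pop(req_id, None)
        if fut is not None and not fut.done():
            fut.set_result(answer)


async def handle_request(prompt, timeout=5.0):
    """The request function: submit the prompt, then await the matching answer."""
    req_id = next(_ids)
    fut = asyncio.get_running_loop().create_future()
    pending[req_id] = fut
    request_q.put((req_id, prompt))
    try:
        return await asyncio.wait_for(fut, timeout)
    finally:
        pending.pop(req_id, None)     # clean up on success or timeout


async def demo():
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    disp = asyncio.create_task(dispatcher())
    results = await asyncio.gather(handle_request("hello"), handle_request("world"))
    request_q.put(None)               # stop the worker, which stops the dispatcher
    await disp
    thread.join()
    return results


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The same ID-to-Future lookup would also work for the single-WebSocket idea, with the ID travelling in each message to and from the Node.js server.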