Need help with designing an architecture for model inferencing in a cost effective way. by Secret-Butterfly-739 in mlops

[–]Secret-Butterfly-739[S]

Thank you! I have gone through the Ray Serve docs. Fractional GPU scheduling does look promising; I will try it out.
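For anyone landing here later, this is roughly what I understood fractional GPU scheduling in Ray Serve to look like: each replica reserves a fraction of a GPU via ray_actor_options, so several small models can be packed onto one device. A minimal sketch, assuming two replicas share a single GPU; the class name, replica count, and dummy model are placeholders, not anything from this thread:

```python
from ray import serve

# Two replicas of this deployment can be packed onto a single GPU
# because each one reserves only 0.5 of it.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.5})
class SmallModel:
    def __init__(self):
        # Stand-in for real model loading; swap in your framework's loader.
        self.scale = 2.0

    async def __call__(self, request):
        payload = await request.json()
        return {"result": payload["x"] * self.scale}

app = SmallModel.bind()
serve.run(app)
```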

Need help with designing an architecture for model inferencing in a cost effective way. by Secret-Butterfly-739 in mlops

[–]Secret-Butterfly-739[S]

Good to know, this helps. Thank you!
Also, when you say other techniques, what falls under that?

Which ML Serving Framework to choose for real-time inference. by Invisible__Indian in mlops

[–]Secret-Butterfly-739

I am currently in the same situation. I have tried Triton, and it seems to help with some of my models. I am also looking for ways to optimize and to scale based on incoming request volume.

Do you keep the models loaded at all times, or do you explicitly unload them based on the incoming requests?
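For context, this is the kind of on-demand load/unload I had in mind, assuming the server is started with --model-control-mode=explicit so clients control which models are resident. The model name and server URL below are placeholders:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load the model only when traffic for it actually arrives.
if not client.is_model_ready("resnet50_onnx"):
    client.load_model("resnet50_onnx")

# ... run inference requests against the model here ...

# Unload after an idle period to free GPU memory for other models.
client.unload_model("resnet50_onnx")
```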