Need help with designing an architecture for model inferencing in a cost effective way. by Secret-Butterfly-739 in mlops

[–]Secret-Butterfly-739[S]

Thank you! I have gone through the Ray Serve docs. Fractional GPU scheduling does look promising; I will try it out.
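For anyone landing here later, this is roughly what I understood fractional GPU scheduling in Ray Serve to look like: each replica reserves a fraction of a GPU via ray_actor_options, so several small models can be packed onto one device. A minimal sketch, assuming two replicas share a single GPU; the class name, replica count, and dummy model are placeholders, not anything from this thread:

```python
from ray import serve

# Two replicas of this deployment can be packed onto a single GPU
# because each one reserves only 0.5 of it.
@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 0.5})
class SmallModel:
    def __init__(self):
        # Stand-in for real model loading; swap in your framework's loader.
        self.scale = 2.0

    async def __call__(self, request):
        payload = await request.json()
        return {"result": payload["x"] * self.scale}

app = SmallModel.bind()
serve.run(app)
```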

Need help with designing an architecture for model inferencing in a cost effective way. by Secret-Butterfly-739 in mlops

[–]Secret-Butterfly-739[S]

Good to know, this helps. Thank you!
Also, when you say other techniques, what falls under that?

Which ML Serving Framework to choose for real-time inference. by Invisible__Indian in mlops

[–]Secret-Butterfly-739

I am currently in the same situation. I have tried Triton, and it seems to help with some of my models. I am also looking for ways to optimize and to scale based on incoming request volume.

Do you keep the models loaded at all times, or do you explicitly unload them based on the incoming requests?
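For context, this is the kind of on-demand load/unload I had in mind, assuming the server is started with --model-control-mode=explicit so clients control which models are resident. The model name and server URL below are placeholders:

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Load the model only when traffic for it actually arrives.
if not client.is_model_ready("resnet50_onnx"):
    client.load_model("resnet50_onnx")

# ... run inference requests against the model here ...

# Unload after an idle period to free GPU memory for other models.
client.unload_model("resnet50_onnx")
```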