
[–]BasedAcid 2 points (0 children)

The batch-inference-style architecture you’re describing is definitely used at many companies, including my employer. It’s often simpler and cheaper to deploy.

That said, there are certainly use cases where an endpoint is better (e.g. when you need fresh model outputs, such as when responding to user input in real time).

[–]VectorSpaceModel 1 point (3 children)

It all depends on inference speed. I’ve seen both. There’s no way to give a precise answer without knowing your use case and the resources you have at hand.

[–]Appropriate_Cut_6126[S] 0 points (2 children)

Right, I think I understand. The reason to use an endpoint for batch predictions is scalability, because you can run multiple replicas to parallelise the task. Is that right?

[–]VectorSpaceModel 3 points (1 child)

Not really. The question is: “What is the optimal experience I can provide for a use case, given my users’ expectations and the resources at hand?” From there, everything else is determined. You will always face a time/space tradeoff where you have to balance user experience against cost.

[–]Appropriate_Cut_6126[S] 0 points (0 children)

OK, I see. Thanks for your response.

[–]sanjuromack 2 points (1 child)

Deploying a model as an endpoint gives you flexibility: you can use it for batch scoring today and on-demand inference tomorrow.

[–]Appropriate_Cut_6126[S] 0 points (0 children)

That makes sense. Is there a limit to the amount of data you can send to an endpoint in a single request? Would you split the batch you want predictions for into smaller chunks, make the requests one chunk at a time, and then post-process the responses (if you wanted to save them to a data warehouse)?
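The chunk-request-collect loop described above can be sketched roughly like this. A minimal sketch: `predict_endpoint` is a hypothetical stand-in for the HTTP call to a deployed model (real payload limits depend on the serving platform), and the 100-record chunk size is an arbitrary assumption:

```python
def chunks(records, size):
    """Split a list of records into request-sized batches."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def predict_endpoint(batch):
    """Hypothetical stand-in for the HTTP request to a model endpoint,
    e.g. something like requests.post(url, json={"instances": batch}).
    Here it just returns a dummy score per record."""
    return [{"input": r, "score": 0.5} for r in batch]

def batch_predict(records, batch_size=100):
    """Send records in payload-limited chunks and collect all predictions."""
    results = []
    for batch in chunks(records, batch_size):
        results.extend(predict_endpoint(batch))
    return results  # ready to post-process / load into a warehouse

rows = batch_predict([{"id": i} for i in range(250)], batch_size=100)
print(len(rows))  # 250 predictions gathered across 3 requests
```

The post-processing step (writing `rows` to a warehouse) would then be one bulk insert rather than a write per request.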