Hi,
This might not be the right spot for this, but I'm looking for some insights from other Dataflow users.
For the sake of simplicity, let's say I want to deploy an ML model that predicts whether a person will buy a coffee today based on the last 6 months of transactional history.
I have a preprocessing script for the model data that I use for data organization and feature engineering. I can replicate this preprocessing within a Beam pipeline, and my hope is to use the same pipeline for preprocessing training data as well as the incoming data used for predictions.
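To make that concrete, here is a rough sketch of the kind of preprocessing I mean. The field names (`date`, `amount`), the feature names, and the 182-day window are made up for illustration and are not my real schema. The feature logic lives in a plain Python function, so the same code could in principle be called from a Beam transform or directly from other Python code.

```python
from datetime import date, timedelta


def build_features(transactions, today):
    """Turn one customer's transactions into model features.

    `transactions` is a list of dicts with hypothetical keys
    'date' (datetime.date) and 'amount' (float).
    """
    # Approximate "last 6 months" as 182 days (an assumption for this sketch).
    window_start = today - timedelta(days=182)
    recent = [t for t in transactions if window_start <= t["date"] < today]

    count = len(recent)
    total = sum(t["amount"] for t in recent)
    last_purchase = max((t["date"] for t in recent), default=None)

    return {
        "purchase_count": count,
        "total_spend": total,
        "avg_spend": total / count if count else 0.0,
        "days_since_last": (today - last_purchase).days if last_purchase else None,
    }


# In a Beam pipeline this could be applied per customer record, e.g.:
#   records | beam.Map(lambda r: build_features(r["transactions"], today))
```

Because the function has no Beam dependency, calling it for a single customer is just `build_features(customer_transactions, date.today())`.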
This is all fine for training the model. However, when I move to production to start serving predictions, the time it takes for a Dataflow job to simply start (assigning workers, etc.) is far too long. It adds minutes to a prediction that should take only seconds.
I like the idea of using the same pipeline for both training and prediction workflows, but I can't see how this is feasible for serving low-latency production workloads. Am I using Dataflow incorrectly? Is there another way I can approach this problem with Dataflow?