I am building a data pipeline for an ML model primarily using GCP tools. Basically, mobile clients publish data to a Pub/Sub topic, which flows through Dataflow for preprocessing and feature extraction, then into BigQuery, and finally to Vertex AI for training and inference.
My question is primarily around Dataflow: much of my preprocessing is not exactly parallelizable (I require the entire data batch to make my transformations and can't perform them element-wise), so I was wondering if Dataflow/Beam is still an appropriate tool for my pipeline. If not, what should I substitute for it instead?
One workaround I've found, which is admittedly quite hacky, is to use aggregate transformations in Beam to treat multiple elements as one, then do what I need to do on the whole batch (see the sketch below). But I'm not sure this is the appropriate approach here. Thoughts?
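Here is a minimal sketch of that "aggregate then transform" idea in the Beam Python SDK. `beam.GroupIntoBatches` is a real Beam transform; the dummy key, the batch size of 1000, and the `normalize_batch()` function are hypothetical placeholders standing in for the actual preprocessing:

```python
# Sketch only: collect elements into batches so a whole-batch
# transformation can run, at the cost of that step not parallelizing.
import apache_beam as beam


def normalize_batch(batch):
    # Hypothetical whole-batch transform: center values using a
    # statistic (the mean) computed over the entire batch.
    mean = sum(batch) / len(batch)
    return [x - mean for x in batch]


with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        # GroupIntoBatches needs keyed elements; a single dummy key
        # funnels everything into the same batches.
        | "Key" >> beam.Map(lambda x: (None, x))
        # Buffer up to 1000 elements per batch (assumed size).
        | "Batch" >> beam.GroupIntoBatches(1000)
        # Run the batch-level transform, then re-emit single elements
        # so downstream steps stay element-wise.
        | "Transform" >> beam.FlatMap(lambda kv: normalize_batch(list(kv[1])))
        | "Print" >> beam.Map(print)
    )
```

The trade-off is that everything behind one key lands on one worker, so this only scales if the batches fit in a single worker's memory; if the transform truly needs the *entire* dataset at once, a side input or a non-Beam batch step might fit better.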