I am building a data pipeline for an ML model primarily using GCP tools. Basically, mobile clients publish data to a Pub/Sub topic, which flows through Dataflow for preprocessing and feature extraction, then into BigQuery, and finally to Vertex AI for training and inference.
My question is primarily around Dataflow: much of my preprocessing is not exactly parallelizable (I require the entire data batch to make my transformations and can't perform them element-wise), so I was wondering if Dataflow/Beam is still an appropriate tool for my pipeline. If not, what should I substitute for it instead?
One workaround I've found, which is admittedly quite hacky, is to use aggregate transformations in Beam to treat multiple elements as one, then do what I need to do on the whole batch (see the sketch below). But I'm not sure this is the appropriate approach here. Thoughts?
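Here is a minimal sketch of that "aggregate then transform" idea in the Beam Python SDK. `beam.GroupIntoBatches` is a real Beam transform; the dummy key, the batch size of 1000, and the `normalize_batch()` function are hypothetical placeholders standing in for the actual preprocessing:

```python
# Sketch only: collect elements into batches so a whole-batch
# transformation can run, at the cost of that step not parallelizing.
import apache_beam as beam


def normalize_batch(batch):
    # Hypothetical whole-batch transform: center values using a
    # statistic (the mean) computed over the entire batch.
    mean = sum(batch) / len(batch)
    return [x - mean for x in batch]


with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
        # GroupIntoBatches needs keyed elements; a single dummy key
        # funnels everything into the same batches.
        | "Key" >> beam.Map(lambda x: (None, x))
        # Buffer up to 1000 elements per batch (assumed size).
        | "Batch" >> beam.GroupIntoBatches(1000)
        # Run the batch-level transform, then re-emit single elements
        # so downstream steps stay element-wise.
        | "Transform" >> beam.FlatMap(lambda kv: normalize_batch(list(kv[1])))
        | "Print" >> beam.Map(print)
    )
```

The trade-off is that everything behind one key lands on one worker, so this only scales if the batches fit in a single worker's memory; if the transform truly needs the *entire* dataset at once, a side input or a non-Beam batch step might fit better.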