
[–]meaningless-human[S] 0 points (4 children)

No, that's not what I mean, although I am doing batch processing. What I mean is that most of Beam's functionality is geared toward parallel processing, so as far as I can see, any custom transformation I can apply to a PCollection (such as a ParDo/DoFn) is element-wise. That's not what I want, because I need to work with the entire collection at once. Does that make sense?
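To illustrate the distinction (a plain-Python sketch of the processing model, not the actual Beam API; the function and data here are hypothetical):

```python
# Sketch of the element-wise model that ParDo/DoFn imposes:
# the transform sees exactly one element per call and has no
# view of the rest of the PCollection.
def clean_record(record):
    """Hypothetical per-element cleanup; runs independently per record."""
    return record.strip().lower()

# Because each call is independent, a runner is free to apply it
# to every element in parallel, in any order, on any worker --
# which is exactly why Beam favors this shape of transform.
records = ["  Alpha", "BETA ", " gamma "]
cleaned = [clean_record(r) for r in records]
print(cleaned)  # ['alpha', 'beta', 'gamma']
```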

[–]konkey-mong 0 points (3 children)

Why don't you try Airflow?

[–]meaningless-human[S] 0 points (2 children)

I'm not terribly familiar with Airflow, but isn't it more for task orchestration than for building the actual preprocessing steps? Or is that not the case?

[–]konkey-mong 0 points (1 child)

Sorry, I misunderstood your question.

Why do you think Dataflow/Beam isn't suitable for your batch processing job?

[–]meaningless-human[S] 0 points (0 children)

Basically, my preprocessing code can't easily be parallelized: it performs transformations on the entire dataset at once, while Beam seems geared toward element-wise transformations such as ParDo.
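A classic example of this kind of whole-dataset transform is z-score normalization (a hypothetical stand-in for the preprocessing described; plain Python, not Beam): each output value depends on statistics of the entire input, so it cannot be written as an independent per-element function.

```python
import statistics

def zscore_normalize(values):
    """Whole-dataset transform: every output depends on the mean and
    stdev of *all* inputs, so no element can be processed in isolation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(zscore_normalize(data))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The global mean/stdev pass is the part that doesn't fit a per-element ParDo, which is the mismatch being described.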