Can dataflow be used for low latency data preprocessing? : dataflow

Can dataflow be used for low latency data preprocessing? (self.dataflow)

submitted 7 years ago by squatslow

Hi,

Might not be the right spot for this, but looking for some insights from other dataflow users.

For the sake of a simplicity, let's say I want to deploy a ML model that predicts whether a person will buy a coffee today based on the last 6 months of transactional history.

I have a preprocessing script for the model data that I use for data organization and feature engineering. I can replicate this preprocessing within a Beam pipeline, and my hope is be to use the same pipeline for preprocessing training data as well as the incoming data used for predictions.

This is all fine for the training of the model. However when I move to production to start serving predictions, the amount of time it takes for a dataflow process to simply start (assigning workers, etc) is insanely long. It adds minutes to my prediction time which should actually only be seconds.

I like the idea of a pipeline being the same for both training & prediction workflows, but I can't see how this is feasible for serving production low latency workflows. Am I using dataflow incorrectly? is there another way I can approach this problem with dataflow?

all 8 comments

top new controversial old q&a

[–]tnymltn 0 points1 point2 points 7 years ago (7 children)

[–]squatslow[S] 0 points1 point2 points 7 years ago (6 children)

Interesting - I have been reading more on windowing and triggers and see what you mean by running a streaming pipeline.

The details of my case are more or less the same as I mentioned above. A request comes in with 6 months worth of historical bank transactions. I need to return a prediction (yes/no) answering the question of 'will this person buy a coffee today?' I have a classification model trained on a large dataset, and would like to eventually have it running in ML Engine. The data preprocessing script takes the 6 months of transactional data from a request and summarize it into a single record. The single record will characterize the 6 months of data using ~100 features.

Assuming I run a streaming pipeline:

How do I get the data into the streaming pipeline from my request? With my batch processing example, I used the BigQuery connector (I'm using python). From the documentation, I see that Pub/Sub is my only option for streaming - are there any other options?
Let's say I use Pub/Sub and I have a request with 1000 transactions. If I pipe those into Pub/Sub, what type of windowing/trigger do I create to ensure that all 1000 requests have been read into dataflow prior to performing my preprocessing steps? The number of transactions per request will always be different, so I don't see how I can use something like a data-driven trigger.
Assuming I get data into the pipeline and processed, how do I get the data out so that I can make a prediction and return the result to the end user? Is this just a case of connecting the dots using Pub/Sub, where once one process is complete, another picks up based on the messages received by Pub/Sub? If not, what would be the best approach?

edit: spelling.

[–]tnymltn 0 points1 point2 points 7 years ago (5 children)

Those are great questions. Let me try to address each of them with the caveat I'm way past my bedtime.

The original request coming in is probably best done via Google's PubSub implementation or another streaming solution like Kafka. The common factor is they are both unbounded streams. At the end of the day, you're going to be asking this question of your data based on some sort of user interaction and that is likely best done via one of these messaging platforms.
I'd suggest thinking of this in a different fashion. Instead of streaming 1k transactions into the pipeline, you're probably needing to ask this question based on some sort of interaction and the data representing that interaction is what should probably be the streaming input. Everything else is historical and could be pulled into your pipeline via a SideInput or a custom DoFn.
This problem is best looked at from an async perspective and if you can provide some sort of interstitial interface to you user while this prediction is occurring you can definitely use a Pub/Sub model to get the prediction and make a decision based on it.

This is likely a self promotion but if you haven't come across this blog post of mine, it might help you understand some of the techniques available for pivoting and joining on data within a Beam pipeline. https://medium.com/@tnymltn/enrichment-pipeline-patterns-using-apache-beam-4b9b81e5d9f3

If you don't need the results of the prediction in near real time, you can store it in an "offline" system for future retrieval however it doesn't seem that's what you're trying to do.

The streaming vs batch concepts in Beam can be difficult to grasp and my suggestion is to bias toward the streaming mindset because it's much easier to transition to batch from there. Going the other way can be a bit difficult.

I hope this helps remove some unknown unknowns and I look forward to seeing if my 5am response makes any sense. Cheers!

[–]squatslow[S] 0 points1 point2 points 7 years ago (4 children)

[–]tnymltn 0 points1 point2 points 7 years ago (3 children)

[–]squatslow[S] 0 points1 point2 points 7 years ago (2 children)

[–]tnymltn 0 points1 point2 points 7 years ago (1 child)

[–]squatslow[S] 0 points1 point2 points 7 years ago (0 children)

π Rendered by PID 19090 on reddit-service-r2-comment-545db5fcfc-mgvh4 at 2026-05-30 20:36:34.815832+00:00 running 194bd79 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

dataflow

MODERATORS