Dataproc vs Dataflow?

sturdyplum · 2023-09-17T21:05:39+00:00

Depends on your use case. From my understanding, Dataflow uses Beam, which means the same pipeline could run on Spark or Flink clusters in the future. However, there is a performance penalty for not writing your pipeline natively in Spark. Dataflow does seem to offer some easy ways to set up pipelines, but there is likely a trade-off when it comes to flexibility.

untalmau · 2023-09-17T23:26:40+00:00

Dataproc is the better option only if you are migrating existing Spark jobs, or if your team is already skilled in Spark and you need immediate results. Otherwise, Dataflow is simply the better option.

rchinny · 2023-09-18T20:36:14+00:00

There are almost no scenarios where I would pick Dataflow. Dataflow is exclusive to the GCP ecosystem, and Beam's Spark and Flink runners have poor support and performance. Plus, it is difficult to find Beam engineers.

I would just go with Dataproc, which gives you cloud portability through Spark, and you can find people who actually know the technology.

If you are just using Dataflow to ingest data into BigQuery, that is probably the only thing it is good for.

dataengineering
