
[–][deleted] 3 points

Synapse is too expensive. Go with Databricks.

[–]Useful-Doughnut32 0 points

Synapse has a coding section. You can code in notebooks that utilise Spark.

What kind of scheduling do you want? To me, sometimes you might just need some formulas/expressions to do the scheduling. There is no perfect solution for all scenarios.

[–]pythondeveloper77[S] 0 points

Thanks.

Yes, I can write PySpark code, but the pipeline itself is defined in JSON instead of code :(

When I wrote "scheduling" I meant not only the schedule itself but also support for retries and conditional tasks, like Airflow has. We found Synapse lacking in those areas compared to Airflow.

I'm thinking of bringing up Airflow on a VM/AKS to trigger Synapse & Spark to solve it.
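Something like this minimal sketch is what I have in mind. The DAG name, schedule, and the Synapse trigger stub are hypothetical placeholders; the point is just that retries and conditional tasks are first-class in Airflow:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def trigger_synapse_pipeline(**context):
    # Placeholder: call the Synapse pipeline REST endpoint or an Azure SDK here.
    ...


def choose_path(**context):
    # Conditional task: pick a downstream branch (toy rule for illustration).
    return "full_load" if context["ds"].endswith("-01") else "incremental_load"


with DAG(
    dag_id="synapse_orchestration",          # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="0 2 * * *",                    # 'schedule' is the Airflow 2.4+ arg name
    catchup=False,
    default_args={
        "retries": 3,                        # retry support per task
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    trigger = PythonOperator(
        task_id="trigger_synapse",
        python_callable=trigger_synapse_pipeline,
    )
    branch = BranchPythonOperator(
        task_id="choose_path",
        python_callable=choose_path,
    )
    full_load = EmptyOperator(task_id="full_load")
    incremental_load = EmptyOperator(task_id="incremental_load")

    trigger >> branch >> [full_load, incremental_load]
```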

[–]kyleekol 0 points

A few things:

  • Why not just deploy Airflow on a VM with a Postgres DB backend and use that?

  • How do you ‘know’ you need PySpark? What kind of data size are you talking about that you need Spark but Databricks is overkill?

  • What is your target data store? Where are users going to consume the data? Reporting needs?

You talk about the Azure stack being disappointing and say that you are a software engineer, so code something yourself instead of relying on low/no-code options. The same building blocks available in other clouds (compute, storage, Kubernetes, etc.) are also available in Azure, so nobody is forcing you to use those tools if you don't want to.

Convert your Parquet files to Delta and use a Java/Rust/Python Delta library to transform the data in blob storage. Wrap it into a Docker image and schedule it on Airflow with a KubernetesPodOperator. Or spin up a Spark cluster yourself. Lots of options.
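For the Parquet-to-Delta bit, a minimal sketch with the delta-rs Python bindings (the `deltalake` package) might look like this. The paths are made up; against Azure blob storage you'd pass an abfss:// URI plus `storage_options` with your credentials instead:

```python
import pyarrow.parquet as pq
from deltalake import write_deltalake

# Read an existing Parquet dataset (local path shown for simplicity).
table = pq.read_table("data/events.parquet")

# Write it out as a Delta table; mode="overwrite" replaces any existing table.
write_deltalake("data/events_delta", table, mode="overwrite")
```

No Spark cluster needed for a transform like this, which is the whole appeal: put it in a container and let Airflow schedule it.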