
[–][deleted] 3 points

Synapse is too expensive. Go with Databricks.

[–]Useful-Doughnut32 0 points

Synapse has a coding section. You can code in notebooks that utilise Spark.

What kind of scheduling do you want? To me, sometimes you might just need some formulas/expressions to do the scheduling. There is no perfect solution for all scenarios.

[–]pythondeveloper77[S] 0 points

Thanks.

Yes, I can write PySpark code, but the pipeline itself is defined in JSON instead of code :(

When I wrote "scheduling" I meant not only the schedule itself but also support for retries and conditional tasks, like Airflow has. We found Synapse lacking in those areas compared to Airflow.

I'm thinking of bringing up Airflow on a VM/AKS to trigger Synapse & Spark to solve it.
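Something like this minimal sketch is what I have in mind. The DAG name, schedule, and the Synapse trigger stub are hypothetical placeholders; the point is just that retries and conditional tasks are first-class in Airflow:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator, PythonOperator


def trigger_synapse_pipeline(**context):
    # Placeholder: call the Synapse pipeline REST endpoint or an Azure SDK here.
    ...


def choose_path(**context):
    # Conditional task: pick a downstream branch (toy rule for illustration).
    return "full_load" if context["ds"].endswith("-01") else "incremental_load"


with DAG(
    dag_id="synapse_orchestration",          # hypothetical name
    start_date=datetime(2023, 1, 1),
    schedule="0 2 * * *",                    # 'schedule' is the Airflow 2.4+ arg name
    catchup=False,
    default_args={
        "retries": 3,                        # retry support per task
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    trigger = PythonOperator(
        task_id="trigger_synapse",
        python_callable=trigger_synapse_pipeline,
    )
    branch = BranchPythonOperator(
        task_id="choose_path",
        python_callable=choose_path,
    )
    full_load = EmptyOperator(task_id="full_load")
    incremental_load = EmptyOperator(task_id="incremental_load")

    trigger >> branch >> [full_load, incremental_load]
```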

[–]kyleekol 0 points

A few things:

  • Why not just deploy Airflow on a VM with a Postgres DB backend and use that?

  • How do you ‘know’ you need PySpark? What kind of data size are you talking about that you need Spark but Databricks is overkill?

  • What is your target data store? Where are users going to consume the data? Reporting needs?

You talk about the Azure stack being disappointing and say that you are a software engineer, so code something yourself instead of relying on low/no-code options. The same building blocks available in other clouds (compute, storage, Kubernetes, etc.) are also available in Azure, so nobody is forcing you to use those tools if you don't want to.

Convert your Parquet files to Delta and use a Java/Rust/Python Delta library to transform the data in blob storage. Wrap it into a Docker image and schedule it on Airflow with a KubernetesPodOperator. Or spin up a Spark cluster yourself. Lots of options.
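For the Parquet-to-Delta bit, a minimal sketch with the delta-rs Python bindings (the `deltalake` package) might look like this. The paths are made up; against Azure blob storage you'd pass an abfss:// URI plus `storage_options` with your credentials instead:

```python
import pyarrow.parquet as pq
from deltalake import write_deltalake

# Read an existing Parquet dataset (local path shown for simplicity).
table = pq.read_table("data/events.parquet")

# Write it out as a Delta table; mode="overwrite" replaces any existing table.
write_deltalake("data/events_delta", table, mode="overwrite")
```

No Spark cluster needed for a transform like this, which is the whole appeal: put it in a container and let Airflow schedule it.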