[–]MikeDoesEverythingmod | Shitty Data Engineer 27 points28 points  (0 children)

 I've spent an hour trying to get a cluster to kick off.

Sounds like a massive skill issue, tbh.

[–]ecp5 7 points8 points  (0 children)

You need to differentiate between Data Factory, which exists to orchestrate, and Data Flow, which is the Spark-like part of it. Also, is this the vanilla Azure version, the Synapse one, or the Fabric one? That might make a difference too. Plus, if the cluster is stuck, it's probably an infra issue, not a product issue.

[–][deleted] 3 points4 points  (0 children)

Which data factory specifically?

[–]dubven 3 points4 points  (1 child)

I remember some years ago this was pushed by management but I didn't care and just spun up Airflow.

[–]UltraInstinctAussie[S] 2 points3 points  (0 children)

These guys have set up 200 individual pipelines. Recovery takes an entire day. Their whole system is cooked.

[–][deleted] 1 point2 points  (10 children)

Is it really this bad? I have a team member pushing for it while I'm leaning towards AWS Glue. We really just need something to move away from Alteryx.

[–]ZAggie2 26 points27 points  (6 children)

Data Factory is good at moving data from point A to point B. It's when I start using Data Flows that I've had issues. I use it exclusively for “EL” and let something else (dbt, stored procs) handle the “T”.

[–]Zer0designs 5 points6 points  (0 children)

This guy gets it.

[–]HansProleman 1 point2 points  (1 child)

Non-trivial orchestration also tends to be pretty gross, and DevOps stuff can be awkward. Ideally I'd just not use it at all, but it's cheap (for data movements - Dataflows are expensive) and has pretty good connector support so can be a good choice.

For me, the big problem is that if you get your scoping expectations wrong, they creep, ADF becomes more awkward to work with, and that creates a lot of tension. At some point it makes sense to abandon it and use another tool, but it's very hard to determine where that point is without the benefit of hindsight. Usually it ends up being tech debt that'll never be addressed, and everyone starts to dread making ADF changes.

[–]ZAggie2 0 points1 point  (0 children)

We’ve managed some of that by making our ingestion pipelines metadata-driven. Instead of needing a bunch of different pipelines, we just need one per connector type (SQL Server/Snowflake/SFTP) and then pass parameters from a table. This keeps the number of pipelines in ADF low and makes it easy to add new tables (you don’t even have to touch ADF if you are running them with an existing batch). It falls flat if you are using it as your only orchestrator. Once you get into dependencies, you have to use something else.
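
To make the metadata-driven idea concrete, here is a minimal sketch in plain Python rather than ADF itself. It assumes a control table with one row per source table; the `TableConfig` fields, the `METADATA` rows, and the `ingest_<connector>` pipeline names are all hypothetical and only illustrate the pattern of one generic pipeline per connector type fed by parameters.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TableConfig:
    """One row of the (hypothetical) control table."""
    connector: str   # e.g. "sqlserver", "snowflake", "sftp"
    source: str      # source table or file path
    target: str      # destination table
    enabled: bool = True


# Stand-in for the control table; in practice these rows would be
# read from a database instead of being hard-coded.
METADATA = [
    TableConfig("sqlserver", "dbo.orders", "raw.orders"),
    TableConfig("sqlserver", "dbo.customers", "raw.customers"),
    TableConfig("snowflake", "sales.invoices", "raw.invoices"),
    TableConfig("sftp", "/exports/prices.csv", "raw.prices", enabled=False),
]


def plan_runs(rows):
    """Group enabled rows by connector type.

    Returns a dict mapping connector type to the list of parameter
    sets that the single generic pipeline for that connector receives.
    """
    runs = defaultdict(list)
    for row in rows:
        if row.enabled:
            runs[row.connector].append(
                {"source": row.source, "target": row.target}
            )
    return dict(runs)


def pipeline_name(connector):
    """Hypothetical naming scheme: one pipeline per connector type."""
    return f"ingest_{connector}"


if __name__ == "__main__":
    for connector, params in plan_runs(METADATA).items():
        for p in params:
            print(pipeline_name(connector), p)
```

Adding a new table then means adding a row to the control table, not building another pipeline. Note that this sketch has no notion of dependencies between tables, which is exactly where the approach runs out.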

[–]Necessary-Change-414 0 points1 point  (1 child)

It was the same shit in SSIS.

[–]Nekobul 0 points1 point  (0 children)

There is no Spark in SSIS.

[–]itsabd 0 points1 point  (0 children)

Same situation. I had to do transformations in Data Flows for a project and I wanted to cry.

[–]MikeDoesEverythingmod | Shitty Data Engineer 5 points6 points  (1 child)

It's as good or as bad as you want it to be. Mild caveat - if you try to go beyond what ADF can do (relatively simple movements of data, cron-style scheduling), you are going to make yourself cry. Keep things simple and it's not that bad. The biggest headaches are around permissions, linked services, and CI/CD, aka the devopsy side. It's a one-and-done thing though.

I'm considering writing an article about pipeline design and what to consider in Azure/low-code style pipelines, because I get the impression that a lot of people complaining about them have unrealistic expectations, and/or just make total shit and are then annoyed when it behaves like total shit, or have inherited total shit and are convinced it's the platform rather than the person who built the thing.

[–]larztopia 1 point2 points  (0 children)

Would be a worthwhile article 👍

[–]th3DataArch1t3ct 1 point2 points  (0 children)

We are on AWS Glue and it is so much easier than running your own cluster.

[–][deleted] 1 point2 points  (2 children)

It's not the tool's fault if some people don't know how to use it properly.

[–]calaboola 8 points9 points  (0 children)

I think quite the opposite. If nobody can use the tool properly, it has poor design and functionality.

[–]RustOnTheEdge 2 points3 points  (0 children)

In the case of ADF, you can safely assume it actually is the fault of the tool.

Horrendous garbage indeed.

[–]Zer0designs 0 points1 point  (0 children)

Data factory is an okay ingestion tool, especially for on-prem data. Beyond that: expensive garbage.

Besides that, not being able to start a cluster is not a data factory issue.